[ArXiv Paper](https://arxiv.org/abs/2408.04810)
Getting Started • Usage • Benchmarks & Models • Credit & Citation
This repository simplifies the evaluation of vision-language models (VLMs). It provides a comprehensive set of tools and scripts for evaluating VLMs across a wide range of benchmarks. We offer 60+ VLMs, including recent large-scale models such as EVA-CLIP, with scales reaching up to 4.3B parameters and 12.8B training samples, along with implementations of 40+ evaluation benchmarks.
For the latest news and updates, see the snippet below.
- Removed FaceNet from required libraries.
- Added SigLIP2 models
- Added bivlc benchmark
- Created benchmark_builder for future benchmark implementations
- Added News & Updates section in README
- Fixed Sun397 benchmark
For full details, refer to the UPDATES.md file.
UniBench also supports large vision-language models (L-VLMs) such as PaliGemma and LLaVA-NeXT.
[Option 1] Install the package from PyPI:
pip install unibench -U
[Option 2] Install from source
- Install the necessary dependencies in one of two ways:
- Option 1, creating a new conda env:
conda env create -f environment.yml
- Option 2, updating your conda env with required libraries:
conda env update --file environment.yml --prune
- Activate the environment:
conda activate unibench
- Install the spaCy English language model:
python -m spacy download en_core_web_sm
- Install the package:
pip install git+https://github.com/facebookresearch/unibench
The following command will print the results of the evaluations on all benchmarks and models:
unibench show_results
The following command will run the evaluation on all benchmarks and models:
unibench evaluate
Alternatively, you can run the same evaluation on all benchmarks and models from Python:
import unibench as vlm
evaluator = vlm.Evaluator()
evaluator.evaluate()
The `evaluate` function takes the following arguments:
Args:
save_freq (int): The frequency at which to save results. Defaults to 1000.
face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.
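For example, here is a minimal sketch that overrides these defaults from Python (the values below are purely illustrative, not recommendations):
import unibench as vlm

evaluator = vlm.Evaluator()
evaluator.evaluate(
    save_freq=500,       # save intermediate results more often than the default of 1000
    face_blur=False,     # leave face blurring disabled
    device="cuda",       # or "cpu" if no GPU is available
    batch_per_gpu=64,    # per-GPU evaluation batch size
)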
The `Evaluator` class takes the following arguments:
Args:
seed (int): Random seed for reproducibility.
num_workers (int): Number of workers for data loading.
models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
model_id (Union[int, None]): Specific model ID to evaluate.
benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
output_dir (str): Directory to save evaluation results.
benchmarks_dir (str): Directory containing benchmark data.
download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
download_all_precomputed (bool): Whether to download all precomputed results.
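Similarly, here is a minimal sketch of constraining a run through the `Evaluator` constructor; the model and benchmark names are taken from the example below, and you can use the list commands further down to see all valid names:
from unibench import Evaluator

evaluator = Evaluator(
    seed=1337,
    num_workers=8,
    models=["clip_resnet50"],          # or "all" for every available model
    benchmarks=["pcam", "imageneta"],  # or "all" for every available benchmark
    download_aggregate_precomputed=True,
)
evaluator.evaluate()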
The following command will evaluate openclip_vitB32 trained on MetaCLIP-400M and CLIP ResNet-50 on the vg_relation, clevr_distance, pcam, and imageneta benchmarks:
unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,pcam,imageneta]
In addition to saving the results in `~/.cache/unibench`, the output is a summary of the evaluation results:
model_name non-natural images reasoning relation robustness
────────────────────────────────────────────────────────────────────────────────────────
clip_resnet50 63.95 14.89 54.13 23.27
openclip_vitB32_metaclip_400m 63.87 19.46 51.54 28.71
The full list of models and benchmarks is available in the models_zoo and benchmarks_zoo. You can also run the following commands:
unibench list_models
# or
unibench list_benchmarks
Model ID | Dataset Size (Million) | Number of Parameters (Million) | Learning Objective | Architecture | Model Name |
---|---|---|---|---|---|
blip_vitB16_14m | 14 | 86 | BLIP | vit | BLIP ViT B 16 |
blip_vitL16_129m | 129 | 307 | BLIP | vit | BLIP ViT L 16 |
blip_vitB16_129m | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
blip_vitB16_coco | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
blip_vitB16_flickr | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
benchmark | task | benchmark type |
---|---|---|
clevr_distance | zero-shot | vtab |
fgvc_aircraft | zero-shot | transfer |
objectnet | zero-shot | robustness |
winoground | relation | relation |
imagenetc | zero-shot | corruption |
benchmark type | number of benchmarks |
---|---|
ImageNet | 1 |
vtab | 18 |
transfer | 7 |
robustness | 6 |
relation | 6 |
corruption | 1 |
For each model, the results are saved in the output directory defined in the library's constants: `~/.cache/unibench/outputs`.
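If you want to inspect the raw outputs yourself, here is a minimal sketch that lists whatever the evaluator has written there; the exact directory layout and file format are not documented above, so treat this as a starting point:
from pathlib import Path

output_dir = Path.home() / ".cache" / "unibench" / "outputs"
# Print every file the evaluator has written, relative to the output directory.
for path in sorted(output_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(output_dir))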
To add a new benchmark, you can simply inherit from the `torch.utils.data.Dataset` class and implement the `__getitem__` and `__len__` methods. For example, here is how to add FashionMNIST as a new benchmark:
from functools import partial
from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler
from torchvision.datasets import FashionMNIST
class_names = [
"T-shirt/top",
"Trouser",
"Pullover",
"Dress",
"Coat",
"Sandal",
"Shirt",
"Sneaker",
"Bag",
"Ankle boot",
]
templates = ["an image of {}"]
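# Note: adjust the root argument below to a local directory where torchvision can download FashionMNIST.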
benchmark = partial(
FashionMNIST, root="/fsx-robust/haideraltahan", train=False, download=True
)
handler = partial(
ZeroShotBenchmarkHandler,
benchmark_name="fashion_mnist_new",
classes=class_names,
templates=templates,
)
eval = Evaluator()
eval.add_benchmark(
benchmark,
handler,
meta_data={
"benchmark_type": "object recognition",
},
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.update_model_list(["blip_vitB16_129m"])
eval.evaluate()
The most important component of adding a new model is creating (or reusing) a subclass of `AbstractModel` and implementing `compute_zeroshot_weights`, `get_image_embeddings`, and `get_text_embeddings`, similar to how `ClipModel` works:
import inspect

import torch

# AbstractModel is provided by unibench's model wrappers
# (ClipModel itself lives in unibench.models_zoo.wrappers.clip).


class ClipModel(AbstractModel):
    def __init__(
        self,
        model,
        model_name,
        **kwargs,
    ):
        super(ClipModel, self).__init__(model, model_name, **kwargs)

    def compute_zeroshot_weights(self):
        # Build one text embedding per class by averaging over all prompt templates.
        zeroshot_weights = []
        for class_name in self.classes:
            texts = [template.format(class_name) for template in self.templates]
            class_embedding = self.get_text_embeddings(texts)
            class_embedding = class_embedding.mean(dim=0)
            class_embedding /= class_embedding.norm(dim=-1, keepdim=True)
            zeroshot_weights.append(class_embedding)
        self.zeroshot_weights = torch.stack(zeroshot_weights).T

    @torch.no_grad()
    def get_image_embeddings(self, images):
        # Encode and L2-normalize image features.
        image_features = self.model.encode_image(images.to(self.device))
        image_features /= image_features.norm(dim=1, keepdim=True)
        return image_features.unsqueeze(1)

    @torch.no_grad()
    def get_text_embeddings(self, captions):
        # Some tokenizers accept a `truncate` argument; pass it only when supported.
        if (
            "truncate" in inspect.getfullargspec(self.tokenizer.__call__)[0]
            or "truncate" in inspect.getfullargspec(self.tokenizer)[0]
        ):
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length, truncate=True
            ).to(self.device)
        else:
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length
            ).to(self.device)
        # Encode and L2-normalize caption features.
        caption_embeddings = self.model.encode_text(caption_tokens)
        caption_embeddings /= caption_embeddings.norm(dim=-1, keepdim=True)
        return caption_embeddings
Using the `ClipModel` wrapper above, we can then add new models to the list of available models. Here is an example of adding and evaluating `ViTamin-L`:
from functools import partial
from unibench import Evaluator
from unibench.models_zoo.wrappers.clip import ClipModel
import open_clip
model, _, _ = open_clip.create_model_and_transforms(
"ViTamin-L", pretrained="datacomp1b"
)
tokenizer = open_clip.get_tokenizer("ViTamin-L")
model = partial(
ClipModel,
model=model,
model_name="vitamin_l_comp1b",
tokenizer=tokenizer,
input_resolution=model.visual.image_size[0],
logit_scale=model.logit_scale,
)
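# Note: benchmarks_dir below is the author's cluster path; point it at your own unibench data directory.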
eval = Evaluator(benchmarks_dir="/fsx-checkpoints/haideraltahan/.cache/unibench/data")
eval.add_model(model=model)
eval.update_benchmark_list(["imagenet1k"])
eval.update_model_list(["vitamin_l_comp1b"])
eval.evaluate()
Contributions (e.g. adding new benchmarks/models), issues, and feature requests are welcome! For any changes, please open an issue first to discuss what you would like to change or improve.
When contributing please ensure tests are passing:
# if need be, pip install pytest
python -m pytest tests/
The majority of UniBench is licensed under CC-BY-NC; however, portions of the project are available under separate license terms:
License | Libraries |
---|---|
MIT license | zipp, tabulate, rich, openai-clip, latextable, gdown |
Apache 2.0 license | transformers, timm, opencv-python, open-clip-torch, ftfy, fire, debtcollector, datasets, oslo.concurrency |
BSD license | torchvision, torch, seaborn, scipy, scikit-learn, fairscale, cycler, contourpy, click, GitPython |
If you use this repository in your research, please cite it as follows:
@inproceedings{altahan2024unibenchvisualreasoningrequires,
title={UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling},
author={Haider Al-Tahan and Quentin Garrido and Randall Balestriero and Diane Bouchacourt and Caner Hazirbas and Mark Ibrahim},
year={2024},
eprint={2408.04810},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.04810},
}
The library structure was inspired by Robert Geirhos's work: https://github.com/bethgelab/model-vs-human