Skip to content

facebookresearch/unibench

Repository files navigation

header
[Arxiv link]

Getting StartedUsageBenchmarks & ModelsCredit & Citation

Vision-Language Model Evaluation Repository

This repository is designed to simplify the evaluation process of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLM models and benchmarks. We offer 60+ VLMs, inclusive of recent large-scale models like EVACLIP, with scales reaching up to 4.3B parameters and 12.8B training samples. Additionally, we provide implementations for 40+ evaluation benchmarks.

News and Updates

For the latest news and updates, see the snippet below.

April 15, 2025 - v0.4.0

  • Removed FaceNet from required libraries.
  • Added SigLIP2 models
  • Added bivlc benchmark
  • Created benchmark_builder for future benchmark implementations
  • Added News & Updates section in README
  • Fixed Sun397 benchmark

For full details, refer to the UPDATES.md file.

Coming Soon

  • L-VLM (e.g. PaliGemma, LlavaNext)

Getting Started

Install the package:

pip install unibench -U
[option 2] Install Dependencies
  1. Install the necessary dependencies by:
    • Option 1, creating a new conda env: conda env create -f environment.yml
    • Option 2, updating your conda env with required libraries: conda env update --file environment.yml --prune
  2. Activate the environment: conda activate unibench
  3. Install Spacy english language model: python -m spacy download en_core_web_sm
  4. Install the package: pip install git+https://github.com/facebookresearch/unibench

Usage

Print out Results from Evaluated Models

The following command will print the results of the evaluations on all benchmarks and models:

unibench show_results

Run Evaluation using Command Line

The following command will run the evaluation on all benchmarks and models:

unibench evaluate

Run Evaluation using Custom Script

The following command will run the evaluation on all benchmarks and models:

import unibench as vlm

evaluator = vlm.Evaluator()
evaluator.evaluate()

Arguments for Evaluation

evaluate function takes the following arguments:

Args:
    save_freq (int): The frequency at which to save results. Defaults to 1000.
    face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
    device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
    batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.

The Evaluator class takes the following arguments:

Args:
    seed (int): Random seed for reproducibility.
    num_workers (int): Number of workers for data loading.
    models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
    benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
    model_id (Union[int, None]): Specific model ID to evaluate.
    benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
    output_dir (str): Directory to save evaluation results.
    benchmarks_dir (str): Directory containing benchmark data.
    download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
    download_all_precomputed (bool): Whether to download all precomputed results.

Example

The following command will run the evaluation for openclip_vitB32 trained on metaclip400m and CLIP ResNet50 on vg_relation,clevr_distance,pcam,imageneta benchmarks:

unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,pcam,imageneta]

In addition to saving the results in ~/.cache/unibench, the output would be a summary of the evaluation results:

  model_name                      non-natural images   reasoning   relation   robustness  
 ──────────────────────────────────────────────────────────────────────────────────────── 
  clip_resnet50                   63.95                 14.89       54.13      23.27       
  openclip_vitB32_metaclip_400m   63.87                 19.46       51.54      28.71   

Supported Models and benchmarks

Full list of models and benchmarks are available in the models_zoo and benchmarks_zoo. You are also able to run the following commands:

unibench list_models
# or
unibench list_benchmarks

Sample Models

Dataset Size (Million) Number of Parameters (Million) Learning Objective Architecture Model Name
blip_vitB16_14m 14 86 BLIP vit BLIP ViT B 16
blip_vitL16_129m 129 307 BLIP vit BLIP ViT L 16
blip_vitB16_129m 129 86 BLIP vit BLIP ViT B 16
blip_vitB16_coco 129 86 BLIP vit BLIP ViT B 16
blip_vitB16_flickr 129 86 BLIP vit BLIP ViT B 16

Sample benchmarks

benchmark benchmark_type
clevr_distance zero-shot vtab
fgvc_aircraft zero-shot transfer
objectnet zero-shot robustness
winoground relation relation
imagenetc zero-shot corruption

benchmarks Overview

benchmark type number of benchmarks
ImageNet 1
vtab 18
transfer 7
robustness 6
relation 6
corruption 1

How results are saved

For each model, the results are saved in the output directory defined in constants: ~./.cache/unibench/outputs.

Add new Benchmark

To add new benchmark, you can simply inherit from the torch.utils.data.Dataset class and implement the __getitem__, and __len__ methods. For example, here is how to add ImageNetA as a new benchmark:

from functools import partial
from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler
from torchvision.datasets import FashionMNIST

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

templates = ["an image of {}"]

benchmark = partial(
    FashionMNIST, root="/fsx-robust/haideraltahan", train=False, download=True
)
handler = partial(
    ZeroShotBenchmarkHandler,
    benchmark_name="fashion_mnist_new",
    classes=class_names,
    templates=templates,
)


eval = Evaluator()

eval.add_benchmark(
    benchmark,
    handler,
    meta_data={
        "benchmark_type": "object recognition",
    },
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.update_model_list(["blip_vitB16_129m"])
eval.evaluate()

Add new Model

The most important compontent of adding a new model is creating or using pre-existing AbstractModel and implementing compute_zeroshot_weights, get_image_embeddings, and get_text_embeddings, similar to how ClipModel works:

class ClipModel(AbstractModel):
    def __init__(
        self,
        model,
        model_name,
        **kwargs,
    ):
        super(ClipModel, self).__init__(model, model_name, **kwargs)

    def compute_zeroshot_weights(self):
        zeroshot_weights = []
        for class_name in self.classes:
            texts = [template.format(class_name) for template in self.templates]

            class_embedding = self.get_text_embeddings(texts)

            class_embedding = class_embedding.mean(dim=0)
            class_embedding /= class_embedding.norm(dim=-1, keepdim=True)

            zeroshot_weights.append(class_embedding)
        self.zeroshot_weights = torch.stack(zeroshot_weights).T

    @torch.no_grad()
    def get_image_embeddings(self, images):
        image_features = self.model.encode_image(images.to(self.device))
        image_features /= image_features.norm(dim=1, keepdim=True)
        return image_features.unsqueeze(1)

    @torch.no_grad()
    def get_text_embeddings(self, captions):
        if (
            "truncate" in inspect.getfullargspec(self.tokenizer.__call__)[0]
            or "truncate" in inspect.getfullargspec(self.tokenizer)[0]
        ):
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length, truncate=True
            ).to(self.device)
        else:
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length
            ).to(self.device)

        caption_embeddings = self.model.encode_text(caption_tokens)
        caption_embeddings /= caption_embeddings.norm(dim=-1, keepdim=True)

        return caption_embeddings

Using the following class, we can then add models to the list of models. Here we have an example of adding and evaluating ViTamin-L.

from functools import partial
from io import open_code
from unibench import Evaluator
from unibench.models_zoo.wrappers.clip import ClipModel
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViTamin-L", pretrained="datacomp1b"
)

tokenizer = open_clip.get_tokenizer("ViTamin-L")

model = partial(
    ClipModel,
    model=model,
    model_name="vitamin_l_comp1b",
    tokenizer=tokenizer,
    input_resolution=model.visual.image_size[0],
    logit_scale=model.logit_scale,
)


eval = Evaluator(benchmarks_dir="/fsx-checkpoints/haideraltahan/.cache/unibench/data")

eval.add_model(model=model)
eval.update_benchmark_list(["imagenet1k"])
eval.update_model_list(["vitamin_l_comp1b"])
eval.evaluate()

Contributing

Contributions (e.g. adding new benchmarks/models), issues, and feature requests are welcome! For any changes, please open an issue first to discuss what you would like to change or improve.

When contributing please ensure tests are passing:

# if need be, pip install pytest 
python -m pytest tests/

License

The majority of UniBench is licensed under CC-BY-NC, however portions of the project are available under separate license terms:

License Libraries
MIT license zipp, tabulate, rich, openai-clip, latextable, gdown
Apache 2.0 license transformers, timm, opencv-python, open-clip-torch, ftfy, fire, debtcollector, datasets, oslo.concurrency
BSD license torchvision, torch, seaborn, scipy, scikit-learn, fairscale, cycler, contourpy, click, GitPython

Citation

If you use this repository in your research, please cite it as follows:

@inproceedings{altahan2024unibenchvisualreasoningrequires,
      title={UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling}, 
      author={Haider Al-Tahan and Quentin Garrido and Randall Balestriero and Diane Bouchacourt and Caner Hazirbas and Mark Ibrahim},
      year={2024},
      eprint={2408.04810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04810}, 
}

Recognition

Library structure was inspired by Robert Geirhos's work https://github.com/bethgelab/model-vs-human

About

Python Library to evaluate VLM models' robustness across diverse benchmarks

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Packages

No packages published

Contributors 5