Jun. 17, 2025
🔥 We have released the checkpoints of our fine-tuned model.
Apr. 13, 2024
We released the SPEC dataset and the code for evaluation, sorry for the delay ☺️.
Feb. 28, 2024
Our work has been accepted by CVPR 2024 🎉.
To evaluate the understanding capability of vision-language models on fine-grained concepts, we propose a new benchmark, SPEC, which consists of six distinct subsets distributed across the dimensions of Size, Position, Existence, and Count. Each test case consists of an image candidate set, whose members differ only in a certain visual concept, and a text candidate set, whose members differ only in the corresponding language concept.
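For intuition, a single test case from the Position dimension can be pictured as a pair of candidate sets like the sketch below. This is illustrative only: the field names are hypothetical and do not reflect the actual data format, which you can inspect after downloading the data.

# Illustrative sketch only: hypothetical field names, not the real SPEC data format.
position_test_case = {
    # images that differ only in one visual concept (here, the relative position)
    "image_candidates": ["left.png", "right.png", "above.png", "below.png"],
    # texts that differ only in the corresponding language concept
    "text_candidates": [
        "the broccoli is positioned on the left of the backpack.",
        "the broccoli is situated to the right of the backpack.",
        "the broccoli is situated above the backpack.",
        "the broccoli is placed beneath the backpack.",
    ],
}
# A model should match every image to its correct caption and vice versa.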
git clone https://github.com/wjpoom/SPEC.git
cd SPEC/
pip install -e .
- Run the following code in a Python shell, replacing `/path/to/save/data` with the directory where you want to store the data.
import os
import zipfile

from huggingface_hub import hf_hub_download

data_root = '/path/to/save/data'

# download the zipped SPEC data from the Hugging Face Hub
hf_hub_download(repo_id='wjpoom/SPEC', repo_type='dataset', filename='data.zip', local_dir=data_root)

# unzip the archive, then remove it
with zipfile.ZipFile(os.path.join(data_root, 'data.zip'), 'r') as zip_ref:
    zip_ref.extractall(data_root)
os.remove(os.path.join(data_root, 'data.zip'))
- We provide a 📓notebook that enables you to visually explore the test samples in the SPEC dataset.
- Run this notebook either locally or online using Colab.
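If you only want a quick look outside the notebook, the minimal sketch below inspects whatever was extracted without assuming a particular directory layout (`data_root` is the placeholder path from the download step):

import json
import os
from glob import glob

from PIL import Image

data_root = '/path/to/save/data'  # same directory used when downloading the data

# list annotation files and images, whatever the extracted layout looks like
json_files = glob(os.path.join(data_root, '**', '*.json'), recursive=True)
image_files = []
for ext in ('*.png', '*.jpg', '*.jpeg'):
    image_files += glob(os.path.join(data_root, '**', ext), recursive=True)
print(f'found {len(json_files)} annotation files and {len(image_files)} images')

# peek at the first annotation file to see how the samples are organized
if json_files:
    with open(json_files[0]) as f:
        samples = json.load(f)
    print(json_files[0], '->', type(samples).__name__)

# open one image to confirm the files decode correctly
if image_files:
    Image.open(image_files[0]).verify()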
- In our paper, we evaluated four popular VLMs using our SPEC dataset, namely CLIP, BLIP, FLAVA, and CoCa.
- To reproduce the results with these VLMs, you can run this script.
- You can also reproduce with this local notebook or the online Colab notebook.
- If you want to evaluate your custom model on SPEC, you can follow the instructions in this document; a rough, hypothetical sketch of what such an integration might look like is shown below.
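The sketch below is only a rough orientation, not the interface the evaluation code actually requires: it assumes a custom model can be reduced to an image encoder and a text encoder that produce comparable embeddings, and all class and method names here are hypothetical.

import torch


class CustomModelWrapper:
    """Hypothetical wrapper: adapts an arbitrary model to image/text embedding calls."""

    def __init__(self, model, preprocess, tokenizer, device='cuda'):
        self.model = model.eval().to(device)
        self.preprocess = preprocess  # maps a PIL image to a tensor
        self.tokenizer = tokenizer    # maps a list of strings to a token tensor
        self.device = device

    @torch.no_grad()
    def encode_images(self, pil_images):
        batch = torch.stack([self.preprocess(im) for im in pil_images]).to(self.device)
        feats = self.model.encode_image(batch)
        return feats / feats.norm(dim=-1, keepdim=True)

    @torch.no_grad()
    def encode_texts(self, captions):
        tokens = self.tokenizer(captions).to(self.device)
        feats = self.model.encode_text(tokens)
        return feats / feats.norm(dim=-1, keepdim=True)

# With both encoders in place, SPEC scores a model by checking whether the
# matching image-text pair gets the highest cosine similarity among the
# candidates, in both the image-to-text and text-to-image directions.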
pip install open_clip_torch
mkdir checkpoints
huggingface-cli download wjpoom/SPEC-CLIP-ViT-B-32 --local-dir checkpoints/SPEC-CLIP-ViT-B-32
import torch
from PIL import Image
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='checkpoints/SPEC-CLIP-ViT-B-32', load_weights_only=False)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')
# load and preprocess the example image, and tokenize the candidate captions
image = preprocess(Image.open("assets/image.png")).unsqueeze(0)
text = tokenizer([
    "the broccoli is situated above the backpack.",
    "the broccoli is situated to the right of the backpack.",
    "the broccoli is positioned on the left of the backpack.",
    "the broccoli is placed beneath the backpack."
])

with torch.no_grad(), torch.autocast("cuda"):
    # encode and L2-normalize both modalities, then compare by cosine similarity
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
- Release the checkpoints of the fine-tuned model
- Release the test set of the SPEC benchmark
- Release the evaluation code of SPEC
Part of this repository is built upon ARO; thanks to its authors for the well-organized codebase.
Feel free to contact us if you have any questions or suggestions:
Email (Wujian Peng): wjpeng24@m.fudan.edu.cn
If you use the code or data in this repo, or find our work helpful, please consider citing our paper:
@inproceedings{peng2024synthesize,
title={Synthesize diagnose and optimize: Towards fine-grained vision-language understanding},
author={Peng, Wujian and Xie, Sicheng and You, Zuyao and Lan, Shiyi and Wu, Zuxuan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13279--13288},
year={2024}
}