After years of advancements in English-centric CLIP development, MetaCLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:
- large-scale non-English data curation pipelines are largely undeveloped;
- the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.
With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.
- 07/29/2025: 🔥 MetaCLIP 2 (worldwide) is released.
- 12/10/2024: 🔥 MetaCLIPv1.2 (ViT-H/14) trained with Altogether synthetic captions is released.
- 10/09/2024: 🔥 Altogether: Image Captioning via Re-aligning Alt-text (aka MetaCLIPv1.2) is accepted by EMNLP 2024 with code released.
- 08/15/2024: v0.1 released.
- 04/25/2024: 🔥 paper MoDE: CLIP Data Experts via Clustering is accepted by CVPR 2024 with code released.
- 01/18/2024: 🔥 add code for building metadata.
- 01/16/2024: 🔥 paper accepted by ICLR as spotlight presentation.
- 12/25/2023: Huggingface Space demo and Colab released.
- 12/21/2023: MetaCLIPv1.1 (ViT-G/14) released.
- 09/28/2023: initial release.
The pre-trained MetaCLIP models are available in:
mini_clip (this repo)
```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer

model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenizer = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
Huggingface (MetaCLIP 1 only)
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # image-text similarity scores
text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```
MetaCLIP closely adheres to the OpenAI CLIP training and model setup (you mostly just need to replace the weights), to promote rigorous ablation studies and advance scientific understanding, as in the old "era of ImageNet".
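For example, upstream OpenCLIP ships MetaCLIP 1 weights under its own pretrained tags, so they drop into an OpenAI-CLIP-style workflow; a minimal sketch (the tag names below are an assumption here and may differ by version, so check `open_clip.list_pretrained()` in your install):

```python
import open_clip

# List the MetaCLIP tags known to your open_clip install (tag names may vary by version).
print([pair for pair in open_clip.list_pretrained() if "metaclip" in pair[1]])

# Load one of them exactly as you would an OpenAI CLIP checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="metaclip_400m")
tokenizer = open_clip.get_tokenizer("ViT-B-32-quickgelu")
```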
MetaCLIP 2
| model_name | pretrained | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|---|---|---|---|---|---|
| ViT-H-14-quickgelu-worldwide | metaclip2_worldwide | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide | metaclip2_worldwide | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide | metaclip2_worldwide | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide | metaclip2_worldwide | Online Curation | 29B | 378 | 62.0 |
(WIP) MetaCLIP 2: distilled smaller models and tokenizers.
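The worldwide checkpoints in the table above plug into the mini_clip quickstart shown earlier. Below is a minimal zero-shot sketch for the 378px ViT-bigG-14 model, assuming the other worldwide checkpoints follow the same `<model_name>@WorldWideCLIP` naming pattern as the quickstart example:

```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer

# Assumption: the 378px checkpoint uses the same naming pattern as the quickstart example.
model, _, preprocess = create_model_and_transforms(
    'ViT-bigG-14-378-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenizer = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
# Multilingual prompts: "a diagram" in English, Spanish, and Chinese.
text = tokenizer(["a diagram", "un diagrama", "一张图表"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    print((100.0 * image_features @ text_features.T).softmax(dim=-1))
```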
MetaCLIP 1
| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | 128 x V100 | 76.2 |
| ViT-B-32-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 67.6 |
| ViT-B-16-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 72.1 |
| ViT-L-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 128 x V100 | 79.2 |
| ViT-H-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 80.5 |
| ViT-bigG-14-quickgelu (v1.1) | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 82.1 |
| ViT-H-14 (v1.2) | metaclip_v1_2_altogether | Online Curation | 35B | 224 | 256 x H100 | 82.0 |
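The MetaCLIP 1 rows map to the same mini_clip factory call; a minimal sketch, assuming the MetaCLIP 1 checkpoints are registered under the plain model names in the table (without the `@WorldWideCLIP` suffix):

```python
from src.mini_clip.factory import create_model_and_transforms

# Assumption: the (model_name, pretrained) pairs from the table are accepted as-is.
model, _, preprocess = create_model_and_transforms(
    'ViT-H-14-quickgelu', pretrained='metaclip_2_5b')
```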
This code is customized from OpenCLIP and will be maintained separately for research on MetaCLIP. The following command should install the requirements for OpenCLIP and the submitit=1.2.1 dependency used by this repo:
```bash
conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
    -c pytorch-nightly \
    -c nvidia \
    -c conda-forge \
    -c anaconda
```
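A quick post-install sanity check (a sketch; the imports mirror the packages listed in the conda command above):

```python
# Verify that the core dependencies resolve and that PyTorch sees the GPU.
import braceexpand, ftfy, pandas, regex, submitit, torch, torchvision, tqdm

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```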
See MetaCLIP 2 and MetaCLIP 1.
If you have any questions related to the code or the paper, feel free to email Hu Xu (huxu@meta.com).
Please cite the following paper if MetaCLIP helps your work:
```bibtex
@inproceedings{Chuang2025metaclip2,
title={MetaCLIP 2: A Worldwide Scaling Recipe},
author={Yung-Sung Chuang and Yang Li and Dong Wang and Ching-Feng Yeh and Kehan Lyu and Ramya Raghavendra and James Glass and Lifei Huang and Jason Weston and Luke Zettlemoyer and Xinlei Chen and Zhuang Liu and Saining Xie and Wen-tau Yih and Shang-Wen Li and Hu Xu},
journal={arXiv preprint arXiv:xxxx.xxxxx},
year={2025}
}
@inproceedings{xu2023metaclip,
title={Demystifying CLIP Data},
author={Hu Xu and Saining Xie and Xiaoqing Ellen Tan and Po-Yao Huang and Russell Howes and Vasu Sharma and Shang-Wen Li and Gargi Ghosh and Luke Zettlemoyer and Christoph Feichtenhofer},
journal={arXiv preprint arXiv:2309.16671},
year={2023}
}
@inproceedings{xu2024altogether,
title={Altogether: Image Captioning via Re-aligning Alt-text},
author={Hu Xu and Po-Yao Huang and Xiaoqing Ellen Tan and Ching-Feng Yeh and Jacob Kahn and Christine Jou and Gargi Ghosh and Omer Levy and Luke Zettlemoyer and Wen-tau Yih and Shang-Wen Li and Saining Xie and Christoph Feichtenhofer},
journal={arXiv preprint arXiv:2410.17251},
year={2024}
}
@inproceedings{ma2024mode,
title={MoDE: CLIP Data Experts via Clustering},
author={Jiawei Ma and Po-Yao Huang and Saining Xie and Shang-Wen Li and Luke Zettlemoyer and Shih-Fu Chang and Wen-Tau Yih and Hu Xu},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
year={2024}
}
```
The training code is developed on top of OpenCLIP and modified to the vanilla CLIP training setup.
- pip installation of metaclip package;
- refactor mini_clip with apps for MoDE, altogether.
The majority of MetaCLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under the license found at https://github.com/mlfoundations/open_clip.
We gratefully acknowledge the OpenCLIP team for the initial CLIP codebase and NielsRogge for integrating MetaCLIP into Huggingface.