After years of advancements in English-centric CLIP development, MetaCLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:
- large-scale non-English data curation pipelines are largely undeveloped;
- the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.
With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.
- 07/29/2025: 🔥 MetaCLIP 2 (worldwide) is released.
- 12/10/2024: 🔥 MetaCLIPv1.2 (ViT-H/14) trained with Altogether synthetic captions is released.
- 10/09/2024: 🔥 Altogether: Image Captioning via Re-aligning Alt-text (aka MetaCLIPv1.2) is accepted by EMNLP 2024 with code released.
- 08/15/2024: v0.1 released.
- 04/25/2024: 🔥 paper MoDE: CLIP Data Experts via Clustering is accepted by CVPR 2024 with code released.
- 01/18/2024: 🔥 add code for building metadata.
- 01/16/2024: 🔥 paper accepted by ICLR as spotlight presentation.
- 12/25/2023: Huggingface Space demo and Colab released.
- 12/21/2023: MetaCLIPv1.1 (ViT-G/14) released.
- 09/28/2023: initial release.
The pre-trained MetaCLIP models are available in:
mini_clip (this repo)
```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer

model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenizer = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
Huggingface (MetaCLIP 1 only)
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # image-text similarity scores
text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```
MetaCLIP closely adheres to the OpenAI CLIP training and model setup (you mostly just need to replace the weights), to promote rigorous ablation studies and advance scientific understanding, as in the old "era of ImageNet".
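For example, upstream OpenCLIP ships MetaCLIP 1 weights under its own pretrained tags, so they drop into an OpenAI-CLIP-style workflow; a minimal sketch (the tag names below are an assumption here and may differ by version, so check `open_clip.list_pretrained()` in your install):

```python
import open_clip

# List the MetaCLIP tags known to your open_clip install (tag names may vary by version).
print([pair for pair in open_clip.list_pretrained() if "metaclip" in pair[1]])

# Load one of them exactly as you would an OpenAI CLIP checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="metaclip_400m")
tokenizer = open_clip.get_tokenizer("ViT-B-32-quickgelu")
```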
MetaCLIP 2
| model_name | pretrained | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|---|---|---|---|---|---|
| ViT-H-14-quickgelu-worldwide | metaclip2_worldwide | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide | metaclip2_worldwide | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide | metaclip2_worldwide | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide | metaclip2_worldwide | Online Curation | 29B | 378 | 62.0 |
(WIP) MetaCLIP 2: distilled smaller models and tokenizers.
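The worldwide checkpoints in the table above plug into the mini_clip quickstart shown earlier. Below is a minimal zero-shot sketch for the 378px ViT-bigG-14 model, assuming the other worldwide checkpoints follow the same `<model_name>@WorldWideCLIP` naming pattern as the quickstart example:

```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer

# Assumption: the 378px checkpoint uses the same naming pattern as the quickstart example.
model, _, preprocess = create_model_and_transforms(
    'ViT-bigG-14-378-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenizer = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
# Multilingual prompts: "a diagram" in English, Spanish, and Chinese.
text = tokenizer(["a diagram", "un diagrama", "一张图表"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    print((100.0 * image_features @ text_features.T).softmax(dim=-1))
```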
MetaCLIP 1
| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | 128 x V100 | 76.2 |
| ViT-B-32-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 67.6 |
| ViT-B-16-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 72.1 |
| ViT-L-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 128 x V100 | 79.2 |
| ViT-H-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 80.5 |
| ViT-bigG-14-quickgelu (v1.1) | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 82.1 |
| ViT-H-14 (v1.2) | metaclip_v1_2_altogether | Online Curation | 35B | 224 | 256 x H100 | 82.0 |
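The MetaCLIP 1 rows map to the same mini_clip factory call; a minimal sketch, assuming the MetaCLIP 1 checkpoints are registered under the plain model names in the table (without the `@WorldWideCLIP` suffix):

```python
from src.mini_clip.factory import create_model_and_transforms

# Assumption: the (model_name, pretrained) pairs from the table are accepted as-is.
model, _, preprocess = create_model_and_transforms(
    'ViT-H-14-quickgelu', pretrained='metaclip_2_5b')
```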
This code is customized from OpenCLIP and will be maintained separately for research on MetaCLIP. The following command should install the requirements for OpenCLIP and the submitit=1.2.1 dependency used by this repo:
```bash
conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
    -c pytorch-nightly \
    -c nvidia \
    -c conda-forge \
    -c anaconda
```
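A quick post-install sanity check (a sketch; the imports mirror the packages listed in the conda command above):

```python
# Verify that the core dependencies resolve and that PyTorch sees the GPU.
import braceexpand, ftfy, pandas, regex, submitit, torch, torchvision, tqdm

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```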
See MetaCLIP 2 and MetaCLIP 1.
If you have any questions related to the code or the paper, feel free to email Hu Xu (huxu@meta.com).
Please cite the following paper if MetaCLIP helps your work:
```bibtex
@inproceedings{Chuang2025metaclip2,
title={MetaCLIP 2: A Worldwide Scaling Recipe},
author={Yung-Sung Chuang and Yang Li and Dong Wang and Ching-Feng Yeh and Kehan Lyu and Ramya Raghavendra and James Glass and Lifei Huang and Jason Weston and Luke Zettlemoyer and Xinlei Chen and Zhuang Liu and Saining Xie and Wen-tau Yih and Shang-Wen Li and Hu Xu},
journal={arXiv preprint arXiv:xxxx.xxxxx},
year={2025}
}
@inproceedings{xu2023metaclip,
title={Demystifying CLIP Data},
author={Hu Xu and Saining Xie and Xiaoqing Ellen Tan and Po-Yao Huang and Russell Howes and Vasu Sharma and Shang-Wen Li and Gargi Ghosh and Luke Zettlemoyer and Christoph Feichtenhofer},
journal={arXiv preprint arXiv:2309.16671},
year={2023}
}
@inproceedings{xu2024altogether,
title={Altogether: Image Captioning via Re-aligning Alt-text},
author={Hu Xu and Po-Yao Huang and Xiaoqing Ellen Tan and Ching-Feng Yeh and Jacob Kahn and Christine Jou and Gargi Ghosh and Omer Levy and Luke Zettlemoyer and Wen-tau Yih and Shang-Wen Li and Saining Xie and Christoph Feichtenhofer},
journal={arXiv preprint arXiv:2410.17251},
year={2024}
}
@inproceedings{ma2024mode,
title={MoDE: CLIP Data Experts via Clustering},
author={Jiawei Ma and Po-Yao Huang and Saining Xie and Shang-Wen Li and Luke Zettlemoyer and Shih-Fu Chang and Wen-Tau Yih and Hu Xu},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
year={2024}
}
```
The training code is developed on top of OpenCLIP and modified to the vanilla CLIP training setup.
- pip installation of metaclip package;
- refactor mini_clip with apps for MoDE, altogether.
The majority of MetaCLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under the license found at https://github.com/mlfoundations/open_clip.
We gratefully acknowledge the OpenCLIP team for the initial CLIP codebase and NielsRogge for integrating MetaCLIP into Huggingface.