Shijie Ma1,2,
Yuying Ge1,✉,
Teng Wang1,
Yuxin Guo1,2,
Yixiao Ge1,
Ying Shan1
1ARC Lab, Tencent PCG,
2Institute of Automation, CAS
How do generative models effectively help discriminative models?
We present in-depth explorations and propose a novel two-stage post-training strategy to enhance CLIP ViT's visual representations.
Our method is applicable to both continuous and discrete denoisers and does not require any pre-trained weights.
- [2025-03-27] Training codes with continuous denoisers are released! 🔥🔥🔥
- [2025-03-26] arXiv paper is made publicly available.
- [2025-03-24] Release evaluation codes. 🔥
- [2025-03-24] Release model weights on Huggingface🤗. 🔥🔥🔥
- [2025-03-24] Release the project page of this repo.
- Release training codes of continuous denoisers.
- Release training codes of discrete denoisers.
Recent works demonstrate the feasibility of enhancing visual representations with generative models, where generative models take visual tokens as conditions and perform reconstruction. However, the underlying principle remains underexplored.
We empirically reveal that perfect generation (reconstruction) does not always yield desirable visual representations, as shown below:
In this work, we delve into three aspects to explore the critical factors: (1) conditioning mechanisms, (2) denoising configurations, and (3) generation paradigms.
We propose a two-stage post-training method to enhance CLIP ViT's fine-grained visual representations, which is efficient (with only lightweight denoisers) and versatile (applicable to both continuous and discrete denoisers). The pipeline of our method is illustrated below:
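For intuition, the sketch below illustrates the general idea of conditioning a lightweight denoiser on CLIP ViT visual tokens and using the denoising (reconstruction) loss to post-train the ViT. This is a highly simplified, hypothetical sketch: all module names, shapes, the corruption scheme, and hyper-parameters are assumptions for illustration, not the actual implementation in this repository.

```python
# Schematic sketch (PyTorch) of the core idea: CLIP ViT visual tokens condition a
# lightweight denoiser, and the denoising loss also updates the ViT.
# Everything below (architecture, shapes, noise schedule) is an illustrative assumption.
import torch
import torch.nn as nn

class LightweightDenoiser(nn.Module):
    """A small conditional denoiser; NOT the actual architecture used in this repo."""
    def __init__(self, dim=768, depth=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, cond_tokens):
        # Cross-attend from noisy targets to the ViT's visual tokens (conditions).
        return self.out(self.decoder(noisy_tokens, cond_tokens))

def post_train_step(vit, denoiser, clean_targets, images, optimizer):
    """One illustrative step: corrupt the targets, denoise them under ViT conditions."""
    cond = vit(images)  # visual tokens from the CLIP ViT, assumed shape (B, N, D)
    noise = torch.randn_like(clean_targets)
    t = torch.rand(clean_targets.size(0), 1, 1, device=clean_targets.device)
    noisy = (1 - t) * clean_targets + t * noise   # simple interpolation-style corruption
    pred = denoiser(noisy, cond)
    loss = nn.functional.mse_loss(pred, clean_targets)
    optimizer.zero_grad()
    loss.backward()   # gradients flow into the ViT as well, post-training its features
    optimizer.step()
    return loss.item()
```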
Important
We empirically find that, for visual representations, a visually perfect generative model is neither optimal nor necessary.
Our method employs only lightweight generative models and does NOT require any pre-trained weights, making it efficient and avoiding potential privacy and copyright issues.
We release the enhanced CLIP weights on Huggingface🤗.
| CLIP Backbone | MMVP-VLM (Original) | MMVP-VLM (Ours) | Link |
|---|---|---|---|
| OpenAICLIP ViT-L-14@224 | 19.3 | 31.9 | 🤗 |
| OpenAICLIP ViT-L-14@336 | 20.0 | 29.6 | 🤗 |
| MetaCLIP ViT-L-14@224 | 23.7 | 31.9 | 🤗 |
| MetaCLIP ViT-H-14@224 | 25.2 | 37.0 | 🤗 |
| SigLIP ViT-SO-14@224 | 37.8 | 42.2 | 🤗 |
| SigLIP ViT-SO-14@384 | 37.0 | 40.0 | 🤗 |
Please refer to the corresponding directories for more details.
For the continuous denoiser, navigate into Continuous.
For the discrete denoiser, navigate into Discrete.
Please first download the MMVP-VLM benchmark.
We provide evaluation scripts for six CLIP backbones. An example for OpenAICLIP@224 is shown below:
python evaluation/evaluate_mmvp_OpenAICLIP_224.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'
Note
Please specify `--vision_tower_name` as your trained CLIP model, which is conventionally saved via `save_pretrained()`.
If you want to evaluate an official CLIP model such as OpenAICLIP@224, you could specify `--vision_tower_name` as the official `hf_repo_id`, e.g., `openai/clip-vit-large-patch14`.
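For reference, a vision tower saved via `save_pretrained()` (or an official CLIP checkpoint) can typically be loaded with the standard `transformers` API. The snippet below is a minimal sketch for the OpenAICLIP backbones; the local path is a placeholder, and the SigLIP backbones would use the corresponding `Siglip*` classes instead.

```python
from transformers import CLIPVisionModel, CLIPImageProcessor

# Either an official hub id, or a local directory produced by save_pretrained(),
# e.g. "./checkpoints/YOUR_VISION_TOWER" (placeholder path).
vision_tower_name = "openai/clip-vit-large-patch14"

model = CLIPVisionModel.from_pretrained(vision_tower_name)
processor = CLIPImageProcessor.from_pretrained(vision_tower_name)
```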
When building the codebase of continuous denoisers, we refer to x-flux. Thanks for their wonderful project. Notably, we do NOT use their pre-trained weights.
This repository is released under the Apache 2.0 License.
@article{ma2025genhancer,
title={GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers},
author={Ma, Shijie and Ge, Yuying and Wang, Teng and Guo, Yuxin and Ge, Yixiao and Shan, Ying},
journal={arXiv preprint arXiv:2503.19480},
year={2025}
}
If you have further questions, feel free to contact me: mashijie9817@gmail.com
Discussions and potential collaborations are also welcome.