Shijie Ma1,2,
Yuying Ge1,✉,
Teng Wang1,
Yuxin Guo1,2,
Yixiao Ge1,
Ying Shan1
1ARC Lab, Tencent PCG,
2Institute of Automation, CAS
How do generative models effectively help discriminative models?
We present in-depth explorations and propose a novel two-stage post-training strategy to enhance CLIP ViT's visual representations.
Our method is applicable to both continuous and discrete denoisers and does not require any pre-trained weights.
- [2025-03-27] Training codes with continuous denoisers are released! 🔥🔥🔥
- [2025-03-26] arXiv paper is made publicly available.
- [2025-03-24] Release evaluation codes. 🔥
- [2025-03-24] Release model weights on Huggingface🤗. 🔥🔥🔥
- [2025-03-24] Release the project page of this repo.
- Release training codes of continuous denoisers.
- Release training codes of discrete denoisers.
Recent works demonstrate the feasibility of enhancing visual representations with generative models, where generative models take visual tokens as conditions and perform reconstruction. However, the underlying principle remains underexplored.
We empirically reveal that perfect generation (reconstruction) does not always yield desirable visual representations, as shown below:
In this work, we delve into three aspects to explore the critical factors: (1) conditioning mechanisms, (2) denoising configurations, and (3) generation paradigms.
We propose a two-stage post-training method to enhance CLIP ViT's fine-grained visual representations, which is efficient (with only lightweight denoisers) and versatile (applicable to both continuous and discrete denoisers). The pipeline of our method is illustrated below:
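For intuition, the sketch below illustrates the general idea of conditioning a lightweight denoiser on CLIP ViT visual tokens and using the denoising (reconstruction) loss to post-train the ViT. This is a highly simplified, hypothetical sketch: all module names, shapes, the corruption scheme, and hyper-parameters are assumptions for illustration, not the actual implementation in this repository.

```python
# Schematic sketch (PyTorch) of the core idea: CLIP ViT visual tokens condition a
# lightweight denoiser, and the denoising loss also updates the ViT.
# Everything below (architecture, shapes, noise schedule) is an illustrative assumption.
import torch
import torch.nn as nn

class LightweightDenoiser(nn.Module):
    """A small conditional denoiser; NOT the actual architecture used in this repo."""
    def __init__(self, dim=768, depth=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, cond_tokens):
        # Cross-attend from noisy targets to the ViT's visual tokens (conditions).
        return self.out(self.decoder(noisy_tokens, cond_tokens))

def post_train_step(vit, denoiser, clean_targets, images, optimizer):
    """One illustrative step: corrupt the targets, denoise them under ViT conditions."""
    cond = vit(images)  # visual tokens from the CLIP ViT, assumed shape (B, N, D)
    noise = torch.randn_like(clean_targets)
    t = torch.rand(clean_targets.size(0), 1, 1, device=clean_targets.device)
    noisy = (1 - t) * clean_targets + t * noise   # simple interpolation-style corruption
    pred = denoiser(noisy, cond)
    loss = nn.functional.mse_loss(pred, clean_targets)
    optimizer.zero_grad()
    loss.backward()   # gradients flow into the ViT as well, post-training its features
    optimizer.step()
    return loss.item()
```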
Important
We empirically find that, for visual representations, a visually perfect generative model is neither optimal nor necessary.
Our method employs only lightweight generative models and does NOT require any pre-trained weights, making it efficient and avoiding potential privacy and copyright issues.
We release the enhanced CLIP weights on Huggingface🤗.
| CLIP Backbone | MMVP-VLM (Original) | MMVP-VLM (Ours) | Link |
|---|---|---|---|
| OpenAICLIP ViT-L-14@224 | 19.3 | 31.9 | 🤗 |
| OpenAICLIP ViT-L-14@336 | 20.0 | 29.6 | 🤗 |
| MetaCLIP ViT-L-14@224 | 23.7 | 31.9 | 🤗 |
| MetaCLIP ViT-H-14@224 | 25.2 | 37.0 | 🤗 |
| SigLIP ViT-SO-14@224 | 37.8 | 42.2 | 🤗 |
| SigLIP ViT-SO-14@384 | 37.0 | 40.0 | 🤗 |
Please refer to the corresponding directories for more details.
For the continuous denoiser, navigate into Continuous.
For the discrete denoiser, navigate into Discrete.
Please first download the MMVP-VLM benchmark.
We provide evaluation scripts for six CLIP backbones. An example for OpenAICLIP@224 is shown below:
python evaluation/evaluate_mmvp_OpenAICLIP_224.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'
Note
Please specify `--vision_tower_name` as your trained CLIP model, which is conventionally saved via `save_pretrained()`.
If you want to evaluate an official CLIP model such as OpenAICLIP@224, you could specify `--vision_tower_name` as the official `hf_repo_id`, e.g., `openai/clip-vit-large-patch14`.
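For reference, a vision tower saved via `save_pretrained()` (or an official CLIP checkpoint) can typically be loaded with the standard `transformers` API. The snippet below is a minimal sketch for the OpenAICLIP backbones; the local path is a placeholder, and the SigLIP backbones would use the corresponding `Siglip*` classes instead.

```python
from transformers import CLIPVisionModel, CLIPImageProcessor

# Either an official hub id, or a local directory produced by save_pretrained(),
# e.g. "./checkpoints/YOUR_VISION_TOWER" (placeholder path).
vision_tower_name = "openai/clip-vit-large-patch14"

model = CLIPVisionModel.from_pretrained(vision_tower_name)
processor = CLIPImageProcessor.from_pretrained(vision_tower_name)
```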
When building the codebase of continuous denoisers, we refer to x-flux. Thanks for their wonderful project. Notably, we do NOT use their pre-trained weights.
This repository is released under the Apache 2.0 License.
@article{ma2025genhancer,
title={GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers},
author={Ma, Shijie and Ge, Yuying and Wang, Teng and Guo, Yuxin and Ge, Yixiao and Shan, Ying},
journal={arXiv preprint arXiv:2503.19480},
year={2025}
}
If you have further questions, feel free to contact me: mashijie9817@gmail.com
Discussions and potential collaborations are also welcome.