[Paper] [BibTex] [HuggingFace]
The huge parameter counts of large-scale models lead to unaffordable computation and memory costs. We analyze popular transformer architectures and find that the multilayer perceptron (MLP) modules account for the majority of model parameters. Motivated by this, we focus on the recoverability of compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method that significantly reduces the parameters of large vision transformers with only negligible performance degradation.
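As a rough, back-of-the-envelope illustration of why the MLP dominates (an estimate, not a figure from the paper): in a standard ViT block with width d and MLP expansion ratio 4, the attention projections hold about 4d² parameters while the MLP holds about 8d², i.e. roughly two thirds of the block.

```python
# Back-of-the-envelope parameter count for one standard ViT block
# (illustrative only; real models also have biases, norms, patch embedding, etc.).
def block_params(d: int, mlp_ratio: float = 4.0) -> dict:
    attn = 4 * d * d                   # q, k, v and output projections
    mlp = int(2 * mlp_ratio * d * d)   # fc1: d -> ratio*d, fc2: ratio*d -> d
    return {"attention": attn, "mlp": mlp, "mlp_fraction": mlp / (attn + mlp)}

print(block_params(1408))  # e.g. d = 1408 -> the MLP holds ~67% of block parameters
```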
conda create -n DGMR python=3.9
conda activate DGMR
pip install -r requirement.txt
For convenience, we organize the hyper-parameters in `*.yaml` files under `./configs`. To run the code, please edit these parameters according to your environment.
For the distillation of pruned Open CLIP models, you need to set `teacher.pretrained`, `student.pretrained`, and `data.root` in the configuration file `configs/open_clip/distill_openclip.yaml`.
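If you prefer to fill in these fields programmatically, the minimal PyYAML sketch below shows the idea; the nesting (`teacher.pretrained`, `student.pretrained`, `data.root`) is assumed to mirror the dotted names above, so check the actual file before relying on it.

```python
import yaml  # pip install pyyaml

cfg_path = "configs/open_clip/distill_openclip.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# Assumed key layout mirroring teacher.pretrained / student.pretrained / data.root.
cfg["teacher"]["pretrained"] = "/path/to/teacher_checkpoint.pt"
cfg["student"]["pretrained"] = "/path/to/pruned_student_checkpoint.pt"
cfg["data"]["root"] = "/path/to/training_images"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```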
python prune.py --prune_method diversity \
--arch $model_arch \
--pretrained $model_path \
--mlp_ratio 1 \
--output_path $output_path
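For intuition only, the sketch below shows one generic way to shrink an MLP to a smaller hidden width by keeping a structurally diverse subset of hidden neurons (greedy farthest-point selection on the fc1 weight rows). It is an illustrative stand-in, not the exact diversity criterion implemented in prune.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def prune_mlp_diverse(fc1: nn.Linear, fc2: nn.Linear, keep: int):
    """Keep `keep` hidden neurons chosen by greedy farthest-point selection
    on the normalized fc1 weight rows, then slice fc1/fc2 accordingly."""
    w = F.normalize(fc1.weight, dim=1)            # (hidden, d), one row per neuron
    dist = 1.0 - w @ w.t()                        # pairwise cosine distance
    idx = [int(fc1.weight.norm(dim=1).argmax())]  # seed with the largest neuron
    for _ in range(keep - 1):
        min_dist = dist[:, idx].min(dim=1).values  # distance to the selected set
        min_dist[idx] = -1.0                       # never re-select a chosen neuron
        idx.append(int(min_dist.argmax()))
    idx = torch.tensor(sorted(idx))

    new_fc1 = nn.Linear(fc1.in_features, keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(keep, fc2.out_features, bias=fc2.bias is not None)
    new_fc1.weight.copy_(fc1.weight[idx])
    if fc1.bias is not None:
        new_fc1.bias.copy_(fc1.bias[idx])
    new_fc2.weight.copy_(fc2.weight[:, idx])
    if fc2.bias is not None:
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```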
NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
torchrun \
--nproc_per_node=$NGPU \
--master-port=29511 distill.py \
--config_file /path/to/config \
--frame $frame
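The distillation objective itself lives in distill.py; as a rough sketch of a typical feature-distillation setup (an assumption, not the repo's exact loss), the pruned student's visual features are regressed onto the frozen teacher's:

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor):
    """Toy feature-distillation objective: cosine + MSE between student
    and (frozen) teacher visual embeddings."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    cos = 1.0 - (s * t).sum(dim=-1).mean()
    mse = F.mse_loss(student_feats, teacher_feats)
    return cos + mse

# Per-batch usage sketch:
#   with torch.no_grad():
#       t_feats = teacher.encode_image(images)
#   loss = feature_distill_loss(student.encode_image(images), t_feats)
```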
- Zero-Shot Classification On ImageNet1K
NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
torchrun --nnodes=1 --nproc_per_node=$NGPU --master-port=29502 \
eval/clip_eval_zsc_ddp_all.py \
--model $model \
--pretrained $ckptpath \
--batch_size $batch_size \
--save_clf /path/to/clf \
--dataset imagenet1k \
--dataset_root /path/to/ILSVRC2012
Tips: For the first evaluation, pass the `save_clf` argument so that the text embeddings for zero-shot classification are saved. For later evaluations, set the `load_clfs` argument to the previously saved `save_clf` path to skip running the text encoder.
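For reference, the cached-classifier workflow boils down to the following (a minimal sketch with recent open_clip APIs and placeholder paths; the evaluation script additionally handles prompt ensembles and DDP):

```python
import torch
import open_clip

# Placeholder model name and paths; substitute your pruned checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-g-14", pretrained="/path/to/checkpoint.pt")
tokenizer = open_clip.get_tokenizer("ViT-g-14")
model.eval()

# Build the zero-shot classifier from class-name prompts (the part that
# --save_clf caches; --load_clfs reuses it on later runs).
classnames = ["tench", "goldfish"]  # ...all 1000 ImageNet-1K classes in practice
with torch.no_grad():
    text = tokenizer([f"a photo of a {c}" for c in classnames])
    clf = torch.nn.functional.normalize(model.encode_text(text), dim=-1)
torch.save(clf, "/path/to/clf")
# clf = torch.load("/path/to/clf")  # later runs

# Classify images by cosine similarity to the cached text embeddings.
images = torch.randn(4, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    img = torch.nn.functional.normalize(model.encode_image(images), dim=-1)
pred = (img @ clf.t()).argmax(dim=-1)
```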
- Zero-Shot Retrieval On COCO
python clip/clip_benchmark/cli.py eval \
--model $model \
--model_type $model_type \
--pretrained $ckptpath \
--language "en" \
--task "zeroshot_retrieval" \
--dataset "mscoco_captions" \
--dataset_root $coco_dataset_path \
--batch_size $batch_size \
--output $output_path \
--num_workers 2
- Zero-Shot Retrieval On Flickr30k
python clip/clip_benchmark/cli.py eval \
--model $model \
--model_type $model_type \
--pretrained $ckpt_path \
--language "en" \
--task "zeroshot_retrieval" \
--dataset "flickr30k" \
--dataset_root $flickr30k_dataset_path \
--batch_size $batch_size \
--output $output_dir \
--num_workers 2
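Both retrieval benchmarks follow the same protocol: encode all images and captions, rank by cosine similarity, and report recall@k. A condensed sketch of the metric (illustrative, not the clip_benchmark implementation):

```python
import torch

def recall_at_k(image_feats, text_feats, text_to_image, k=5):
    """Text-to-image recall@k over L2-normalized CLIP features.
    text_to_image[i] is the index of the ground-truth image for caption i."""
    sims = text_feats @ image_feats.t()   # (num_texts, num_images)
    topk = sims.topk(k, dim=1).indices    # top-k candidate images per caption
    gt = torch.as_tensor(text_to_image).unsqueeze(1)
    return (topk == gt).any(dim=1).float().mean().item()
```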
- kNN
NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
torchrun --nnodes=1 --nproc_per_node=$NGPU --master-port=29502 \
eval/eval_knn.py \
--frame $frame \
--config_file $config_file \
--output_dir $output_dir \
--batch_size $batch_size \
--weight $ckpt_path \
--arch $arch
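The kNN evaluation corresponds roughly to the single-process sketch below (the script itself adds distributed feature extraction over ImageNet-1K); features are assumed to be L2-normalized:

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20):
    """Cosine-similarity kNN: majority vote over the k nearest training features."""
    sims = test_feats @ train_feats.t()   # (num_test, num_train)
    nn_idx = sims.topk(k, dim=1).indices  # indices of the k nearest neighbors
    nn_labels = train_labels[nn_idx]      # (num_test, k)
    return nn_labels.mode(dim=1).values   # predicted class per test sample
```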
Pruning and distillation are applied only to the vision encoder, which inevitably changes its output feature space. We therefore finetune the text encoder to realign it with the pruned vision encoder and restore full performance.
NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
torchrun --nnodes=1 --nproc_per_node=$NGPU --master-port=29501 \
finetune.py \
--config_file $config_file \
--frame eva_clip
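Conceptually, this stage freezes the pruned-and-distilled vision encoder and updates only the text tower with the standard CLIP contrastive loss; a minimal sketch of that setup (assumed, not the exact finetune.py loop):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired image/text features."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Freeze the vision tower so only the text encoder is updated:
#   for p in model.visual.parameters():
#       p.requires_grad = False
```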
- Distillation for Vision Encoder
Arch | weights | zero-shot classification | zero-shot retrieval | kNN |
---|---|---|---|---|
ViT-g-14 (original) | Link | 73.0% | 83.8% | 81.7% |
ViT-g-14-prune1 (ours) | Link | 73.0% | 83.8% | 81.9% |
ViT-g-14-prune2 (ours) | Link | 73.2% | 84.1% | 82.1% |
EVA02-CLIP-bigE-14-plus (original) | Link | 80.9% | 85.2% | 85.8% |
EVA02-CLIP-bigE-14-plus-prune1 (ours) | Link | 81.0% | 85.2% | 85.7% |
EVA02-CLIP-bigE-14-plus-prune2 (ours) | Link | 81.1% | 85.3% | 85.8% |
DINO v2 (original) | Link | / | / | 83.5% |
DINO v2 (ours) | Link | / | / | 83.5% |
- Finetuning Text Encoder
Arch | weights | zero-shot classification | zero-shot retrieval |
---|---|---|---|
ViT-g-14 (original) | Link | 73.0% | 83.8% |
ViT-g-14-prune1 (ours) | Link | 73.0% | 84.0% |
ViT-g-14-prune2 (ours) | Link | 73.1% | 84.3% |
EVA02-CLIP-bigE-14-plus (original) | Link | 80.9% | 85.2% |
EVA02-CLIP-bigE-14-plus-prune1 (ours) | Link | 81.0% | 86.1% |
EVA02-CLIP-bigE-14-plus-prune2 (ours) | Link | 81.1% | 86.3% |
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
@article{shen2025diversity,
author = {Shen, Chengchao and Zhu, Hourun and Fang, Gongfan and Wang, Jianxin and Wang, Xinchao},
title = {Diversity-Guided MLP Reduction for Efficient Large Vision Transformers},
journal = {arXiv preprint arXiv:2506.07138},
year = {2025},
}