Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

[Paper] [BibTex] [HuggingFace]

Large-scale model parameters lead to unaffordable computing and memory costs. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules account for the majority of model parameters. Motivated by this, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method that significantly reduces the parameters of large vision transformers with only negligible performance degradation.
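
To see why the MLP dominates, note that a ViT-style block with an MLP ratio of 4 spends roughly twice as many parameters on its two MLP projections as on its four attention projections. The back-of-the-envelope check below is illustrative only (typical ViT dimensions, biases omitted):

```python
# Back-of-the-envelope check: per-block parameter split in a ViT-style transformer
# (illustrative embed_dim, mlp_ratio = 4, biases omitted).
embed_dim, mlp_ratio = 1024, 4
attn_params = 4 * embed_dim * embed_dim              # q, k, v and output projections
mlp_params = 2 * mlp_ratio * embed_dim * embed_dim   # fc1 (d -> 4d) and fc2 (4d -> d)
share = 100 * mlp_params / (attn_params + mlp_params)
print(f"attention: {attn_params / 1e6:.1f}M  MLP: {mlp_params / 1e6:.1f}M  ({share:.0f}% in the MLP)")
# -> roughly two thirds of each block's parameters sit in the MLP at mlp_ratio = 4
```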

1. Installation

conda create -n DGMR python=3.9
conda activate DGMR
pip install -r requirement.txt

2. Configuration

For convenience, we organize the hyper-parameters in *.yaml files under ./configs. To run the code, please edit these parameters according to your environment.

For the distillation of pruned OpenCLIP models, set teacher.pretrained, student.pretrained, and data.root in the configuration file configs/open_clip/distill_openclip.yaml.
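
A minimal sketch for double-checking that these fields are filled in before launching a run; the nesting of the dotted keys (e.g. teacher → pretrained) is an assumption about the YAML layout:

```python
# Minimal sketch for verifying the required fields before launching distillation.
# Assumption: the dotted keys map to nested YAML sections (e.g. teacher: pretrained: ...).
import yaml

cfg_path = "configs/open_clip/distill_openclip.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

for dotted in ("teacher.pretrained", "student.pretrained", "data.root"):
    node = cfg
    for key in dotted.split("."):
        node = node[key]                      # KeyError here means the field is missing
    assert node, f"{dotted} is still empty -- edit {cfg_path} first"
    print(f"{dotted}: {node}")
```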

3. MLP Reduction

python prune.py --prune_method diversity \
      --arch $model_arch \
      --pretrained $model_path \
      --mlp_ratio 1 \
      --output_path $output_path
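
Here --mlp_ratio 1 requests a pruned MLP whose hidden width equals the embedding dimension (the standard meaning of the MLP ratio in ViTs). The snippet below is only a rough sketch of what diversity-guided hidden-neuron selection can look like, using greedy farthest-point selection on the fc1 weight rows by cosine similarity; the actual criterion implemented in prune.py may differ.

```python
# Rough sketch of diversity-guided hidden-neuron selection for one MLP block.
# Assumption (not taken from prune.py): greedy farthest-point selection on the rows of fc1
# by cosine similarity; the paper's actual diversity criterion may differ.
import torch

def select_diverse_neurons(fc1_weight: torch.Tensor, keep: int) -> torch.Tensor:
    """fc1_weight: (hidden_dim, embed_dim). Returns indices of `keep` hidden neurons."""
    w = torch.nn.functional.normalize(fc1_weight, dim=1)       # unit-norm neuron weight vectors
    sim = w @ w.t()                                            # pairwise cosine similarity
    selected = [int(fc1_weight.norm(dim=1).argmax())]          # start from the largest-norm neuron
    for _ in range(keep - 1):
        # pick the neuron least similar to everything already selected
        max_sim_to_sel = sim[:, selected].max(dim=1).values
        max_sim_to_sel[selected] = float("inf")
        selected.append(int(max_sim_to_sel.argmin()))
    return torch.tensor(sorted(selected))

embed_dim, hidden_dim = 1024, 4096                             # e.g. a ViT block with mlp_ratio = 4
fc1 = torch.nn.Linear(embed_dim, hidden_dim)
fc2 = torch.nn.Linear(hidden_dim, embed_dim)

idx = select_diverse_neurons(fc1.weight.data, keep=embed_dim)  # mlp_ratio 1 -> keep embed_dim neurons
fc1_pruned = torch.nn.Linear(embed_dim, embed_dim)
fc2_pruned = torch.nn.Linear(embed_dim, embed_dim)
fc1_pruned.weight.data = fc1.weight.data[idx]
fc1_pruned.bias.data = fc1.bias.data[idx]
fc2_pruned.weight.data = fc2.weight.data[:, idx]
fc2_pruned.bias.data = fc2.bias.data.clone()
```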

4. Distillation

NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
torchrun \
    --nproc_per_node=$NGPU \
    --master-port=29511 distill.py \
    --config_file /path/to/config \
    --frame $frame
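
distill.py trains the pruned (student) vision encoder to recover the original (teacher) model's behavior; the exact objective is specified in the config file. As a rough mental model only, the step below matches student features to frozen teacher features with an MSE loss, which is an assumption rather than the repo's implementation:

```python
# Rough mental model of one distillation step (not the exact loss in distill.py):
# the pruned student vision encoder is trained to reproduce the frozen teacher's
# image features on unlabeled images.
import torch

def distill_step(student, teacher, images, optimizer):
    teacher.eval()
    with torch.no_grad():
        t_feat = teacher(images)                        # frozen teacher features
    s_feat = student(images)                            # pruned student features
    loss = torch.nn.functional.mse_loss(s_feat, t_feat) # optimizer holds only student params
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```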

5. Evaluation

  • Zero-Shot Classification On ImageNet1K
NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)

torchrun --nnodes=1 --nproc_per_node=$NGPU --master-port=29502 \
	eval/clip_eval_zsc_ddp_all.py \
    --model $model \
    --pretrained $ckptpath \
    --batch_size $batch_size \
    --save_clf /path/to/clf \
    --dataset imagenet1k \
    --dataset_root  /path/to/ILSVRC2012

Tip: For the first evaluation, pass the save_clf parameter so that the text embeddings used for zero-shot classification are saved. For later evaluations, set the load_clfs parameter to the previously saved save_clf path to skip running the text encoder (see the sketch at the end of this section).

  • Zero-Shot Retrieval On COCO
python clip/clip_benchmark/cli.py eval \
  --model $model \
  --model_type $model_type \
  --pretrained $ckptpath \
  --language "en" \
  --task "zeroshot_retrieval" \
  --dataset "mscoco_captions" \
  --dataset_root $coco_dataset_path \
  --batch_size $batch_size \
  --output $output_path \
  --num_workers 2
  • Zero-Shot Retrieval On Flickr30k
python clip/clip_benchmark/cli.py eval \
  --model $model \
  --model_type $model_type \
  --pretrained $ckpt_path \
  --language "en" \
  --task "zeroshot_retrieval" \
  --dataset "flickr30k" \
  --dataset_root $flickr30k_dataset_path \
  --batch_size $batch_size \
  --output $output_dir \
  --num_workers 2
  • kNN
NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)

torchrun --nnodes=1 --nproc_per_node=$NGPU --master-port=29502 \
	eval/eval_knn.py \
	--frame $frame \
    --config_file $config_file \
    --output_dir $output_dir \
    --batch_size $batch_size \
    --weight $ckpt_path \
    --arch $arch
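
For intuition on the save_clf / load_clfs tip above: the zero-shot "classifier" is just the matrix of text embeddings for the class prompts, so it only needs to be computed once per model and can then be cached. The helper below is hypothetical; its names and prompt template are not taken from the repo's eval scripts.

```python
# Hypothetical sketch of the classifier caching behind --save_clf / --load_clfs.
import torch
import torch.nn.functional as F

def build_zero_shot_classifier(model, tokenizer, class_names, device="cuda"):
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer(prompts).to(device))
    return F.normalize(text_feat, dim=-1)          # (num_classes, feat_dim)

# first run:   clf = build_zero_shot_classifier(...); torch.save(clf, clf_path)   # --save_clf
# later runs:  clf = torch.load(clf_path)                                         # --load_clfs
# prediction:  logits = F.normalize(image_features, dim=-1) @ clf.t()
```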

6. Finetuning Text Encoder

Pruning and distillation are applied only to the vision encoder, which inevitably shifts its output features relative to the original model. We therefore finetune the text encoder so that the two encoders are re-aligned and the full model recovers its performance.

NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)

torchrun --nnodes=1 --nproc_per_node=$NGPU --master-port=29501 \
	finetune.py \
	--config_file $config_file \
	--frame eva_clip
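
Conceptually, the pruned vision encoder stays frozen here and only the text encoder is updated so that its features re-align with the shifted image features. The step below is a sketch under that assumption, using the standard CLIP contrastive loss; the actual recipe is defined by the config file.

```python
# Conceptual sketch only: the pruned vision encoder stays frozen and only the text
# encoder's parameters are passed to the optimizer; the standard CLIP contrastive
# loss pulls the text features back into alignment with the image features.
import torch
import torch.nn.functional as F

def finetune_text_step(model, images, texts, optimizer, temperature=0.07):
    with torch.no_grad():
        img = F.normalize(model.encode_image(images), dim=-1)   # frozen vision features
    txt = F.normalize(model.encode_text(texts), dim=-1)         # trainable text features
    logits = img @ txt.t() / temperature
    labels = torch.arange(images.size(0), device=logits.device)
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```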

7. Model Zoo

  • Distillation for Vision Encoder

| Arch | Weights | Zero-shot classification | Zero-shot retrieval | kNN |
| --- | --- | --- | --- | --- |
| ViT-g-14 (original) | Link | 73.0% | 83.8% | 81.7% |
| ViT-g-14-prune1 (ours) | Link | 73.0% | 83.8% | 81.9% |
| ViT-g-14-prune2 (ours) | Link | 73.2% | 84.1% | 82.1% |
| EVA02-CLIP-bigE-14-plus (original) | Link | 80.9% | 85.2% | 85.8% |
| EVA02-CLIP-bigE-14-plus-prune1 (ours) | Link | 81.0% | 85.2% | 85.7% |
| EVA02-CLIP-bigE-14-plus-prune2 (ours) | Link | 81.1% | 85.3% | 85.8% |
| DINO v2 (original) | Link | / | / | 83.5% |
| DINO v2 (ours) | Link | / | / | 83.5% |

  • Finetuning Text Encoder

| Arch | Weights | Zero-shot classification | Zero-shot retrieval |
| --- | --- | --- | --- |
| ViT-g-14 (original) | Link | 73.0% | 83.8% |
| ViT-g-14-prune1 (ours) | Link | 73.0% | 84.0% |
| ViT-g-14-prune2 (ours) | Link | 73.1% | 84.3% |
| EVA02-CLIP-bigE-14-plus (original) | Link | 80.9% | 85.2% |
| EVA02-CLIP-bigE-14-plus-prune1 (ours) | Link | 81.0% | 86.1% |
| EVA02-CLIP-bigE-14-plus-prune2 (ours) | Link | 81.1% | 86.3% |

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@article{shen2025diversity,
  author  = {Shen, Chengchao and Zhu, Hourun and Fang, Gongfan and Wang, Jianxin and Wang, Xinchao},
  title   = {Diversity-Guided MLP Reduction for Efficient Large Vision Transformers},
  journal = {arXiv preprint arXiv:2506.07138},
  year    = {2025},
}