Vision-Language Transformers (VLTs) have achieved remarkable success, but their computational costs remain a challenge due to the large number of input tokens and extensive model parameters. Existing VLT compression methods rely primarily on single-modality-based token pruning or coarse-grained weight pruning. However, these methods face significant obstacles: they ignore the critical alignment between modalities, lack the flexibility to dynamically compress each layer during token pruning, suffer inevitable performance degradation from coarse-grained weight pruning, and struggle to compress input tokens and model parameters simultaneously. To address these limitations, we propose MADTP++, a novel approach that integrates custom-made token and weight pruning processes into a unified framework, achieving superior compression in both parameter count and computational cost.
Official implementation of MADTP++: Bridge the Gap between Token and Weight Pruning for Accelerating VLTs.
- (April 30, 2025) We released the implementation and scripts of MADTP. (Note that checkpoints and logs will come soon.) [Code] 🚩
The code is tested on `Pytorch==1.11.0`, `cuda==11.3.1`, and `python==3.8.13`. The dependencies can be installed by:
```bash
conda env create -f environment.yml
```
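After the environment is created, activate it and verify the installation. This is a minimal sketch: the environment name `madtp` below is an assumption, so check the `name:` field in `environment.yml` for the actual value.

```bash
# Assumed environment name; see the `name:` field in environment.yml.
conda activate madtp

# Sanity check: print the installed PyTorch and CUDA versions
# (expected: 1.11.0 and 11.3, as noted above).
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```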
Type | Supported Tasks | Supported Models | Supported Datasets |
---|---|---|---|
Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
Multi-modal | Image Captioning | BLIP (instructions) | COCO Caption |
Multi-modal | Visual Question Answering | BLIP (instructions) | VQAv2 |
Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
Please refer to the MADTP codebase for data organization and training/evaluation instructions.
```
├── annotation
│ ├── answer_list.json
│ ├── coco_gt
│ │ ├── coco_karpathy_test_gt.json
│ │ └── coco_karpathy_val_gt.json
│ ├── ...
├── clip
├── compress_caption_dtp.py
├── compress_nlvr_dtp.py
├── compress ...
├── configs
├── data
├── datasets
│ └── vision
│ ├── coco
│ ├── flickr
│ ├── NLVR2
│ ├── ...
├── log
├── models
├── output
├── pretrained
│ ├── bert-base-uncased
│ ├── clip_large_retrieval_coco.pth
│ ├── clip_large_retrieval_flickr.pth
│ ├── ...
├──
├── transform
└── utils.py
```
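As a quick-start illustration, compression runs are launched through the task-specific scripts above (e.g., `compress_nlvr_dtp.py`). The command below is a hedged sketch: the distributed launcher is standard PyTorch, but the script arguments and file paths (`--config`, `--pretrained`, `--output_dir`, and the config/checkpoint names) are assumptions, so check the MADTP codebase instructions for the exact flags and values.

```bash
# Sketch of a compression run on NLVR2 with 8 GPUs.
# torch.distributed.run is the standard PyTorch launcher; the script flags and
# paths below are assumed for illustration and may differ from the actual interface.
python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py \
    --config ./configs/nlvr.yaml \
    --pretrained ./pretrained/model_base_nlvr.pth \
    --output_dir ./output/nlvr_nlvr2_compression
```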
This code is built upon BLIP, CLIP, UPop, and timm. We thank the original authors for their open-source work.
If you find this work useful, please consider citing the corresponding paper:
```bibtex
@article{cao2025madtp-plus,
  title={MADTP++: Bridge the Gap between Token and Weight Pruning for Accelerating VLTs},
  author={Cao, Jianjian and Yu, Chong and Ye, Peng and Chen, Tao},
  year={2025}
}

@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```