MADTP++ is a novel approach that integrates tailored token and weight pruning processes into a unified framework, achieving superior compression in both parameter counts and computational costs

double125/MADTP-plus

MADTP++: Bridge the Gap between Token and Weight Pruning for Accelerating VLTs

[Paper] [ArXiv] [Code]

Vision-Language Transformers (VLTs) have achieved remarkable success, but their computational costs remain a challenge because of the large number of input tokens and extensive model parameters. Existing VLT compression methods rely primarily on single-modality token pruning or coarse-grained weight pruning. These methods have three main limitations: token pruning ignores the critical alignment between modalities and lacks the flexibility to dynamically compress each layer; coarse-grained weight pruning inevitably degrades performance; and neither approach handles the simultaneous compression of input tokens and model parameters. To address these limitations, we propose MADTP++, a novel approach that integrates tailored token and weight pruning processes into a unified framework, achieving superior compression in both parameter counts and computational costs.
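
For readers new to the two compression families named above, the following is a minimal, generic sketch of what "token pruning" and "weight pruning" mean. It is not the MADTP++ procedure, which is described in the paper; the function names `prune_tokens` and `prune_weights`, the top-k selection rule, and the magnitude criterion are illustrative assumptions.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token importance score; how such a score
    is computed is exactly what a pruning method has to define.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [tokens[i] for i in keep]


def prune_weights(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(ranked[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

Token pruning shortens the input sequence and so reduces computation, while weight pruning sparsifies the parameters themselves. A unified framework has to decide how the two interact, which this toy sketch does not attempt.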

Official implementation of MADTP++: Bridge the Gap between Token and Weight Pruning for Accelerating VLTs.

What's New 🥳

  • (April 30, 2025) We released the implementation and scripts of MADTP. (Note that checkpoints and logs will come soon.) [Code] 🚩

Installation

The code is tested with PyTorch==1.11.0, CUDA==11.3.1, and Python==3.8.13. The dependencies can be installed with:

conda env create -f environment.yml
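
After creating the environment, it can help to confirm that the installed versions match the ones listed above. The script below is a minimal sketch, not part of the repository; the helper names `find_mismatches` and `collect_versions` are hypothetical. It assumes PyTorch reports its CUDA toolkit as `11.3` (major.minor only), so the comparison uses prefix matching.

```python
import sys

# Versions quoted in this README. Prefix matching tolerates build suffixes
# such as "1.11.0+cu113" and the major.minor CUDA string reported by PyTorch.
EXPECTED = {"python": "3.8.13", "torch": "1.11.0", "cuda": "11.3"}


def find_mismatches(found, expected=EXPECTED):
    """Return {name: (found, expected)} for every entry that does not match."""
    mismatches = {}
    for name, want in expected.items():
        have = found.get(name)
        if have is None or not str(have).startswith(want):
            mismatches[name] = (have, want)
    return mismatches


def collect_versions():
    """Gather installed versions; torch entries stay None if torch is absent."""
    found = {"python": "%d.%d.%d" % sys.version_info[:3], "torch": None, "cuda": None}
    try:
        import torch
        found["torch"] = torch.__version__
        found["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return found


if __name__ == "__main__":
    for name, (have, want) in find_mismatches(collect_versions()).items():
        print(f"{name}: found {have}, README lists {want}")
```

A mismatch does not necessarily mean the code will fail; it only flags a difference from the tested configuration.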

Supported Tasks, Models, and Datasets

| Type | Supported Tasks | Supported Models | Supported Datasets |
| --- | --- | --- | --- |
| Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
| Multi-modal | Image Caption | BLIP (instructions) | COCO Caption |
| Multi-modal | Visual Question Answer | BLIP (instructions) | VQAv2 |
| Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
| Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |

Please refer to the MADTP codebase for data organization, training, and evaluation.

Expected Folder Structures

├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip                                               
├── compress_caption_dtp.py             
├── compress_nlvr_dtp.py                  
├── compress ...    
├── configs                                             
├── data                                        
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2     
│       ├── ...                                                                               
├── log                                     
├── models            
├── output                                    
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...       
├──                                                                                
├── transform                                                                           
└── utils.py                                
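
The layout above can be checked before launching a run. The following is a minimal sketch under stated assumptions: `missing_paths` is a hypothetical helper, not a script shipped with the repository; the path list is copied from the tree above; elided entries (`...`) are not guessed; and `log` and `output` are left out on the assumption that they are created at run time.

```python
from pathlib import Path

# Entries copied from the tree above; elided entries ("...") are not guessed.
EXPECTED_PATHS = [
    "annotation/answer_list.json",
    "annotation/coco_gt/coco_karpathy_test_gt.json",
    "annotation/coco_gt/coco_karpathy_val_gt.json",
    "clip",
    "configs",
    "data",
    "datasets/vision/coco",
    "datasets/vision/flickr",
    "datasets/vision/NLVR2",
    "models",
    "pretrained/bert-base-uncased",
    "pretrained/clip_large_retrieval_coco.pth",
    "pretrained/clip_large_retrieval_flickr.pth",
    "transform",
    "utils.py",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths("."):
        print(f"missing: {rel}")
```

Run it from the repository root. Only the datasets and checkpoints needed for the task at hand have to be present, so some reported entries may be safe to ignore.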

Acknowledgments

This code is built upon BLIP, CLIP, UPop, and timm. We thank the original authors for their open-source work.

Citation

If you find this work useful, please consider citing the corresponding paper:

@article{cao2025madtp-plus,
  title={MADTP++: Bridge the Gap between Token and Weight Pruning for Accelerating VLTs},
  author={Cao, Jianjian and Yu, Chong and Ye, Peng and Chen, Tao},
  year={2025}
}
@article{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  journal={IEEE Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
