Vision-Language Transformers (VLTs) have achieved remarkable success, but their computational costs remain a challenge due to the large number of input tokens and extensive model parameters. Existing VLT compression methods rely primarily on single-modality-based token pruning or coarse-grained weight pruning. However, these methods face significant obstacles: they ignore the critical alignment between modalities, lack the flexibility to dynamically compress each layer during token pruning, suffer inevitable performance degradation from coarse-grained weight pruning, and struggle to compress input tokens and model parameters simultaneously. To address these limitations, we propose MADTP++, a novel approach that integrates custom-made token and weight pruning processes into a unified framework, achieving superior compression in both parameter count and computational cost.
Official implementation of MADTP++: Bridge the Gap between Token and Weight Pruning for Accelerating VLTs.
- (April 30, 2025) We released the implementation and scripts of MADTP. (Note that checkpoints and logs will come soon.) [Code] 🚩
The code is tested on `Pytorch==1.11.0`, `cuda==11.3.1`, and `python==3.8.13`. The dependencies can be installed by:
```bash
conda env create -f environment.yml
```
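After the environment is created, activate it and verify the installation. This is a minimal sketch: the environment name `madtp` below is an assumption, so check the `name:` field in `environment.yml` for the actual value.

```bash
# Assumed environment name; see the `name:` field in environment.yml.
conda activate madtp

# Sanity check: print the installed PyTorch and CUDA versions
# (expected: 1.11.0 and 11.3, as noted above).
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```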
Type | Supported Tasks | Supported Models | Supported Datasets |
---|---|---|---|
Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
Multi-modal | Image Captioning | BLIP (instructions) | COCO Caption |
Multi-modal | Visual Question Answering | BLIP (instructions) | VQAv2 |
Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
Please refer to the MADTP codebase for data organization and training/evaluation instructions.
```
├── annotation
│ ├── answer_list.json
│ ├── coco_gt
│ │ ├── coco_karpathy_test_gt.json
│ │ └── coco_karpathy_val_gt.json
│ ├── ...
├── clip
├── compress_caption_dtp.py
├── compress_nlvr_dtp.py
├── compress ...
├── configs
├── data
├── datasets
│ └── vision
│ ├── coco
│ ├── flickr
│ ├── NLVR2
│ ├── ...
├── log
├── models
├── output
├── pretrained
│ ├── bert-base-uncased
│ ├── clip_large_retrieval_coco.pth
│ ├── clip_large_retrieval_flickr.pth
│ ├── ...
├──
├── transform
└── utils.py
```
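As a quick-start illustration, compression runs are launched through the task-specific scripts above (e.g., `compress_nlvr_dtp.py`). The command below is a hedged sketch: the distributed launcher is standard PyTorch, but the script arguments and file paths (`--config`, `--pretrained`, `--output_dir`, and the config/checkpoint names) are assumptions, so check the MADTP codebase instructions for the exact flags and values.

```bash
# Sketch of a compression run on NLVR2 with 8 GPUs.
# torch.distributed.run is the standard PyTorch launcher; the script flags and
# paths below are assumed for illustration and may differ from the actual interface.
python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py \
    --config ./configs/nlvr.yaml \
    --pretrained ./pretrained/model_base_nlvr.pth \
    --output_dir ./output/nlvr_nlvr2_compression
```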
This code is built upon BLIP, CLIP, UPop, and timm. We thank the original authors for their open-source work.
If you find this work useful, please consider citing the corresponding paper:
```bibtex
@article{cao2025madtp-plus,
  title={MADTP++: Bridge the Gap between Token and Weight Pruning for Accelerating VLTs},
  author={Cao, Jianjian and Yu, Chong and Ye, Peng and Chen, Tao},
  year={2025}
}

@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```