Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
Xuyang Liu¹, Ziming Wang², Yuhang Han³, Yingyao Wang², Jiale Yuan², Jun Song²✉, Bo Zheng², Linfeng Zhang⁴, Siteng Huang⁵, Honggang Chen¹✉
¹Sichuan University, ²Taobao & Tmall Group of Alibaba, ³Northwestern Polytechnical University, ⁴Shanghai Jiao Tong University, ⁵Zhejiang University
- 2025.05.21 🤗🤗 We release our latest work VidCom2, a plug-and-play inference acceleration method for VideoLLMs. Code is available!
- 2025.01.10 🤗🤗 We release our work GlobalCom2, a "global-to-local" approach for training-free acceleration of high-resolution LVLMs. Code is available!
TLDR: We present GlobalCom2, a novel plug-and-play token compression method for high-resolution LVLMs that evaluates the information richness of crops from a global perspective to preserve informative regions while removing redundancy.
The two key functions in `llava/model/llava_arch.py` implement our global-guided local compression: (a) `generate_scale_for_crop_features` allocates optimal retention ratios based on each crop's global importance, and (b) `interpolate_and_split_cls_attn_scores` performs token compression within each crop, guided by importance scores from the global perspective.
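To make the global-to-local idea concrete, here is a minimal sketch. It assumes CLS-attention scores are available for the downscaled global view and for each crop; `generate_scale_for_crop_features` borrows its name from the repo, but both bodies (and the helper `compress_crop_tokens`) are simplified illustrations, not the actual implementation in `llava/model/llava_arch.py`:

```python
import torch

def generate_scale_for_crop_features(global_cls_attn, base_ratio=0.25):
    """Sketch: allocate a retention ratio to each crop from global importance.

    global_cls_attn: (num_crops,) CLS-attention mass that the downscaled
    global view assigns to the region covered by each crop.
    """
    num_crops = global_cls_attn.shape[0]
    importance = global_cls_attn / global_cls_attn.sum()
    # More important crops keep more tokens; the mean retention across
    # crops stays near base_ratio.
    return (base_ratio * num_crops * importance).clamp(max=1.0)

def compress_crop_tokens(crop_tokens, crop_cls_attn, retention_ratio):
    """Sketch: keep the top-k tokens of one crop, ranked by importance."""
    k = max(1, int(crop_tokens.shape[0] * retention_ratio))
    keep = crop_cls_attn.topk(k).indices.sort().values  # preserve spatial order
    return crop_tokens[keep]

# Example: 4 crops of 576 tokens each (LLaVA-NeXT-style numbers, assumed).
crops = [torch.randn(576, 1024) for _ in range(4)]
global_attn = torch.tensor([0.1, 0.4, 0.3, 0.2])
ratios = generate_scale_for_crop_features(global_attn)
per_token_attn = [torch.rand(576) for _ in range(4)]
compressed = [compress_crop_tokens(c, a, r)
              for c, a, r in zip(crops, per_token_attn, ratios)]
```

In this sketch, the per-crop `crop_cls_attn` stands in for what `interpolate_and_split_cls_attn_scores` produces in the actual pipeline; its name suggests upsampling the global CLS-attention map and splitting it across crops.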
- Clone this repository:

```bash
git clone https://github.com/xuyang-liu16/GlobalCom2.git
cd GlobalCom2
```
- Environment setup and preparation:

```bash
conda create -n GlobalCom2 python=3.10 -y
conda activate GlobalCom2
pip install -e .
```
- Download the multimodal benchmarks. Please follow the detailed instructions in LLaVA-Evaluation.
- Download LLaVA-NeXT-7B and LLaVA-NeXT-13B and put them under `./liuhaotian/llava-next-7b` and `./liuhaotian/llava-next-13b`.
For users with limited access to Hugging Face (e.g., from mainland China), you can refer to this alternative guide and use the following commands, with LLaVA-NeXT-7B as an example:
```bash
pip install -U huggingface_hub hf_transfer -i https://mirrors.aliyun.com/pypi/simple/
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download liuhaotian/llava-v1.6-vicuna-7b --local-dir ./liuhaotian/llava-next-7b
```
The only hyper-parameter is `retention_ratio`, defined at line 101 of `llava/model/llava_arch.py`. You can achieve different acceleration effects by setting different `retention_ratio` values (default: `retention_ratio = 0.25`).
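For reference, the knob is a single float; a minimal sketch of what the setting looks like (only the name and default value come from the repo, the comments are illustrative assumptions):

```python
# llava/model/llava_arch.py, line 101 (default shown).
# Fraction of visual tokens kept after compression; lower values give more
# acceleration. Assuming LLaVA-NeXT's 576 tokens per crop (CLIP ViT-L/336
# encoder), 0.25 keeps roughly 144 tokens per crop on average.
retention_ratio = 0.25
```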
Example for evaluating TextVQA results:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

Example for evaluating MME results:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```
To calculate the theoretical computational efficiency shown above, we recommend the methodology presented in LLM-Viewer. We deeply appreciate their outstanding contribution to this field.
To visualize the compression performance shown above, we recommend the visualization tools provided in `tools`, which include mask-visualization and attention-score-visualization utilities. We hope these tools help in understanding the compression mechanism.
If our findings help your research, please consider citing our paper in your publications.
```bibtex
@article{Liu2025:GlobalCom,
  title={Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration},
  author={Xuyang Liu and Ziming Wang and Yuhang Han and Yingyao Wang and Jiale Yuan and Jun Song and Bo Zheng and Linfeng Zhang and Siteng Huang and Honggang Chen},
  year={2025},
  eprint={2501.05179},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
We extend our gratitude to the open-source efforts of LLaVA and LLM-Viewer.
For any questions about our paper or code, please email liuxuyang@stu.scu.edu.cn.