- [2025/8/7] We have released notebooks for producing visualizations in our paper.
- [2025/8/4] We have released our paper and code for LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL.
Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% of the tokens and maintaining 99.5% accuracy with just 11.1% of the tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9×.
- We identify that the token representations evolve continuously and in an orderly fashion, with adjacent layers showing similar patterns. This indicates a hierarchical attention structure, where different phases of layers focus on different levels of image content.
- We first plot token rankings according to their attention scores. A clear pattern emerges: the middle layers of the encoder consistently focus on the main object. This is further confirmed by the IoU between the object mask and the top 10% high-attention tokens on the COCO dataset.
- High-attention tokens in the deep layers are distributed uniformly across the whole image, indicating that the deep layers focus on tokens carrying global information, which is consistent with many previous works.
- To draw a clear dividing line between layers, we characterize attention patterns by the average pairwise spatial distance between high-attention tokens (see the sketch below). Using this dispersion curve, we categorize CLIP layers into three regions: shallow (1–4), middle (5–9), and deep (10+), capturing a transition from noise to object detail and finally to global information.
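As a reference, below is a minimal sketch of this dispersion measure, assuming per-layer patch attention scores have already been extracted from the CLIP vision tower; the function name, the top 10% ratio, and the square patch grid are illustrative assumptions rather than the repository's actual code.

```python
import torch

def attention_dispersion(attn_scores: torch.Tensor, grid_size: int, top_ratio: float = 0.1) -> float:
    """Average pairwise spatial distance between the top-`top_ratio` attended patch tokens.

    attn_scores: (num_patches,) attention received by each patch token in one layer.
    grid_size:   side length of the patch grid (e.g., 24 for a 24x24 = 576-token layout).
    """
    k = max(2, int(top_ratio * attn_scores.numel()))
    top_idx = attn_scores.topk(k).indices                          # high-attention tokens
    coords = torch.stack((top_idx // grid_size, top_idx % grid_size), dim=1).float()
    dists = torch.cdist(coords, coords)                            # pairwise Euclidean distances
    return (dists.sum() / (k * (k - 1))).item()                    # mean over ordered pairs

# Low dispersion: attention clusters on the main object (middle layers).
# High dispersion: attention spreads over the whole image (deep layers).
```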
To better clarify our method, we first define three types of tokens:
- Anchor tokens are the tokens with the highest attention scores in the middle layers of the vision encoder. These layers tend to focus on object features, which manifests as higher attention scores for object-related tokens. Anchor tokens therefore encode rich, detailed information about the raw image.
- Buffer tokens are spatially adjacent to anchor tokens in the original image. As many studies indicate, noise exists in the attention maps of ViTs: although most high-attention tokens concentrate on the main object (e.g., the surfer in our example image), a few are scattered across the image and can mislead anchor selection. Buffer tokens supply additional detailed information and mitigate this noise.
- Register tokens receive the top attention scores in the output layer of the vision encoder. In the deep layers, high-attention tokens are distributed uniformly across the image, serving as an ideal indicator of global information. Register tokens provide critical global context.
HiPrune retains these three types of tokens and discards the rest before they reach the LLM. Token identification is an ordered process, moving from anchor tokens to buffer tokens and finally register tokens, guided by the hierarchical attention of the vision encoder; a minimal selection sketch is shown below.
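The following Python sketch illustrates this ordered selection, assuming per-token attention scores have already been extracted from the object-centric (middle) layer and the output (deep) layer. The function name, the anchor/buffer split, and the 8-neighbourhood buffer rule are illustrative assumptions, not the exact released implementation.

```python
import torch

def hiprune_select(attn_object: torch.Tensor,  # (N,) attention per patch in the object-centric layer
                   attn_deep: torch.Tensor,    # (N,) attention per patch in the output (deep) layer
                   grid_size: int,             # patch grid side length (e.g., 24 for 576 tokens)
                   budget: int,                # total number of visual tokens to keep
                   alpha: float = 0.1) -> torch.Tensor:
    """Return indices of retained tokens: anchors and buffers (an alpha share of the
    budget) plus registers (the remaining budget)."""
    num_anchor_buffer = max(1, int(alpha * budget))
    num_anchor = max(1, num_anchor_buffer // 4)        # assumed anchor/buffer split, for illustration

    # 1) Anchor tokens: highest attention in the object-centric (middle) layer.
    anchors = attn_object.topk(num_anchor).indices

    # 2) Buffer tokens: spatial neighbours of the anchors on the patch grid (8-connectivity).
    rows, cols = anchors // grid_size, anchors % grid_size
    neighbour_ids = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            r, c = rows + dr, cols + dc
            valid = (r >= 0) & (r < grid_size) & (c >= 0) & (c < grid_size)
            neighbour_ids.append(r[valid] * grid_size + c[valid])
    buffers = torch.unique(torch.cat(neighbour_ids))
    buffers = buffers[~torch.isin(buffers, anchors)][: num_anchor_buffer - num_anchor]
    keep = torch.unique(torch.cat([anchors, buffers]))

    # 3) Register tokens: highest attention in the deep layer among tokens not yet kept.
    masked = attn_deep.clone()
    masked[keep] = float("-inf")
    registers = masked.topk(max(budget - keep.numel(), 0)).indices
    return torch.cat([keep, registers])
```

Here `alpha` and the choice of object layer correspond to the HIPRUNE_ALPHA and HIPRUNE_OBJECT_LAYER variables described in the hyperparameter table below.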
Environment Setup. LLaVA and Qwen require different versions of transformers, so please set up the corresponding environment before running them.
- Environment setup for LLaVA-1.5 and LLaVA-NeXT
bash setup_llava.sh
- Environment setup for Qwen2.5-VL
bash setup_qwen.sh
Data Access. Before starting your evaluation, please log in with your Hugging Face token to gain access to certain datasets.
huggingface-cli login
Before starting evaluation on LLaVA-1.5, please follow the official LLaVA instructions to prepare data for MMB, MMBCN, TextVQA, and VQAv2.
- Accuracy results for LLaVA-1.5
bash bench_llava.sh
- Accuracy results for LLaVA-NeXT
bash bench_llava_next.sh
- Efficiency results
python bench_sys.py
- FLOPs results
python flops.py
- Accuracy results for Qwen2.5-VL
bash bench_qwen.sh
If you want to change hyperparameters, simply set the environment variables listed in the table below (a usage example follows the table).
Model | Environment Variable | Range | Default | Description
---|---|---|---|---
LLaVA-1.5 | HIPRUNE_RETENTION | 1-576 | 192/128/64 | Token budget
LLaVA-1.5 | HIPRUNE_ALPHA | 0-1 | 0.1 | Proportion of anchor and buffer tokens
LLaVA-1.5 | HIPRUNE_OBJECT_LAYER | 1-24 | 9 | Object layer used to choose anchor and buffer tokens
LLaVA-NeXT | HIPRUNE_RETENTION | 1-2880 | 640/320/160 | Token budget
LLaVA-NeXT | HIPRUNE_ALPHA | 0-1 | 0.1 | Proportion of anchor and buffer tokens
LLaVA-NeXT | HIPRUNE_OBJECT_LAYER | 1-24 | 9 | Object layer used to choose anchor and buffer tokens
Qwen2.5-VL | HIPRUNE_QWEN_RETENTION | 0-1 | 0.334/0.223/0.112 | Token retention ratio
Qwen2.5-VL | HIPRUNE_ALPHA | 0-1 | 0.1 | Proportion of anchor and buffer tokens
Qwen2.5-VL | HIPRUNE_OBJECT_LAYER | 1-24 | 16 | Object layer used to choose anchor and buffer tokens
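These are ordinary environment variables, so they can be exported in the shell before invoking a benchmark script (e.g., `HIPRUNE_RETENTION=192 bash bench_llava.sh`) or set programmatically. Below is a minimal Python sketch of such a launch; the specific values simply mirror the defaults above.

```python
import os
import subprocess

# Example: evaluate LLaVA-1.5 with a 192-token budget, a 10% anchor/buffer share,
# and layer 9 as the object-centric layer (values mirror the defaults above).
env = os.environ.copy()
env.update({
    "HIPRUNE_RETENTION": "192",
    "HIPRUNE_ALPHA": "0.1",
    "HIPRUNE_OBJECT_LAYER": "9",
})
subprocess.run(["bash", "bench_llava.sh"], env=env, check=True)
```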
This repository is built on LLaVA, FasterVLM, and lmms-eval. We acknowledge their outstanding work!
If you find our work helpful, please consider leaving a star ⭐ and citing our paper.
@article{liu2025hi,
title={HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models},
author={Liu, Jizhihui and Du, Feiyi and Zhu, Guangdao and Lian, Niu and Li, Jun and Chen, Bin},
journal={arXiv preprint arXiv:2508.00553},
year={2025}
}