- [2025/8/7] We have released notebooks for producing visualizations in our paper.
- [2025/8/4] We have released our paper and code for LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL.
Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% of the tokens and maintaining 99.5% accuracy with just 11.1% of the tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9×.
- We identify that the token representations evolve continuously and in an orderly fashion, with adjacent layers showing similar patterns. This indicates a hierarchical attention structure, where different phases of layers focus on different levels of image content.
- We first plot token rankings according to their attention scores. A clear pattern emerges: the middle layers of the encoder consistently focus on the main object. This is further confirmed by the IoU between the object mask and the top 10% high-attention tokens on the COCO dataset.
- High-attention tokens in the deep layers are distributed uniformly across the whole image, indicating that the deep layers focus on tokens carrying global information, which is consistent with many previous works.
- To draw a clear dividing line between layers, we characterize attention patterns by the average pairwise spatial distance between high-attention tokens (see the sketch below). Using this dispersion curve, we categorize CLIP layers into three regions: shallow (1–4), middle (5–9), and deep (10+), capturing a transition from noise to object detail and finally to global information.
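As a reference, below is a minimal sketch of this dispersion measure, assuming per-layer patch attention scores have already been extracted from the CLIP vision tower; the function name, the top 10% ratio, and the square patch grid are illustrative assumptions rather than the repository's actual code.

```python
import torch

def attention_dispersion(attn_scores: torch.Tensor, grid_size: int, top_ratio: float = 0.1) -> float:
    """Average pairwise spatial distance between the top-`top_ratio` attended patch tokens.

    attn_scores: (num_patches,) attention received by each patch token in one layer.
    grid_size:   side length of the patch grid (e.g., 24 for a 24x24 = 576-token layout).
    """
    k = max(2, int(top_ratio * attn_scores.numel()))
    top_idx = attn_scores.topk(k).indices                          # high-attention tokens
    coords = torch.stack((top_idx // grid_size, top_idx % grid_size), dim=1).float()
    dists = torch.cdist(coords, coords)                            # pairwise Euclidean distances
    return (dists.sum() / (k * (k - 1))).item()                    # mean over ordered pairs

# Low dispersion: attention clusters on the main object (middle layers).
# High dispersion: attention spreads over the whole image (deep layers).
```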
To better clarify our method, we first define three types of tokens:
- Anchor tokens are the tokens with the highest attention scores in the middle layers of the vision encoder. These layers tend to focus on object features, which manifests as higher attention scores for object-related tokens. Anchor tokens therefore encode rich, detailed information about the raw image.
- Buffer tokens are spatially adjacent to anchor tokens in the original image. As many studies indicate, noise exists in the attention maps of ViTs: although most high-attention tokens concentrate on the main object (e.g., the surfer in our example image), a few are scattered across the image and can mislead anchor selection. Buffer tokens supply additional detailed information and mitigate this noise.
- Register tokens receive the top attention scores in the output layer of the vision encoder. In the deep layers, high-attention tokens are distributed uniformly across the image, serving as an ideal indicator of global information. Register tokens provide critical global context.
HiPrune retains these three types of tokens and discards the rest before they reach the LLM. Token identification is an ordered process, moving from anchor tokens to buffer tokens and finally register tokens, guided by the hierarchical attention of the vision encoder; a minimal selection sketch is shown below.
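The following Python sketch illustrates this ordered selection, assuming per-token attention scores have already been extracted from the object-centric (middle) layer and the output (deep) layer. The function name, the anchor/buffer split, and the 8-neighbourhood buffer rule are illustrative assumptions, not the exact released implementation.

```python
import torch

def hiprune_select(attn_object: torch.Tensor,  # (N,) attention per patch in the object-centric layer
                   attn_deep: torch.Tensor,    # (N,) attention per patch in the output (deep) layer
                   grid_size: int,             # patch grid side length (e.g., 24 for 576 tokens)
                   budget: int,                # total number of visual tokens to keep
                   alpha: float = 0.1) -> torch.Tensor:
    """Return indices of retained tokens: anchors and buffers (an alpha share of the
    budget) plus registers (the remaining budget)."""
    num_anchor_buffer = max(1, int(alpha * budget))
    num_anchor = max(1, num_anchor_buffer // 4)        # assumed anchor/buffer split, for illustration

    # 1) Anchor tokens: highest attention in the object-centric (middle) layer.
    anchors = attn_object.topk(num_anchor).indices

    # 2) Buffer tokens: spatial neighbours of the anchors on the patch grid (8-connectivity).
    rows, cols = anchors // grid_size, anchors % grid_size
    neighbour_ids = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            r, c = rows + dr, cols + dc
            valid = (r >= 0) & (r < grid_size) & (c >= 0) & (c < grid_size)
            neighbour_ids.append(r[valid] * grid_size + c[valid])
    buffers = torch.unique(torch.cat(neighbour_ids))
    buffers = buffers[~torch.isin(buffers, anchors)][: num_anchor_buffer - num_anchor]
    keep = torch.unique(torch.cat([anchors, buffers]))

    # 3) Register tokens: highest attention in the deep layer among tokens not yet kept.
    masked = attn_deep.clone()
    masked[keep] = float("-inf")
    registers = masked.topk(max(budget - keep.numel(), 0)).indices
    return torch.cat([keep, registers])
```

Here `alpha` and the choice of object layer correspond to the HIPRUNE_ALPHA and HIPRUNE_OBJECT_LAYER variables described in the hyperparameter table below.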
Environment Setup. LLaVA and Qwen require different versions of transformers, so please set up the corresponding environment before running them.
- Environment setup for LLaVA-1.5 and LLaVA-NeXT
bash setup_llava.sh
- Environment setup for Qwen2.5-VL
bash setup_qwen.sh
Data Access. Before starting your evaluation, please log in with your Hugging Face token to gain access to certain datasets.
huggingface-cli login
Before starting evaluation on LLaVA-1.5, please follow the official LLaVA instructions to prepare data for MMB, MMBCN, TextVQA, and VQAv2.
- Accuracy results for LLaVA-1.5
bash bench_llava.sh
- Accuracy results for LLaVA-NeXT
bash bench_llava_next.sh
- Efficiency results
python bench_sys.py
- FLOPs results
python flops.py
- Accuracy results for Qwen2.5-VL
bash bench_qwen.sh
If you want to change hyperparameters, simply set the environment variables listed in the table below (a usage example follows the table).
Model | Environment Variable | Range | Default | Description
---|---|---|---|---
LLaVA-1.5 | HIPRUNE_RETENTION | 1-576 | 192/128/64 | Token budget
LLaVA-1.5 | HIPRUNE_ALPHA | 0-1 | 0.1 | Proportion of anchor and buffer tokens
LLaVA-1.5 | HIPRUNE_OBJECT_LAYER | 1-24 | 9 | Object layer used to choose anchor and buffer tokens
LLaVA-NeXT | HIPRUNE_RETENTION | 1-2880 | 640/320/160 | Token budget
LLaVA-NeXT | HIPRUNE_ALPHA | 0-1 | 0.1 | Proportion of anchor and buffer tokens
LLaVA-NeXT | HIPRUNE_OBJECT_LAYER | 1-24 | 9 | Object layer used to choose anchor and buffer tokens
Qwen2.5-VL | HIPRUNE_QWEN_RETENTION | 0-1 | 0.334/0.223/0.112 | Token retention ratio
Qwen2.5-VL | HIPRUNE_ALPHA | 0-1 | 0.1 | Proportion of anchor and buffer tokens
Qwen2.5-VL | HIPRUNE_OBJECT_LAYER | 1-24 | 16 | Object layer used to choose anchor and buffer tokens
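These are ordinary environment variables, so they can be exported in the shell before invoking a benchmark script (e.g., `HIPRUNE_RETENTION=192 bash bench_llava.sh`) or set programmatically. Below is a minimal Python sketch of such a launch; the specific values simply mirror the defaults above.

```python
import os
import subprocess

# Example: evaluate LLaVA-1.5 with a 192-token budget, a 10% anchor/buffer share,
# and layer 9 as the object-centric layer (values mirror the defaults above).
env = os.environ.copy()
env.update({
    "HIPRUNE_RETENTION": "192",
    "HIPRUNE_ALPHA": "0.1",
    "HIPRUNE_OBJECT_LAYER": "9",
})
subprocess.run(["bash", "bench_llava.sh"], env=env, check=True)
```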
This repository is built on LLaVA, FasterVLM, and lmms-eval. We acknowledge their outstanding work!
If you find our work helpful, please consider leaving a star ⭐ and citing our paper.
@article{liu2025hi,
title={HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models},
author={Liu, Jizhihui and Du, Feiyi and Zhu, Guangdao and Lian, Niu and Li, Jun and Chen, Bin},
journal={arXiv preprint arXiv:2508.00553},
year={2025}
}