This repository is the official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models".
[Paper] [BibTeX] [HuggingFace]
LLaVA-STF explores the spatial redundancy among vision tokens and shortens the vision token sequence for inference acceleration by fusing spatially adjacent tokens into one.
Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the Spatial Token Fusion (STF) and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capability.
Experimental results demonstrate that our method, built on LLaVA-1.5, achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks while using only 25% of the baseline's vision tokens.
The main results are illustrated in the figure below.
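For intuition, here is a minimal sketch of the spatial token fusion idea, assuming the LLaVA-1.5 setting of 576 vision tokens (a 24×24 grid from CLIP ViT-L/14 at 336px): each 2×2 neighborhood is concatenated along the channel dimension and projected back to the original width, yielding 144 tokens (25%). The module name, fusion window, and linear projection below are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class SpatialTokenFusion2x2(nn.Module):
    """Illustrative 2x2 spatial token fusion (a sketch, not the repo's exact module)."""

    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        # Concatenate the window*window neighbors along channels, then project back to `dim`.
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim); num_tokens must form a square grid
        b, n, d = tokens.shape
        side, w = int(n ** 0.5), self.window
        x = tokens.view(b, side, side, d)
        # Split the grid into (side/w x side/w) blocks of w x w tokens each.
        x = x.view(b, side // w, w, side // w, w, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // w) ** 2, w * w * d)
        return self.proj(x)  # (batch, num_tokens / 4, dim)

if __name__ == "__main__":
    vision_tokens = torch.randn(1, 576, 1024)            # 24x24 grid of CLIP features
    fused = SpatialTokenFusion2x2(dim=1024)(vision_tokens)
    print(fused.shape)                                    # torch.Size([1, 144, 1024])
```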
- Clone this repository and navigate to the LLaVA folder:

  ```bash
  git clone [link]
  cd LLaVA
  ```
- Install the package:

  ```bash
  conda create -n llava python=3.10 -y
  conda activate llava
  pip install --upgrade pip
  pip install -e .
  ```
- Install additional packages for training:

  ```bash
  pip install -e ".[train]"
  pip install flash-attn --no-build-isolation
  ```
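As an optional sanity check after installation, the snippet below only verifies that the editable install is importable and that PyTorch sees the GPU; nothing here is specific to this repository.

```python
# Optional post-install sanity check: verify the editable install and the CUDA build of PyTorch.
import torch
import llava  # installed by `pip install -e .`

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("llava imported from:", llava.__file__)
```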
Following the original LLaVA, we conduct two-stage training: a pretraining stage for feature alignment and a full-parameter fine-tuning stage for visual instruction tuning. The training details are as follows.
- Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
- Run the following command to pretrain the model:

  ```bash
  bash scripts/v1_5/pretrain.sh
  ```
- Run the following command to fine-tune the model:

  ```bash
  bash scripts/v1_5/finetune.sh
  ```
We use a set of hyperparameters similar to the original LLaVA. The hyperparameters used in pretraining and fine-tuning are provided below.
- Pretraining
| Model | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 256 | 1e-3 | 1 | 2048 | 0 |
- Fine-tuning
| Model | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 128 | 2e-5 | 1 | 2048 | 0 |
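The global batch size above is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs. The splits below (8 GPUs with 32 and 16 samples per device) are illustrative assumptions; if you train on different hardware, adjust `per_device_train_batch_size` and `gradient_accumulation_steps` in the scripts so the product still matches the table.

```python
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Effective global batch size = per-device batch size * grad accumulation steps * #GPUs."""
    return per_device * grad_accum * num_gpus

# Illustrative splits that reproduce the table's values on a single 8-GPU node.
assert global_batch_size(per_device=32, grad_accum=1, num_gpus=8) == 256  # pretraining
assert global_batch_size(per_device=16, grad_accum=1, num_gpus=8) == 128  # fine-tuning
```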
The pretrained and fine-tuned checkpoints, together with their benchmark results, are listed below.

| Model | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B (pretrain) | 1 epoch | download | / | / | / | / | / | / | / | / | / |
| LLaVA-v1.5-7B (finetune) | full_ft-1e | download | 78.1 | 61.9 | 51.1 | 70.5 | 57.4 | 86.0 | 1482.8 | 66.2 | 58.9 |
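A downloaded fine-tuned checkpoint can then be loaded for inference. The sketch below assumes this fork keeps LLaVA's `load_pretrained_model` builder interface, and the checkpoint path is a placeholder for wherever you saved the download.

```python
# A minimal loading sketch, assuming the fork keeps LLaVA's builder interface.
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "checkpoints/llava-v1.5-7b-finetune"  # placeholder path for the downloaded checkpoint
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
print(type(model).__name__, "| context length:", context_len)
```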
We evaluate models on the following 9 benchmarks.
- Download `test2015` and put it under `./playground/data/eval/vqav2`.
- Multi-GPU inference (a sketch of the per-GPU chunking the eval scripts rely on appears after this list):

  ```bash
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
  ```

- Submit the results to the evaluation server.
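The multi-GPU evaluation scripts parallelize by giving each GPU a slice of the question file and merging the per-chunk answer files afterwards. The sketch below only illustrates that chunking; the helper and the question-file path are assumptions, not the script's exact code.

```python
import json
import math

def get_chunk(items, num_chunks, chunk_idx):
    """Return the chunk_idx-th of num_chunks roughly equal slices (illustrative helper)."""
    chunk_size = math.ceil(len(items) / num_chunks)
    return items[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]

# Example: split the questions across 8 GPUs; each worker answers its slice and the
# shell script concatenates the per-chunk answer files before submission.
with open("playground/data/eval/vqav2/questions.jsonl") as f:  # file name is an assumption
    questions = [json.loads(line) for line in f]

for gpu_id in range(8):
    chunk = get_chunk(questions, num_chunks=8, chunk_idx=gpu_id)
    print(f"GPU {gpu_id}: {len(chunk)} questions")
```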
- Download the data and evaluation scripts following the official instructions and put them under `./playground/data/eval/gqa/data`.
- Multi-GPU inference:

  ```bash
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
  ```
- Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz`.
- Single-GPU inference:

  ```bash
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
  ```

- Submit the results to the evaluation server.
- Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Single-GPU inference and evaluation:

  ```bash
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
  ```
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa`.
- Single-GPU inference and evaluation:

  ```bash
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
  ```
- Download `coco` from POPE and put it under `./playground/data/eval/pope`.
- Single-GPU inference and evaluation (a sketch of the reported metrics appears after this list):

  ```bash
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
  ```
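POPE asks yes/no questions about object presence, and the evaluation reports accuracy, precision, recall, and F1 over those answers with "yes" as the positive class. The sketch below mirrors those standard metric definitions; it is not the repository's exact evaluation script.

```python
def pope_metrics(predictions, labels):
    """Accuracy / precision / recall / F1 for yes-no answers, with 'yes' as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return {
        "accuracy": (tp + tn) / max(len(pairs), 1),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / max(precision + recall, 1e-8),
    }

print(pope_metrics(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))
```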
- Download the data following the official instructions here.
- Put the downloaded images under `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
- Single-GPU inference and evaluation (a sketch of the scoring appears after this list):

  ```bash
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
  ```
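Each MME image comes with two yes/no questions, and the official `eval_tool` scores every subtask as accuracy plus a stricter accuracy+ that credits an image only when both of its questions are answered correctly; the reported MME number (e.g. 1482.8 above) is the sum of these per-subtask scores over the perception subtasks. A simplified illustration of that per-subtask scoring, under those assumptions:

```python
def mme_subtask_score(per_image_answers):
    """Illustrative MME subtask score: 100 * accuracy + 100 * accuracy+.

    per_image_answers: list of (q1_correct, q2_correct) booleans, one pair per image.
    accuracy+ credits an image only if BOTH of its questions are answered correctly.
    """
    num_images = len(per_image_answers)
    correct_questions = sum(int(q1) + int(q2) for q1, q2 in per_image_answers)
    both_correct = sum(q1 and q2 for q1, q2 in per_image_answers)
    acc = correct_questions / (2 * num_images)
    acc_plus = both_correct / num_images
    return 100 * acc + 100 * acc_plus

# Toy example: 3 images, both questions right for 2 of them -> 150.0
print(mme_subtask_score([(True, True), (True, False), (True, True)]))
```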
- Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench`.
- Single-GPU inference:

  ```bash
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
  ```

- Submit the results to the evaluation server.
- Download `mmbench_dev_cn_20231003.tsv` and put it under `./playground/data/eval/mmbench`.
- Single-GPU inference:

  ```bash
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
  ```

- Submit the results to the evaluation server.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
```bibtex
@article{tang2025compact,
  author  = {Tang, Hao and Shen, Chengchao},
  title   = {Learning Compact Vision Tokens for Efficient Large Multimodal Models},
  journal = {arXiv preprint arXiv:2506.07138},
  year    = {2025},
}
```