Learning Compact Vision Tokens for Efficient Large Multimodal Models

This repository is the official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models".

[Paper] [BibTex] [HuggingFace]

[Figure: framework overview]

LLaVA-STF explores the spatial redundancy among vision tokens and shortens the vision token sequence for inference acceleration, fusing spatially adjacent tokens into one.

Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the spatial token fusion (STF) and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities.
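For intuition, the sketch below shows one possible form of spatial token fusion: each 2×2 window of adjacent vision tokens is concatenated and projected back to a single token, which would shrink LLaVA-1.5's 576 patch tokens (a 24×24 grid from the CLIP ViT-L/336 encoder) to 144, i.e. 25% of the original length. This is only an illustrative toy module, not the repository's actual implementation; it omits the MBTF branch, and the class name and window size are assumptions for the example.

```python
import torch
import torch.nn as nn


class SpatialTokenFusionSketch(nn.Module):
    """Toy sketch: fuse each 2x2 window of spatially adjacent vision tokens
    into one token, reducing a 24x24 grid (576 tokens) to 12x12 (144)."""

    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        # Concatenate the window**2 neighboring tokens, then project back to `dim`.
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), with num_tokens forming a square grid
        b, n, d = tokens.shape
        side = int(n ** 0.5)
        w = self.window
        x = tokens.view(b, side, side, d)
        # Group each w x w spatial window: (b, side//w, w, side//w, w, d)
        x = x.view(b, side // w, w, side // w, w, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // w) ** 2, w * w * d)
        return self.proj(x)  # (batch, num_tokens / 4, dim) for w = 2


if __name__ == "__main__":
    fuse = SpatialTokenFusionSketch(dim=1024)
    vision_tokens = torch.randn(1, 576, 1024)   # e.g. CLIP ViT-L/336 patch tokens
    print(fuse(vision_tokens).shape)            # torch.Size([1, 144, 1024])
```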

Experimental results demonstrate that our method, built on LLaVA-1.5, achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks while using only 25% of the baseline's vision tokens.

The main results are illustrated in the figure below.

[Figure: main results]

Install

  1. Clone this repository and navigate to the LLaVA folder
git clone [link]
cd LLaVA
  2. Install the package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip 
pip install -e .
  3. Install additional packages for training (a quick sanity check of the environment is sketched after these steps)
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
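After installation, a minimal check like the following can confirm that PyTorch sees a GPU and that the optional flash-attn extra was installed. This snippet is a convenience sketch, not part of the repository.

```python
# Minimal post-install sanity check (assumes a CUDA-capable GPU is intended).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    # flash-attn is only needed for training; its absence is fine for inference.
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (required only for training)")
```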

Training

Following the original LLaVA, we conduct two-stage training: a pretraining stage for feature alignment, and a full-parameter fine-tuning stage for visual instruction tuning. The training details are as follows.

  1. Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
  2. Run the following command to pretrain the model:
    bash scripts/v1_5/pretrain.sh
  3. Run the following command to fine-tune the model:
    bash scripts/v1_5/finetune.sh

Hyperparameters

We use a similar set of hyperparameters to the original LLaVA. The hyperparameters used in pretraining and fine-tuning are provided below.

  1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 256 | 1e-3 | 1 | 2048 | 0 |

  2. Fine-tuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 128 | 2e-5 | 1 | 2048 | 0 |
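For reference, the fine-tuning row above maps roughly onto standard HuggingFace TrainingArguments as sketched below. The per-device batch size, gradient accumulation, and output path are assumptions (the global batch size of 128 is the product of per-device batch size, number of GPUs, and accumulation steps), and the 2048 max length is applied to the tokenizer rather than to TrainingArguments; the exact flags live in scripts/v1_5/finetune.sh.

```python
from transformers import TrainingArguments

# Sketch of the fine-tuning hyperparameters as HuggingFace TrainingArguments.
# Assumed decomposition: 8 GPUs x per-device batch 16 x accumulation 1 = 128.
args = TrainingArguments(
    output_dir="./checkpoints/llava-v1.5-7b-stf",  # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.0,
)
# The max length of 2048 is enforced through the tokenizer
# (tokenizer.model_max_length = 2048), not through TrainingArguments.
```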

Model Weights

| Model | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B (pretrain) | 1 epoch | download | / | / | / | / | / | / | / | / | / |
| LLaVA-v1.5-7B (finetune) | full_ft-1e | download | 78.1 | 61.9 | 51.1 | 70.5 | 57.4 | 86.0 | 1482.8 | 66.2 | 58.9 |
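Once the fine-tuned checkpoint has been downloaded locally, it should be loadable with LLaVA's standard loader, since this repository is built on the LLaVA codebase. The snippet below is a hedged sketch: the local path is a placeholder, and it assumes this fork keeps llava.model.builder.load_pretrained_model and llava.mm_utils.get_model_name_from_path unchanged.

```python
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

# Placeholder path to the downloaded fine-tuned checkpoint.
model_path = "./checkpoints/llava-v1.5-7b-stf"

# Returns the tokenizer, the multimodal model, the CLIP image processor,
# and the context length (expected to be 2048, matching training).
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
print(context_len)
```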

Evaluation

We evaluate models on the following 9 benchmarks.

VQAv2

  1. Download test2015 and put it under ./playground/data/eval/vqav2.
  2. Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
  3. Submit the results to the evaluation server.

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under ./playground/data/eval/gqa/data.
  2. Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh

VizWiz

  1. Download test.json and extract test.zip to test. Put them under ./playground/data/eval/vizwiz.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
  3. Submit the results to the evaluation server.

ScienceQA

  1. Under ./playground/data/eval/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to ./playground/data/eval/textvqa.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh

POPE

  1. Download coco from POPE and put it under ./playground/data/eval/pope.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh

MME

  1. Download the data following the official instructions here.
  2. Put the downloaded images under MME_Benchmark_release_version.
  3. Put the official eval_tool and MME_Benchmark_release_version under ./playground/data/eval/MME.
  4. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh

MMBench

  1. Download mmbench_dev_20230712.tsv and put under ./playground/data/eval/mmbench.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
  3. Submit the results to the evaluation server.

MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put under ./playground/data/eval/mmbench.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
  3. Submit the results to the evaluation server.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@article{tang2025compact,
  author  = {Tang, Hao and Shen, Chengchao},
  title   = {Learning Compact Vision Tokens for Efficient Large Multimodal Models},
  journal = {arXiv preprint arXiv:2506.07138},
  year    = {2025},
}