Learning Compact Vision Tokens for Efficient Large Multimodal Models

This repository is the official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models".

[Paper] [BibTex] [HuggingFace]

[Figure: framework overview]

LLaVA-STF explores the spatial redundancy among vision tokens and shortens the vision token sequence for inference acceleration, fusing spatially adjacent tokens into one.

Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the spatial token fusion (STF) and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities.
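For intuition, the sketch below shows one possible form of spatial token fusion: each 2×2 window of adjacent vision tokens is concatenated and projected back to a single token, which would shrink LLaVA-1.5's 576 patch tokens (a 24×24 grid from the CLIP ViT-L/336 encoder) to 144, i.e. 25% of the original length. This is only an illustrative toy module, not the repository's actual implementation; it omits the MBTF branch, and the class name and window size are assumptions for the example.

```python
import torch
import torch.nn as nn


class SpatialTokenFusionSketch(nn.Module):
    """Toy sketch: fuse each 2x2 window of spatially adjacent vision tokens
    into one token, reducing a 24x24 grid (576 tokens) to 12x12 (144)."""

    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        # Concatenate the window**2 neighboring tokens, then project back to `dim`.
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), with num_tokens forming a square grid
        b, n, d = tokens.shape
        side = int(n ** 0.5)
        w = self.window
        x = tokens.view(b, side, side, d)
        # Group each w x w spatial window: (b, side//w, w, side//w, w, d)
        x = x.view(b, side // w, w, side // w, w, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // w) ** 2, w * w * d)
        return self.proj(x)  # (batch, num_tokens / 4, dim) for w = 2


if __name__ == "__main__":
    fuse = SpatialTokenFusionSketch(dim=1024)
    vision_tokens = torch.randn(1, 576, 1024)   # e.g. CLIP ViT-L/336 patch tokens
    print(fuse(vision_tokens).shape)            # torch.Size([1, 144, 1024])
```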

Experimental results demonstrate that our method, built on LLaVA-1.5, achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks while using only 25% of the baseline's vision tokens.

The main results are illustrated in the figure below.

[Figure: main results]

Install

  1. Clone this repository and navigate to the LLaVA folder
git clone [link]
cd LLaVA
  2. Install the package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip 
pip install -e .
  3. Install additional packages for training (a quick sanity check of the environment is sketched after these steps)
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
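After installation, a minimal check like the following can confirm that PyTorch sees a GPU and that the optional flash-attn extra was installed. This snippet is a convenience sketch, not part of the repository.

```python
# Minimal post-install sanity check (assumes a CUDA-capable GPU is intended).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    # flash-attn is only needed for training; its absence is fine for inference.
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (required only for training)")
```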

Training

Following the original LLaVA, we conduct two-stage training: a pretraining stage for feature alignment, and a full-parameter fine-tuning stage for visual instruction tuning. The training details are as follows.

  1. Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
  2. Run the following command to pretrain the model:
    bash scripts/v1_5/pretrain.sh
  3. Run the following command to fine-tune the model:
    bash scripts/v1_5/finetune.sh

Hyperparameters

We use a similar set of hyperparameters to the original LLaVA. The hyperparameters used in pretraining and fine-tuning are provided below.

  1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 256 | 1e-3 | 1 | 2048 | 0 |

  2. Fine-tuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 128 | 2e-5 | 1 | 2048 | 0 |
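For reference, the fine-tuning row above maps roughly onto standard HuggingFace TrainingArguments as sketched below. The per-device batch size, gradient accumulation, and output path are assumptions (the global batch size of 128 is the product of per-device batch size, number of GPUs, and accumulation steps), and the 2048 max length is applied to the tokenizer rather than to TrainingArguments; the exact flags live in scripts/v1_5/finetune.sh.

```python
from transformers import TrainingArguments

# Sketch of the fine-tuning hyperparameters as HuggingFace TrainingArguments.
# Assumed decomposition: 8 GPUs x per-device batch 16 x accumulation 1 = 128.
args = TrainingArguments(
    output_dir="./checkpoints/llava-v1.5-7b-stf",  # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.0,
)
# The max length of 2048 is enforced through the tokenizer
# (tokenizer.model_max_length = 2048), not through TrainingArguments.
```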

Model Weights

| Model | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B (pretrain) | 1 epoch | download | / | / | / | / | / | / | / | / | / |
| LLaVA-v1.5-7B (finetune) | full_ft-1e | download | 78.1 | 61.9 | 51.1 | 70.5 | 57.4 | 86.0 | 1482.8 | 66.2 | 58.9 |
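Once the fine-tuned checkpoint has been downloaded locally, it should be loadable with LLaVA's standard loader, since this repository is built on the LLaVA codebase. The snippet below is a hedged sketch: the local path is a placeholder, and it assumes this fork keeps llava.model.builder.load_pretrained_model and llava.mm_utils.get_model_name_from_path unchanged.

```python
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

# Placeholder path to the downloaded fine-tuned checkpoint.
model_path = "./checkpoints/llava-v1.5-7b-stf"

# Returns the tokenizer, the multimodal model, the CLIP image processor,
# and the context length (expected to be 2048, matching training).
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
print(context_len)
```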

Evaluation

We evaluate models on the following 9 benchmarks.

VQAv2

  1. Download test2015 and put it under ./playground/data/eval/vqav2.
  2. Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
  3. Submit the results to the evaluation server.

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under ./playground/data/eval/gqa/data.
  2. Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh

VizWiz

  1. Download test.json and extract test.zip to test. Put them under ./playground/data/eval/vizwiz.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
  3. Submit the results to the evaluation server.

ScienceQA

  1. Under ./playground/data/eval/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to ./playground/data/eval/textvqa.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh

POPE

  1. Download coco from POPE and put it under ./playground/data/eval/pope.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh

MME

  1. Download the data following the official instructions here.
  2. Put the downloaded images under MME_Benchmark_release_version.
  3. Put the official eval_tool and MME_Benchmark_release_version under ./playground/data/eval/MME.
  4. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh

MMBench

  1. Download mmbench_dev_20230712.tsv and put under ./playground/data/eval/mmbench.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
  3. Submit the results to the evaluation server.

MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put under ./playground/data/eval/mmbench.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
  3. Submit the results to the evaluation server.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@article{tang2025compact,
  author  = {Tang, Hao and Shen, Chengchao},
  title   = {Learning Compact Vision Tokens for Efficient Large Multimodal Models},
  journal = {arXiv preprint arXiv:2506.07138},
  year    = {2025},
}