VQ-VLA is a vector-quantization-based action tokenizer built on the largest-scale action trajectory dataset to date, leveraging over 100x more data than previous approaches. It demonstrates that action tokenizers can be scaled effectively with large-scale simulated action data. We show that our action tokenizers improve the performance, inference speed, and long-horizon capabilities of VLA models.
- Installation
- Fine-Tuning VQ-VLA via LoRA
- VQ-VLA Evaluation (LIBERO)
- Acknowledgements
- License
- Citation
- Setting up VQ-VLA Training Environment
```bash
# create conda environment
conda create -n vqvla python=3.10 -y
conda activate vqvla

# install PyTorch (adjust for your CUDA version)
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121

# clone project and install the vqvla repo
git clone https://github.com/xiaoxiao0406/VQ-VLA.git
cd VQ-VLA
pip install -e .

# install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja
ninja --version; echo $?  # verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
```
- Setting up VQ-VLA Evaluation Environment (LIBERO)
```bash
# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .

# install the extra LIBERO evaluation requirements from the VQ-VLA repo
cd ../VQ-VLA
pip install -r experiments/robot/libero/libero_requirements.txt
```
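To confirm the LIBERO install, you can list the available task suites (a minimal check using LIBERO's benchmark API):

```bash
# should print suite names such as libero_90, libero_spatial, libero_object, ...
python -c "from libero.libero import benchmark; print(list(benchmark.get_benchmark_dict().keys()))"
```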
Download the Libero-90 RLDS dataset using the following command:
```bash
huggingface-cli download --resume-download --repo-type dataset VQ-VLA/libero_90_rlds --local-dir <YOUR_DATA_DIRECTORY>
```
Replace <YOUR_DATA_DIRECTORY> with your desired data storage path. Note: if you want to train with your own dataset, you can convert your data to RLDS format by following the code at rlds_dataset_builder.
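To sanity-check the downloaded data, you can load the RLDS dataset with tensorflow_datasets. This is a minimal sketch: the `libero_90_no_noops/1.0.0` subdirectory is an assumption, so adjust it to whatever layout huggingface-cli produced.

```python
# Minimal RLDS inspection sketch; the version subdirectory ("1.0.0") is an assumption.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("<YOUR_DATA_DIRECTORY>/libero_90_no_noops/1.0.0")
ds = builder.as_dataset(split="train")
episode = next(iter(ds))
print(episode["steps"])  # RLDS stores each episode as a nested dataset of steps
```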
Train the action tokenizer (VQ-VAE) with:

```bash
bash scripts/train_action_vqvae.sh <TRAIN_DATASET_NAME> <WANDB_NAME> <YOUR_DATA_DIRECTORY>

# For example:
bash scripts/train_action_vqvae.sh libero_90_no_noops train_vq_libero_90 <YOUR_DATA_DIRECTORY>
```
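For intuition, this is the core vector-quantization step such a tokenizer is built on, sketched with the vector-quantize-pytorch library listed in the acknowledgements. The dimensions and codebook size below are illustrative placeholders, not the repo's actual configuration:

```python
# Illustrative VQ step: continuous action latents -> discrete codebook indices.
# All sizes are placeholders, not VQ-VLA's actual hyperparameters.
import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize(
    dim=64,             # latent dimension of the encoded action chunk
    codebook_size=512,  # number of discrete action tokens
    decay=0.99,         # EMA decay for codebook updates
    commitment_weight=0.25,
)

latents = torch.randn(2, 10, 64)  # (batch, sequence, dim) encoder output
quantized, indices, commit_loss = vq(latents)
print(indices.shape)  # (2, 10): one discrete token index per latent vector
```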
We use LoRA (Low-Rank Adaptation) to fine-tune the VLA model with a total batch size of 16:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune_vqvla.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir <YOUR_DATA_DIRECTORY> \
  --dataset_name <DATASET_NAME> \
  --run_root_dir <PATH_TO_LOG/CHECKPOINT_DIRECTORY> \
  --lora_rank 32 \
  --batch_size 16 \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug True \
  --max_steps 400000 \
  --checkpoint_path <VQ_CHECKPOINT_DIRECTORY>
```
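If you fine-tune on multiple GPUs, you can keep the total batch size at 16 by dividing the per-device batch size accordingly. A sketch, assuming `--batch_size` is per device as in OpenVLA's fine-tuning script:

```bash
# 4 GPUs x per-device batch size 4 = total batch size 16 (assumes --batch_size is per device)
torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune_vqvla.py \
  --batch_size 4 \
  --grad_accumulation_steps 1 \
  ...  # remaining arguments as above
```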
We have fine-tuned OpenVLA on the LIBERO-90 dataset; the model weights are available on Hugging Face: VQ-VLA/openvla-7b-finetuned-libero-90.
We train the action tokenizer (the largest version) on data combined from Open X-Embodiment, RH20T, LIBERO, ManiSkill, and RLBench. The trained action tokenizer weights and the VQ-VLA model fine-tuned on LIBERO-90 are available on Hugging Face:
```bash
huggingface-cli download --resume-download VQ-VLA/vq-vla-weight --local-dir <YOUR_WEIGHT_DIRECTORY>
```
```bash
# LIBERO-90 eval
python experiments/robot/libero/run_libero_eval_vq_vla.py \
  --pretrained_checkpoint "<YOUR_WEIGHT_DIRECTORY>/vq-vla-weight/vqvla_weight" \
  --task_suite_name "libero_90" \
  --vqvae_ckpt "<YOUR_WEIGHT_DIRECTORY>/vq-vla-weight/action_tokenizer_weight/all_data_vq.pth"
```
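To evaluate on the other LIBERO suites, the same script accepts the standard suite names. This sweep is hypothetical in that each suite needs its own matching fine-tuned checkpoint (`<PATH_TO_SUITE_CHECKPOINT>` is a placeholder):

```bash
# hypothetical sweep over LIBERO suites; supply a checkpoint fine-tuned for each suite
for suite in libero_spatial libero_object libero_goal libero_10; do
  python experiments/robot/libero/run_libero_eval_vq_vla.py \
    --pretrained_checkpoint "<PATH_TO_SUITE_CHECKPOINT>" \
    --task_suite_name "$suite" \
    --vqvae_ckpt "<YOUR_WEIGHT_DIRECTORY>/vq-vla-weight/action_tokenizer_weight/all_data_vq.pth"
done
```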
Our work is primarily built upon OpenVLA, Pyramid Flow, VQ-BeT, vector-quantize-pytorch, Open X-Embodiment, RH20T, LIBERO, ManiSkill, and RLBench.
This repository is licensed under the MIT License; see the LICENSE file for details. For any questions, please email tonghe90[at]gmail[dot]com.
If you find our code or models useful in your work, please cite our paper:
```bibtex
@inproceedings{wang25vqvla,
  title={VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers},
  author={Wang, Yating and Zhu, Haoyi and Liu, Mingyu and Yang, Jiange and Fang, Hao-Shu and He, Tong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```