VQ-VLA is a vector-quantization-based action tokenizer built on the largest-scale action trajectory dataset to date, leveraging over 100x more data than previous approaches. It demonstrates that action tokenizers can be scaled effectively with large-scale simulated action data. We show that our action tokenizers improve the performance, inference speed, and long-horizon capabilities of VLA models.
- Installation
- Fine-Tuning VQ-VLA via LoRA
- VQ-VLA Evaluation (LIBERO)
- Acknowledgements
- License
- Citation
- Setting up VQ-VLA Training Environment
```bash
# create conda environment
conda create -n vqvla python=3.10 -y
conda activate vqvla

# install PyTorch (adjust for your CUDA version)
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121

# clone project and install the vqvla repo
git clone https://github.com/xiaoxiao0406/VQ-VLA.git
cd VQ-VLA
pip install -e .

# install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja
ninja --version; echo $?  # verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
```
- Setting up VQ-VLA Evaluation Environment (LIBERO)
```bash
# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .

# install the extra LIBERO evaluation requirements from the VQ-VLA repo
cd ../VQ-VLA
pip install -r experiments/robot/libero/libero_requirements.txt
```
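To confirm the LIBERO install, you can list the available task suites (a minimal check using LIBERO's benchmark API):

```bash
# should print suite names such as libero_90, libero_spatial, libero_object, ...
python -c "from libero.libero import benchmark; print(list(benchmark.get_benchmark_dict().keys()))"
```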
Download the Libero-90 RLDS dataset using the following command:
```bash
huggingface-cli download --resume-download --repo-type dataset VQ-VLA/libero_90_rlds --local-dir <YOUR_DATA_DIRECTORY>
```
Replace <YOUR_DATA_DIRECTORY> with your desired data storage path. Note: if you want to train with your own dataset, you can convert your data to RLDS format by following the code at rlds_dataset_builder.
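To sanity-check the downloaded data, you can load the RLDS dataset with tensorflow_datasets. This is a minimal sketch: the `libero_90_no_noops/1.0.0` subdirectory is an assumption, so adjust it to whatever layout huggingface-cli produced.

```python
# Minimal RLDS inspection sketch; the version subdirectory ("1.0.0") is an assumption.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("<YOUR_DATA_DIRECTORY>/libero_90_no_noops/1.0.0")
ds = builder.as_dataset(split="train")
episode = next(iter(ds))
print(episode["steps"])  # RLDS stores each episode as a nested dataset of steps
```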
Train the action tokenizer (VQ-VAE) with:

```bash
bash scripts/train_action_vqvae.sh <TRAIN_DATASET_NAME> <WANDB_NAME> <YOUR_DATA_DIRECTORY>

# For example:
bash scripts/train_action_vqvae.sh libero_90_no_noops train_vq_libero_90 <YOUR_DATA_DIRECTORY>
```
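For intuition, this is the core vector-quantization step such a tokenizer is built on, sketched with the vector-quantize-pytorch library listed in the acknowledgements. The dimensions and codebook size below are illustrative placeholders, not the repo's actual configuration:

```python
# Illustrative VQ step: continuous action latents -> discrete codebook indices.
# All sizes are placeholders, not VQ-VLA's actual hyperparameters.
import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize(
    dim=64,             # latent dimension of the encoded action chunk
    codebook_size=512,  # number of discrete action tokens
    decay=0.99,         # EMA decay for codebook updates
    commitment_weight=0.25,
)

latents = torch.randn(2, 10, 64)  # (batch, sequence, dim) encoder output
quantized, indices, commit_loss = vq(latents)
print(indices.shape)  # (2, 10): one discrete token index per latent vector
```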
We use LoRA (Low-Rank Adaptation) to fine-tune the VLA model with a total batch size of 16:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune_vqvla.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir <YOUR_DATA_DIRECTORY> \
  --dataset_name <DATASET_NAME> \
  --run_root_dir <PATH_TO_LOG/CHECKPOINT_DIRECTORY> \
  --lora_rank 32 \
  --batch_size 16 \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug True \
  --max_steps 400000 \
  --checkpoint_path <VQ_CHECKPOINT_DIRECTORY>
```
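If you fine-tune on multiple GPUs, you can keep the total batch size at 16 by dividing the per-device batch size accordingly. A sketch, assuming `--batch_size` is per device as in OpenVLA's fine-tuning script:

```bash
# 4 GPUs x per-device batch size 4 = total batch size 16 (assumes --batch_size is per device)
torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune_vqvla.py \
  --batch_size 4 \
  --grad_accumulation_steps 1 \
  ...  # remaining arguments as above
```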
We have fine-tuned OpenVLA on the LIBERO-90 dataset; the model weights are available on Hugging Face: VQ-VLA/openvla-7b-finetuned-libero-90.
We train the action tokenizer (the largest version) on data combined from Open X-Embodiment, RH20T, LIBERO, ManiSkill, and RLBench. The trained action tokenizer weights and the VQ-VLA model fine-tuned on LIBERO-90 are available on Hugging Face:
```bash
huggingface-cli download --resume-download VQ-VLA/vq-vla-weight --local-dir <YOUR_WEIGHT_DIRECTORY>
```
```bash
# LIBERO-90 eval
python experiments/robot/libero/run_libero_eval_vq_vla.py \
  --pretrained_checkpoint "<YOUR_WEIGHT_DIRECTORY>/vq-vla-weight/vqvla_weight" \
  --task_suite_name "libero_90" \
  --vqvae_ckpt "<YOUR_WEIGHT_DIRECTORY>/vq-vla-weight/action_tokenizer_weight/all_data_vq.pth"
```
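To evaluate on the other LIBERO suites, the same script accepts the standard suite names. This sweep is hypothetical in that each suite needs its own matching fine-tuned checkpoint (`<PATH_TO_SUITE_CHECKPOINT>` is a placeholder):

```bash
# hypothetical sweep over LIBERO suites; supply a checkpoint fine-tuned for each suite
for suite in libero_spatial libero_object libero_goal libero_10; do
  python experiments/robot/libero/run_libero_eval_vq_vla.py \
    --pretrained_checkpoint "<PATH_TO_SUITE_CHECKPOINT>" \
    --task_suite_name "$suite" \
    --vqvae_ckpt "<YOUR_WEIGHT_DIRECTORY>/vq-vla-weight/action_tokenizer_weight/all_data_vq.pth"
done
```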
Our work is primarily built upon OpenVLA, Pyramid Flow, VQ-BeT, vector-quantize-pytorch, Open X-Embodiment, RH20T, LIBERO, ManiSkill, and RLBench.
This repository is licensed under the MIT License; see the LICENSE file for details. For any questions, please email tonghe90[at]gmail[dot]com.
If you find our code or models useful in your work, please cite our paper:
```bibtex
@inproceedings{wang25vqvla,
  title={VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers},
  author={Wang, Yating and Zhu, Haoyi and Liu, Mingyu and Yang, Jiange and Fang, Hao-Shu and He, Tong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```