VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
📝Paper | 🌍Project Page | 🤗Model | 🛢️Data
VLA-OS is a unified framework for studying planning representations and paradigms in vision-language-action (VLA) models. Specifically, VLA-OS offers the following features:

- 🏗️ **Advanced VLA Designs**: VLA-OS integrates multiple cutting-edge VLA design elements, including multi-view historical inputs, action chunking, a separate action head, block-wise causal attention for extracting visual-language model (VLM) features, and support for both L1 and flow-matching losses within a single network architecture (a sketch of the block-wise causal mask appears after this list).
- 🔗 **Modular, Scalable VLM Backbone**: VLA-OS is agnostic to the choice of large language model or VLM; any Hugging Face LLM/VLM can be employed (see the backbone-loading sketch after this list). Our paper presents model-scalability experiments on the same LLM architecture (Qwen2.5) with different numbers of parameters.
- 🛠️ **Composable Planning Heads for Different Planning Representations**: A suite of composable planning heads is provided, covering the three task-planning representations (language reasoning, visual reasoning, and image foresight reasoning). Each head can be seamlessly attached to the VLM backbone (see the planning-head sketch after this list).
- 🔄 **Different Planning Paradigms**: Using a unified codebase, VLA-OS implements three planning paradigms: Action-Only VLA, Integrated VLA, and Hierarchical VLA, enabling flexible exploration of planning strategies.
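As a concrete illustration of block-wise causal attention for extracting VLM features, here is a minimal sketch (not the repo's implementation; the block sizes are illustrative) of a mask in which every token attends to its own block and to all earlier blocks, but never to later ones:

```python
import torch

def block_causal_mask(block_sizes):
    """Boolean (T, T) mask; True means the query token may attend to the key token."""
    total = sum(block_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        end = start + size
        # Each block attends to itself and to all earlier blocks.
        mask[start:end, :end] = True
        start = end
    return mask

# Example: 4 vision tokens, 3 language tokens, 2 planning/action query tokens.
print(block_causal_mask([4, 3, 2]).int())
```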
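Because the backbone is a standard Hugging Face model, swapping architecture or scale amounts to changing the model identifier. A minimal sketch using the public `transformers` API (the exact Qwen2.5 checkpoint used in the paper may differ from the one shown here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any Hugging Face LLM/VLM identifier works here; scaling experiments only swap the size suffix.
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
backbone = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```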
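Conceptually, a planning head is a small module that consumes shared VLM features, so different planning representations can be mixed and matched without touching the backbone. A minimal sketch with hypothetical class names and dimensions (not the repo's actual interfaces):

```python
import torch
import torch.nn as nn

class LanguagePlanningHead(nn.Module):
    """Predicts language-plan token logits from VLM features (hypothetical)."""
    def __init__(self, feature_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(feature_dim, vocab_size)

    def forward(self, vlm_features):
        return self.proj(vlm_features)          # (B, T, vocab_size)

class ImageForesightHead(nn.Module):
    """Predicts latent tokens of a goal image (e.g. for a VAE decoder) from VLM features (hypothetical)."""
    def __init__(self, feature_dim, latent_dim):
        super().__init__()
        self.proj = nn.Linear(feature_dim, latent_dim)

    def forward(self, vlm_features):
        return self.proj(vlm_features)

vlm_features = torch.randn(2, 16, 896)          # dummy backbone features (B, T, D)
heads = nn.ModuleDict({
    "language": LanguagePlanningHead(896, 32000),
    "foresight": ImageForesightHead(896, 32),
})
plans = {name: head(vlm_features) for name, head in heads.items()}
```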
This repo is an official PyTorch implementation of VLA-OS, containing:
- 🛠️ VLA-OS model implementation.
- 🤗 VLA-OS datasets for LIBERO, The Colosseum, FurnitureBench, DexArt, PerAct2, and Real-World Deformable Object Manipulation tasks.
- 🤗 Checkpoints of VLA-OS.
- 📈 Training scripts (with the DeepSpeed accelerator for VLA and FSDP for VLM).
- 🤖 Data transformation scripts for your own dataset.
- 🕹️ Planning data labeling scripts for your custom dataset.
The following guides cover installation, VLM pretraining, VLA training, and training on your own dataset.
- [2025/06/24] 🔥 Training Code released!
- [ ] Add training code for continual learning
- [ ] Add evaluation code
This installation guide assumes an NVIDIA A100 80GB GPU with CUDA 12.6.
```bash
# Clone this repo
git clone git@github.com:HeegerGao/VLA-OS.git

# Create a Conda environment
conda create -n vla python=3.10
conda activate vla

# Install PyTorch
pip3 install torch torchvision torchaudio

# Install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -r requirements.txt
pip install -e .
cd ..

# Install dlimp
git clone https://github.com/kvablack/dlimp
cd dlimp
pip install -e .
cd ..

# Install Flash Attention
pip install flash-attn --no-build-isolation

# Install other prerequisites (run from the VLA-OS repo root)
pip install -r requirements.txt
```
- Download the llava-v1.5-instruct dataset. You can download it following the Prismatic-VLMs instructions, then move the unzipped dataset folder:

  ```bash
  cd VLA-OS
  mkdir dataset
  mv YOUR_llava-v1.5-instruct_FOLDER dataset
  ```

- Train the VLM:

  ```bash
  bash commands/pertrain_vlm.sh --config=config/train_vlm.yaml
  ```

  If you do not want to train the VLM yourself, you can directly download the pretrained VLM checkpoint here.
- Put your pretrained VLM checkpoint under `runs/qwen25-dinosiglip-224px+0_5b+stage-finetune+x42/checkpoints/latest-checkpoint.pt`.
- Download the pretrained VAE (`infinity_vae_d32reg.pth`) for Image Foresight Planning from here.
- Download the training datasets from our Hugging Face dataset repo.
- Run the training script with the corresponding config. For example, to train an Integrated VLA on LIBERO-10:

  ```bash
  bash commands/train_vla.sh --config=config/libero/libero_10/train_integrated_vla.yaml
  ```

Note that if you want to train a Hierarchical VLA, you should first train an Integrated VLA to obtain the high-level checkpoint, and then train the low-level action head with it. Please refer to our paper for more details.
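As a rough illustration of this two-stage recipe (hypothetical module names, shapes, and checkpoint path, not the repo's API), the high-level planner obtained from the Integrated VLA is frozen while only the low-level action head is optimized:

```python
import torch
import torch.nn as nn

class HighLevelPlanner(nn.Module):
    """Stand-in for the Integrated VLA backbone + planning heads (hypothetical)."""
    def __init__(self, dim=896):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, obs_feat):
        return self.net(obs_feat)                       # plan features

class LowLevelActionHead(nn.Module):
    """Predicts an action chunk from high-level plan features (hypothetical)."""
    def __init__(self, dim=896, action_dim=7, chunk=8):
        super().__init__()
        self.net = nn.Linear(dim, action_dim * chunk)

    def forward(self, plan_feat):
        return self.net(plan_feat)

planner = HighLevelPlanner()
# Stage-1 (Integrated VLA) weights would be loaded here, e.g.:
# planner.load_state_dict(torch.load("integrated_vla_high_level.pt"))  # hypothetical path
planner.requires_grad_(False).eval()                    # freeze the high-level module

action_head = LowLevelActionHead()
optimizer = torch.optim.AdamW(action_head.parameters(), lr=1e-4)

obs_feat = torch.randn(4, 896)                           # dummy observation features
target_actions = torch.randn(4, 7 * 8)                   # dummy action-chunk targets

with torch.no_grad():
    plan_feat = planner(obs_feat)                        # high-level plan, no gradients
loss = nn.functional.l1_loss(action_head(plan_feat), target_actions)
loss.backward()
optimizer.step()
```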
Please refer to this repo for more instructions.
If you find our work helpful, please cite us:
```bibtex
@article{gao2025vlaos,
  title   = {VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models},
  author  = {Gao, Chongkai and Liu, Zixuan and Chi, Zhenghao and Huang, Junshan and Fei, Xin and Hou, Yiwen and Zhang, Yuxuan and Lin, Yudi and Fang, Zhirui and Jiang, Zeyu and Shao, Lin},
  journal = {arXiv preprint arXiv:2506.17561},
  year    = {2025},
  url     = {https://arxiv.org/abs/2506.17561}
}
```
Thank you!
All code, model weights, and data are licensed under the MIT License.