VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
📝Paper | 🌍Project Page | 🤗Model | 🛢️Data
VLA-OS is a unified framework for studying planning representations and paradigms in vision-language-action (VLA) models. Specifically, VLA-OS offers the following features:

- 🏗️ **Advanced VLA Designs**: VLA-OS integrates multiple cutting-edge VLA design elements, including multi-view historical inputs, action chunking, a separate action head, block-wise causal attention for extracting visual-language model (VLM) features, and support for both L1 and flow-matching losses within a single network architecture (a sketch of the block-wise causal mask appears after this list).
- 🔗 **Modular, Scalable VLM Backbone**: VLA-OS is agnostic to the choice of large language model or VLM; any Hugging Face LLM/VLM can be employed (see the backbone-loading sketch after this list). Our paper presents model-scalability experiments on the same LLM architecture (Qwen2.5) with different numbers of parameters.
- 🛠️ **Composable Planning Heads for Different Planning Representations**: A suite of composable planning heads is provided, covering the three task-planning representations (language reasoning, visual reasoning, and image foresight reasoning). Each head can be seamlessly attached to the VLM backbone (see the planning-head sketch after this list).
- 🔄 **Different Planning Paradigms**: Using a unified codebase, VLA-OS implements three planning paradigms: Action-Only VLA, Integrated VLA, and Hierarchical VLA, enabling flexible exploration of planning strategies.
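As a concrete illustration of block-wise causal attention for extracting VLM features, here is a minimal sketch (not the repo's implementation; the block sizes are illustrative) of a mask in which every token attends to its own block and to all earlier blocks, but never to later ones:

```python
import torch

def block_causal_mask(block_sizes):
    """Boolean (T, T) mask; True means the query token may attend to the key token."""
    total = sum(block_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        end = start + size
        # Each block attends to itself and to all earlier blocks.
        mask[start:end, :end] = True
        start = end
    return mask

# Example: 4 vision tokens, 3 language tokens, 2 planning/action query tokens.
print(block_causal_mask([4, 3, 2]).int())
```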
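Because the backbone is a standard Hugging Face model, swapping architecture or scale amounts to changing the model identifier. A minimal sketch using the public `transformers` API (the exact Qwen2.5 checkpoint used in the paper may differ from the one shown here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any Hugging Face LLM/VLM identifier works here; scaling experiments only swap the size suffix.
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
backbone = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```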
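Conceptually, a planning head is a small module that consumes shared VLM features, so different planning representations can be mixed and matched without touching the backbone. A minimal sketch with hypothetical class names and dimensions (not the repo's actual interfaces):

```python
import torch
import torch.nn as nn

class LanguagePlanningHead(nn.Module):
    """Predicts language-plan token logits from VLM features (hypothetical)."""
    def __init__(self, feature_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(feature_dim, vocab_size)

    def forward(self, vlm_features):
        return self.proj(vlm_features)          # (B, T, vocab_size)

class ImageForesightHead(nn.Module):
    """Predicts latent tokens of a goal image (e.g. for a VAE decoder) from VLM features (hypothetical)."""
    def __init__(self, feature_dim, latent_dim):
        super().__init__()
        self.proj = nn.Linear(feature_dim, latent_dim)

    def forward(self, vlm_features):
        return self.proj(vlm_features)

vlm_features = torch.randn(2, 16, 896)          # dummy backbone features (B, T, D)
heads = nn.ModuleDict({
    "language": LanguagePlanningHead(896, 32000),
    "foresight": ImageForesightHead(896, 32),
})
plans = {name: head(vlm_features) for name, head in heads.items()}
```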
This repo is an official PyTorch implementation of VLA-OS, containing:
- 🛠️ VLA-OS model implementation.
- 🤗 VLA-OS datasets for LIBERO, The Colosseum, FurnitureBench, DexArt, PerAct2, and Real-World Deformable Object Manipulation tasks.
- 🤗 Checkpoints of VLA-OS.
- 📈 Training scripts (with the DeepSpeed accelerator for VLA and FSDP for VLM).
- 🤖 Data transformation scripts for your own dataset.
- 🕹️ Planning data labeling scripts for your custom dataset.
The following guides cover installation, VLM pretraining, VLA training, and training on your own dataset.
- [2025/06/24] 🔥 Training Code released!
- [ ] Add training code for continual learning
- [ ] Add evaluation code
This installation guide assumes an NVIDIA A100 80GB GPU with CUDA 12.6.
```bash
# Clone this repo
git clone git@github.com:HeegerGao/VLA-OS.git

# Create a Conda environment
conda create -n vla python=3.10
conda activate vla

# Install PyTorch
pip3 install torch torchvision torchaudio

# Install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -r requirements.txt
pip install -e .
cd ..

# Install dlimp
git clone https://github.com/kvablack/dlimp
cd dlimp
pip install -e .
cd ..

# Install Flash Attention
pip install flash-attn --no-build-isolation

# Install other prerequisites (run from the VLA-OS repo root)
pip install -r requirements.txt
```
- Download the llava-v1.5-instruct dataset. You can download it following the Prismatic-VLMs instructions, then move the unzipped dataset folder:

  ```bash
  cd VLA-OS
  mkdir dataset
  mv YOUR_llava-v1.5-instruct_FOLDER dataset
  ```

- Train the VLM:

  ```bash
  bash commands/pertrain_vlm.sh --config=config/train_vlm.yaml
  ```

  If you do not want to train the VLM yourself, you can directly download the pretrained VLM checkpoint here.
- Put your pretrained VLM checkpoint under `runs/qwen25-dinosiglip-224px+0_5b+stage-finetune+x42/checkpoints/latest-checkpoint.pt`.
- Download the pretrained VAE (`infinity_vae_d32reg.pth`) for Image Foresight Planning from here.
- Download the training datasets from our Hugging Face dataset repo.
- Run the training script with the corresponding config. For example, to train an Integrated VLA on LIBERO-10:

  ```bash
  bash commands/train_vla.sh --config=config/libero/libero_10/train_integrated_vla.yaml
  ```

Note that if you want to train a Hierarchical VLA, you should first train an Integrated VLA to obtain the high-level checkpoint, and then train the low-level action head with it. Please refer to our paper for more details.
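As a rough illustration of this two-stage recipe (hypothetical module names, shapes, and checkpoint path, not the repo's API), the high-level planner obtained from the Integrated VLA is frozen while only the low-level action head is optimized:

```python
import torch
import torch.nn as nn

class HighLevelPlanner(nn.Module):
    """Stand-in for the Integrated VLA backbone + planning heads (hypothetical)."""
    def __init__(self, dim=896):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, obs_feat):
        return self.net(obs_feat)                       # plan features

class LowLevelActionHead(nn.Module):
    """Predicts an action chunk from high-level plan features (hypothetical)."""
    def __init__(self, dim=896, action_dim=7, chunk=8):
        super().__init__()
        self.net = nn.Linear(dim, action_dim * chunk)

    def forward(self, plan_feat):
        return self.net(plan_feat)

planner = HighLevelPlanner()
# Stage-1 (Integrated VLA) weights would be loaded here, e.g.:
# planner.load_state_dict(torch.load("integrated_vla_high_level.pt"))  # hypothetical path
planner.requires_grad_(False).eval()                    # freeze the high-level module

action_head = LowLevelActionHead()
optimizer = torch.optim.AdamW(action_head.parameters(), lr=1e-4)

obs_feat = torch.randn(4, 896)                           # dummy observation features
target_actions = torch.randn(4, 7 * 8)                   # dummy action-chunk targets

with torch.no_grad():
    plan_feat = planner(obs_feat)                        # high-level plan, no gradients
loss = nn.functional.l1_loss(action_head(plan_feat), target_actions)
loss.backward()
optimizer.step()
```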
Please refer to this repo for more instructions.
If you find our work helpful, please cite us:
```bibtex
@article{gao2025vlaos,
  title   = {VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models},
  author  = {Gao, Chongkai and Liu, Zixuan and Chi, Zhenghao and Huang, Junshan and Fei, Xin and Hou, Yiwen and Zhang, Yuxuan and Lin, Yudi and Fang, Zhirui and Jiang, Zeyu and Shao, Lin},
  journal = {arXiv preprint arXiv:2506.17561},
  year    = {2025},
  url     = {https://arxiv.org/abs/2506.17561}
}
```
Thank you!
All code, model weights, and data are licensed under the MIT License.