VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models

📝Paper | 🌍Project Page | 🤗Model | 🛢️Data

VLA-OS is a unified framework for planning representations and paradigms research in vision-language-action (VLA) models. Specifically, VLA-OS offers the following features:

  • 🏗️ Advanced VLA Designs
    VLA-OS integrates multiple cutting-edge VLA design elements, including support for multi-view historical inputs, action chunking, a separate action head, block-wise causal attention for extracting vision-language model (VLM) features, and support for both L1 loss and flow-matching loss within a single network architecture (a sketch of such a dual-loss head follows this overview).

  • 🔗 Modular, Scalable VLM Backbone
    VLA-OS is agnostic to the choice of large language or vision-language model: any Hugging Face LLM/VLM can be used as the backbone. Our paper presents model-scalability experiments on the same LLM architecture (Qwen2.5) at different parameter counts.

  • 🛠️ Composable Planning Heads for Different Planning Representations
    A suite of composable planning heads is provided for different task planning representations: language reasoning, visual reasoning, and image foresight reasoning. Each of them can be seamlessly attached to the VLM backbone.

  • 🔄 Different Planning Paradigms
    Using a unified codebase, VLA-OS implements three planning paradigms: Action-Only VLA, Integrated VLA, and Hierarchical VLA, enabling flexible exploration of planning strategies.

This repo is the official PyTorch implementation of VLA-OS. The guides below cover installation, VLM pretraining, VLA training, and training on your own dataset.
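To make the dual-loss design concrete, here is a minimal, self-contained sketch of an action head that regresses an action chunk from pooled VLM features with either a plain L1 loss or a flow-matching (rectified-flow) loss. The class, argument names, and network sizes are illustrative assumptions, not the repository's actual modules:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionHead(nn.Module):
    """Illustrative action head: predicts an action chunk from VLM features
    with either a plain L1 loss or a flow-matching (rectified-flow) loss."""

    def __init__(self, feat_dim, act_dim, chunk_len, loss_type="l1"):
        super().__init__()
        assert loss_type in ("l1", "flow")
        self.loss_type = loss_type
        out_dim = act_dim * chunk_len
        # The flow-matching variant additionally conditions on the noisy
        # action chunk and the scalar timestep.
        in_dim = feat_dim + (out_dim + 1 if loss_type == "flow" else 0)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, feats, actions):
        # feats:   (B, feat_dim) pooled VLM features
        # actions: (B, chunk_len, act_dim) ground-truth action chunk
        target = actions.flatten(1)
        if self.loss_type == "l1":
            return F.l1_loss(self.mlp(feats), target)
        # Flow matching: regress the constant velocity from noise to data
        # along a straight interpolation path.
        t = torch.rand(target.shape[0], 1, device=feats.device)
        noise = torch.randn_like(target)
        x_t = (1.0 - t) * noise + t * target
        velocity = target - noise
        pred_v = self.mlp(torch.cat([feats, x_t, t], dim=-1))
        return F.mse_loss(pred_v, velocity)

# Example: L1 head on 8-step chunks of 7-DoF actions and 1024-d features
head = ActionHead(feat_dim=1024, act_dim=7, chunk_len=8, loss_type="l1")
loss = head(torch.randn(4, 1024), torch.randn(4, 8, 7))

At inference time, the flow-matching variant would integrate the learned velocity field from noise to an action chunk over a few Euler steps; refer to the paper for the actual architectures and hyperparameters.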

📰 News

  • [2025/06/24] 🔥 Training Code released!

TODO

  • [ ] Add training code for continual learning
  • [ ] Add evaluation code

Installation

This installation guide targets an NVIDIA A100 80GB with CUDA 12.6.

# Clone this repo
git clone git@github.com:HeegerGao/VLA-OS.git

# Create a Conda environment
conda create -n vla python=3.10
conda activate vla

# Install PyTorch
pip3 install torch torchvision torchaudio

# Install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -r requirements.txt
pip install -e .
cd ..

# Install dlimp
git clone https://github.com/kvablack/dlimp
cd dlimp
pip install -e .
cd ..

# Install Flash Attention
pip install flash-attn --no-build-isolation

# Install the remaining prerequisites of VLA-OS
pip install -r VLA-OS/requirements.txt
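A quick sanity check (a minimal sketch, not part of the repository) can confirm that PyTorch sees the GPU and that flash-attn built correctly:

import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash-attn:", getattr(flash_attn, "__version__", "installed"))
except ImportError:
    print("flash-attn did not import; re-run the flash-attn install step")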

VLM Pretraining

  1. Download the llava-v1.5-instruct Dataset

You can download it by following the Prismatic-VLMs instructions, then move the unzipped dataset folder:

cd VLA-OS
mkdir dataset
mv YOUR_llava-v1.5-instruct_FOLDER dataset
  2. Train the VLM

bash commands/pertrain_vlm.sh --config=config/train_vlm.yaml

If you do not want to train the VLM by yourself, you can directly download the pretrained VLM checkpoint here.
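Before launching pretraining, you may want to confirm that the dataset ended up where the training config expects it. A minimal check, assuming the unzipped folder keeps the name llava-v1.5-instruct:

from pathlib import Path

# Assumed location based on the steps above; the folder name follows the
# dataset name and may differ on your machine.
data_dir = Path("dataset/llava-v1.5-instruct")
print("found" if data_dir.is_dir() else "missing", data_dir)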

VLA Training

  1. Place your pretrained VLM checkpoint at runs/qwen25-dinosiglip-224px+0_5b+stage-finetune+x42/checkpoints/latest-checkpoint.pt.

  2. Download the pretrained VAE (infinity_vae_d32reg.pth) for Image Foresight Planning from here.

  3. Download the training datasets from our Hugging Face dataset repo.

  4. Run the training script with the corresponding config. For example, to train the Integrated VLA on LIBERO-10:

bash commands/train_vla.sh --config=config/libero/libero_10/train_integrated_vla.yaml

Note that if you want to train the Hierarchical VLA, you should first train an Integrated VLA to obtain the high-level checkpoint, and then train the low-level action head on top of it. Please refer to our paper for more details.
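Before launching a run, a quick pre-flight check of the files from steps 1-4 can save a failed job. A minimal sketch; the VAE and dataset locations are assumptions based on the example above, so adjust them to match your own config:

from pathlib import Path

# Paths taken from the steps above. Where you place the VAE and the datasets
# depends on your config, so treat these entries as placeholders to adapt.
required = [
    "runs/qwen25-dinosiglip-224px+0_5b+stage-finetune+x42/checkpoints/latest-checkpoint.pt",
    "infinity_vae_d32reg.pth",
    "config/libero/libero_10/train_integrated_vla.yaml",
]
missing = [p for p in required if not Path(p).exists()]
if missing:
    print("Missing files:", *missing)
else:
    print("All required files are in place.")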

Training on Your Own Dataset

Please refer to this repo for more instructions.

Citation

If you find our work helpful, please cite us:

@article{gao2025vlaos,
  title   = {VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models},
  author  = {Gao, Chongkai and Liu, Zixuan and Chi, Zhenghao and Huang, Junshan and Fei, Xin and Hou, Yiwen and Zhang, Yuxuan and Lin, Yudi and Fang, Zhirui and Jiang, Zeyu and Shao, Lin},
  journal = {arXiv preprint arXiv:2506.17561},
  year    = {2025},
  url     = {https://arxiv.org/abs/2506.17561}
}

Thank you!

License

All the code, model weights, and data are released under the MIT license.
