
OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

[Project Page] [Paper] [Processed Datasets]

Fanqi Lin1,2,3,5*, Ruiqian Nai1,2,3,5*, Yingdong Hu1,2,3*, Jiacheng You1,2,3, Junming Zhao1,4, Yang Gao1,2,3,5

1Tsinghua University, 2Shanghai Qi Zhi Institute, 3Shanghai AI Lab, 4Fudan University, 5Spirit AI

* indicates equal contributions

🛠️ Installation

We manage Python dependencies with uv. If you haven't installed uv yet, follow the uv installation instructions to set it up.

Run the following to set up the environment:

GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

NOTE: GIT_LFS_SKIP_SMUDGE=1 is needed to pull LeRobot as a dependency.

For more details, refer to the original openpi repository.

🚀 Training OneTwoVLA

Download the datasets and place them under $LEROBOT_HOME/umi/.
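
For reference, here is a minimal download sketch using huggingface_hub; the dataset repository id is a placeholder, so substitute the actual id linked in the Data section below.

# Sketch: fetch the processed data with huggingface_hub and place it under
# $LEROBOT_HOME/umi/. The repo id below is a placeholder, not the real one.
import os
from pathlib import Path

from huggingface_hub import snapshot_download

lerobot_home = Path(os.environ["LEROBOT_HOME"])  # your LeRobot data root

snapshot_download(
    repo_id="<dataset-repo-id>",   # placeholder: see the Data section below
    repo_type="dataset",
    local_dir=lerobot_home / "umi",
)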

To train a OneTwoVLA model, run:

bash train_scripts/train_<task_name>.sh

Available tasks are:

train_scripts
|-- train_onetwovla_cocktail.sh
|-- train_onetwovla_visual_grounding.sh
|-- train_pi0_cocktail.sh
|-- train_pi0_visual_grounding.sh

🦾 Real-World Deployment

We run inference using a policy server and a hardware client. Instructions for running the policy server can be found in examples/umi/README.md, and we provide the UMI hardware client code in this repository.
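
For context, the snippet below sketches how a client can query the policy server through openpi's websocket client interface; the observation keys, shapes, and port are illustrative placeholders rather than the exact format used by the UMI client.

# Client-side sketch using openpi's websocket policy client.
# Observation keys, shapes, and the port are placeholders; see the UMI client
# code and examples/umi/README.md for the exact interface.
import numpy as np
from openpi_client import websocket_client_policy

policy = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)

observation = {
    "image": np.zeros((224, 224, 3), dtype=np.uint8),  # placeholder camera frame
    "state": np.zeros(7, dtype=np.float32),            # placeholder proprioception
    "prompt": "make me a cocktail",                     # language instruction
}
action_chunk = policy.infer(observation)["actions"]     # predicted action chunk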

📷 Data

We provide access to the following datasets:

  • Robot Datasets: Datasets for the cocktail and open-world visual grounding tasks.
  • Vision-Language Datasets: Datasets containing synthetic images and annotated reasoning for the open-world visual grounding task.

All datasets are hosted on Hugging Face. You can find them here.

We provide code for converting UMI data format to LeRobot data format here.
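
If you need to adapt the conversion to your own recordings, the general pattern follows openpi's LeRobot conversion examples; the feature names, shapes, fps, and the load_umi_episodes helper below are illustrative assumptions, not the exact schema used by the provided script.

# Illustrative conversion loop in the style of openpi's LeRobot conversion
# scripts. Feature names, shapes, and the episode loader are placeholders.
import numpy as np
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create(
    repo_id="umi/example",   # placeholder; stored under $LEROBOT_HOME/umi/...
    robot_type="umi",
    fps=10,                  # placeholder control frequency
    features={
        "image": {"dtype": "image", "shape": (224, 224, 3), "names": ["height", "width", "channel"]},
        "state": {"dtype": "float32", "shape": (7,), "names": ["state"]},
        "actions": {"dtype": "float32", "shape": (7,), "names": ["actions"]},
    },
)

for episode in load_umi_episodes("path/to/umi_data"):   # hypothetical loader
    for frame in episode["frames"]:
        dataset.add_frame(
            {
                "image": frame["image"],
                "state": frame["state"].astype(np.float32),
                "actions": frame["action"].astype(np.float32),
            }
        )
    dataset.save_episode(task=episode["language_instruction"])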

Synthetic Image Augmentation

To make the synthetic images more closely resemble real robot observations, we randomly apply several augmentations, including random fisheye distortion and compositing a robot gripper with adaptive brightness adjustments. The implementation is available in scripts/augment_vl_data/augment.py.
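
As a rough illustration (a simplified sketch, not the repository's exact implementation), the two augmentations can be approximated with OpenCV as follows:

# Simplified sketch of the two augmentations; see scripts/augment_vl_data/augment.py
# for the actual implementation.
import cv2
import numpy as np


def random_fisheye(img: np.ndarray, strength: float = 0.4) -> np.ndarray:
    """Warp a rectilinear image with a simple radial (barrel) distortion."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # Normalized coordinates centered on the image.
    x = (xs - w / 2) / (w / 2)
    y = (ys - h / 2) / (h / 2)
    r = np.sqrt(x**2 + y**2)
    factor = 1 + strength * r**2            # stronger pull toward the center at the edges
    map_x = x * factor * (w / 2) + w / 2
    map_y = y * factor * (h / 2) + h / 2
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)


def composite_gripper(img: np.ndarray, gripper_rgba: np.ndarray) -> np.ndarray:
    """Paste a gripper crop (RGBA) onto the frame, matching its brightness."""
    gripper = cv2.resize(gripper_rgba, (img.shape[1], img.shape[0]))
    rgb = gripper[..., :3].astype(np.float32)
    alpha = gripper[..., 3:4].astype(np.float32) / 255.0
    # Adaptive brightness: scale the gripper toward the scene's mean intensity.
    mask = alpha[..., 0] > 0
    scale = img.mean() / max(float(rgb[mask].mean()), 1e-6) if mask.any() else 1.0
    rgb = np.clip(rgb * scale, 0, 255)
    out = img.astype(np.float32) * (1 - alpha) + rgb * alpha
    return out.astype(np.uint8)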

Here we show an example. From left to right: the original image, the image with fisheye distortion, the image with a composited robot gripper (with adaptive brightness adjustment), and the image with both augmentations applied.

🙏 Acknowledgements

We express our sincere gratitude to the developers of openpi for open-sourcing their code.

