- [2025.06.23] Released WorldVLA models, training code, and evaluation code for the LIBERO action generation benchmark.
WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.
Action Model generates actions given the text instruction and image observations.
Demo GIFs (inputs): "Open the middle drawer of the cabinet." · "Pick up the alphabet soup and place it in the basket." · "Pick up the black bowl between the plate and the ramekin and place it on the plate."
World Model generates the next frame given the current frame and action control.
Demo GIFs (inputs): action sequence of "Open the top drawer and put the bowl inside" · action sequence of "Push the plate to the front of the stove" · action sequence of "Put the bowl on the stove".
conda env create -f environment.yml
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
Model (256 * 256) | HF Link | Success Rate (%) |
---|---|---|
LIBERO-Spatial | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial | 85.6 |
LIBERO-Object | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object | 89.0 |
LIBERO-Goal | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal | 82.6 |
LIBERO-Long | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10 | 59.0 |
Model (512 * 512) | HF Link | Success Rate (%) |
---|---|---|
LIBERO-Spatial | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial | 87.6 |
LIBERO-Object | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object | 96.2 |
LIBERO-Goal | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal | 83.4 |
LIBERO-Long | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10 | 60.0 |
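To pull one of these checkpoints locally, a minimal sketch with `huggingface_hub` is shown below. It assumes the checkpoints are hosted as subfolders of a single `Alibaba-DAMO-Academy/WorldVLA` repository, matching the paths in the tables above; adjust `allow_patterns` and `local_dir` to the suite and resolution you need.

```python
# Minimal download sketch (assumption: checkpoints live as subfolders of one HF repo).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/WorldVLA",
    allow_patterns=["model_256/libero_goal/*"],   # pick the suite / resolution you need
    local_dir="worldvla/ckpts/worldvla_libero_goal_256",
)
print("checkpoint downloaded to", local_path)
```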
We evaluate four task suites of the LIBERO benchmark (spatial, object, goal, 10) at two image resolutions (256 and 512). Below, we take LIBERO-Goal at 256 resolution as an example.
First, filter out the no-operation actions, following OpenVLA.
cd worldvla/libero_util
python regenerate_libero_dataset_filter_no_op.py \
--libero_task_suite libero_goal \
--libero_raw_data_dir ../processed_data/Libero/libero_goal \
--libero_target_dir ../processed_data/libero_goal_no_noops_t_256 \
--image_resolution 256
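The script drops demonstration steps whose actions are effectively idle, following OpenVLA's no-op filtering. A minimal sketch of the idea is below; it assumes 7-D actions (6 delta-pose dimensions plus a gripper dimension), and the exact thresholds and fields used by `regenerate_libero_dataset_filter_no_op.py` may differ.

```python
# Sketch of no-op filtering (assumption: 7-D actions = 6 delta-pose dims + 1 gripper dim).
import numpy as np

def filter_no_op_steps(actions, eps=1e-3):
    """Keep indices whose delta-pose is non-negligible or whose gripper command changes."""
    keep = []
    prev_gripper = actions[0][-1]
    for i, a in enumerate(actions):
        moved = np.linalg.norm(a[:6]) > eps
        gripper_changed = a[-1] != prev_gripper
        if moved or gripper_changed:
            keep.append(i)
        prev_gripper = a[-1]
    return keep
```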
Then, save all images and actions.
python regenerate_libero_dataset_save_img_action.py \
--libero_task_suite libero_goal \
--raw_data_dir ../processed_data/libero_goal_no_noops_t_256 \
--save_dir ../processed_data/libero_goal_img_action_256
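After this step, each trajectory directory holds per-step frames under `imgs/` and per-step actions under `action/` (the paths mirror the conversation examples below). A quick sanity check, assuming the 7-D LIBERO action format:

```python
# Quick sanity check of the saved data (paths follow the examples below; the action shape is an assumption).
import numpy as np
from PIL import Image

traj = "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0"
img = Image.open(f"{traj}/imgs/image_0.png")
act = np.load(f"{traj}/action/action_1.npy")
print(img.size, act.shape)   # e.g. (256, 256) and (7,)
```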
Next, generate the conversation data for the Chameleon model. The action model conversations are in the following format:
{
  "conversations": [
    {
      "from": "human",
      "value": "What action should the robot take to open the middle drawer of the cabinet?<|image|><|image|>"
    },
    {
      "from": "gpt",
      "value": "<|action|><|action|><|action|><|action|><|action|>"
    }
  ],
  "image": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs/image_0.png",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs/image_1.png"
  ],
  "action": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_1.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_2.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_3.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_4.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_5.npy"
  ]
}
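As a reference, here is a minimal sketch of how one such action-model sample could be assembled from a trajectory directory, using 2 history images and 5 action chunks to match the `--his 2 --len_action 5` options used below. The repo's `action_model_conv_generation.py` is the authoritative implementation; the indexing here simply mirrors the example above.

```python
# Sketch: build one action-model conversation entry (2 history images, 5 future actions).
def make_action_sample(traj_dir, instruction, t, his=2, len_action=5):
    images = [f"{traj_dir}/imgs/image_{t + i}.png" for i in range(his)]
    actions = [f"{traj_dir}/action/action_{t + his - 1 + i}.npy" for i in range(len_action)]
    return {
        "conversations": [
            {"from": "human",
             "value": f"What action should the robot take to {instruction}?" + "<|image|>" * his},
            {"from": "gpt", "value": "<|action|>" * len_action},
        ],
        "image": images,
        "action": actions,
    }
```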
The world model conversations are in the following format:
{
  "conversations": [
    {
      "from": "human",
      "value": "Generate the next image based on the provided sequence of historical images and corresponding actions.<|image|><|action|>"
    },
    {
      "from": "gpt",
      "value": "<|image|>"
    }
  ],
  "image": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_5/imgs/image_0.png",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_5/imgs/image_1.png"
  ],
  "action": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_5/action/action_0.npy"
  ]
}
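A world-model sample is built analogously: `--his` historical frames and their actions as input, with the next frame as the target. A minimal sketch following the example above (his = 1):

```python
# Sketch: build one world-model conversation entry (1 history frame + action -> next frame).
def make_world_sample(traj_dir, t, his=1):
    return {
        "conversations": [
            {"from": "human",
             "value": "Generate the next image based on the provided sequence of historical "
                      "images and corresponding actions." + "<|image|><|action|>" * his},
            {"from": "gpt", "value": "<|image|>"},
        ],
        "image": [f"{traj_dir}/imgs/image_{t + i}.png" for i in range(his + 1)],  # last one is the target frame
        "action": [f"{traj_dir}/action/action_{t + i}.npy" for i in range(his)],
    }
```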
To validate world model performance, we split the whole LIBERO dataset into train / val_ind / val_ood JSON files.
cd worldvla/data
python action_model_conv_generation.py \
--base_dir ../processed_data/libero_goal_img_action_256 \
--his 2 \
--len_action 5 \
--task_name goal \
--resolution 256 \
--output_dir ../processed_data/convs
python world_model_conv_generation.py \
--base_dir ../processed_data/libero_goal_img_action_256 \
--his 1 \
--task_name goal \
--resolution 256 \
--output_dir ../processed_data/convs
Finally, tokenize all the conversations and save the resulting token files.
cd worldvla/data
python pretoken.py --task goal --resolution 256
./concate_record.sh
python concate_action_world_model_data.py --task goal --resolution 256
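`pretoken.py` converts each conversation into a discrete token sequence using the Chameleon tokenizer and saves it for training. For illustration only, the sketch below shows one common way to discretize continuous actions into token bins (uniform binning, as in OpenVLA); this is an assumption about what the `<|action|>` placeholders expand to, not necessarily the repo's exact scheme.

```python
# Illustration only: uniform binning of continuous actions into discrete token ids.
# This is an assumed scheme (OpenVLA-style); the repo's pretoken.py may tokenize differently.
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    """Map each action dimension to an integer bin in [0, n_bins - 1]."""
    norm = (np.asarray(action) - low) / (high - low)          # normalize to [0, 1]
    bins = np.clip((norm * n_bins).astype(int), 0, n_bins - 1)
    return bins.tolist()
```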
Set the correct data path in the config files in `worldvla/configs/libero_256_all` and `worldvla/exps_512_all`.
Download the Chameleon tokenizer and starting-point weights, and put them under `worldvla/ckpts/chameleon/tokenizer` and `worldvla/ckpts/starting_point`, respectively.
Now you can start training with the provided training scripts:
# Libero goal, 256 resolution
cd worldvla/exps_256_all
bash 7B_ts_his_2_img_only_goal_ck_5_1a2i_all.sh
# Libero goal, 512 resolution
cd worldvla/exps_512_all
bash 7B_ts_his_2_img_only_goal_ck_5_1a2i_all.sh
Set the `--resume_path` in `worldvla/exps_256_all/eval_libero_7B_2_action_all_epochs_img_only_ck_5_1a2i_goal.sh` to the model path. You can download our trained models from the Model Zoo or train your own.
# Libero goal, 256 resolution
cd worldvla/exps_256_all
bash eval_libero_7B_2_action_all_epochs_img_only_ck_5_1a2i_goal.sh
# Libero goal, 512 resolution
cd worldvla/exps_512_all
bash eval_libero_7B_2_action_all_epochs_img_only_ck_5_1a2i_goal.sh
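Under the hood, evaluation rolls out the trained policy in the LIBERO simulator and reports the success rate over all tasks and initial states of a suite. A rough sketch of such a rollout loop with the LIBERO API is below; `policy` is a hypothetical stand-in for WorldVLA action-model inference (checkpoint loading, prompting, and action decoding are handled by the repo's eval scripts).

```python
# Rough rollout sketch using the LIBERO API; `policy` is a hypothetical stand-in.
import os
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

def policy(image, instruction):            # hypothetical: replace with WorldVLA inference
    raise NotImplementedError("load the trained WorldVLA action model here")

task_suite = benchmark.get_benchmark_dict()["libero_goal"]()
successes, trials = 0, 0
for task_id in range(task_suite.n_tasks):
    task = task_suite.get_task(task_id)
    bddl = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
    env = OffScreenRenderEnv(bddl_file_name=bddl, camera_heights=256, camera_widths=256)
    for init_state in task_suite.get_task_init_states(task_id):
        env.reset()
        obs = env.set_init_state(init_state)
        done = False
        for _ in range(300):                # max steps per episode (sketch value)
            action = policy(obs["agentview_image"], task.language)
            obs, reward, done, info = env.step(action)
            if done:
                break
        successes += int(done)
        trials += 1
    env.close()
print(f"Success rate: {100.0 * successes / trials:.1f}%")
```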
- Release the code of the action model on the LIBERO benchmark.
- Release the code of the world model on the LIBERO dataset.
- Release the code of the real-world experiment.
All assets and code are under the Apache 2.0 license unless specified otherwise.
If you find the project helpful for your research, please consider citing our paper:
@article{cen2025worldvla,
title={WorldVLA: Towards Autoregressive Action World Model},
author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
journal={arXiv preprint arXiv:2506.21539},
year={2025}
}
This project builds upon Lumina-mGPT, Chameleon, and OpenVLA. We thank these teams for their open-source contributions.