- [2025.06.23] Released WorldVLA models, training code, and evaluation code for the LIBERO action generation benchmark.
WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.
Action Model generates actions given the text instruction and image observations.
Demo GIFs (inputs): "Open the middle drawer of the cabinet." · "Pick up the alphabet soup and place it in the basket." · "Pick up the black bowl between the plate and the ramekin and place it on the plate."
World Model generates the next frame given the current frame and action control.
Demo GIFs (inputs): action sequence of "Open the top drawer and put the bowl inside" · action sequence of "Push the plate to the front of the stove" · action sequence of "Put the bowl on the stove".
conda env create -f environment.yml
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
Model (256 * 256) | HF Link | Success Rate (%) |
---|---|---|
LIBERO-Spatial | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial | 85.6 |
LIBERO-Object | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object | 89.0 |
LIBERO-Goal | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal | 82.6 |
LIBERO-Long | Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10 | 59.0 |
Model (512 * 512) | HF Link | Success Rate (%) |
---|---|---|
LIBERO-Spatial | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial | 87.6 |
LIBERO-Object | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object | 96.2 |
LIBERO-Goal | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal | 83.4 |
LIBERO-Long | Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10 | 60.0 |
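To pull one of these checkpoints locally, a minimal sketch with `huggingface_hub` is shown below. It assumes the checkpoints are hosted as subfolders of a single `Alibaba-DAMO-Academy/WorldVLA` repository, matching the paths in the tables above; adjust `allow_patterns` and `local_dir` to the suite and resolution you need.

```python
# Minimal download sketch (assumption: checkpoints live as subfolders of one HF repo).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/WorldVLA",
    allow_patterns=["model_256/libero_goal/*"],   # pick the suite / resolution you need
    local_dir="worldvla/ckpts/worldvla_libero_goal_256",
)
print("checkpoint downloaded to", local_path)
```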
We evaluate four task suites of the LIBERO benchmark (spatial, object, goal, 10) at two image resolutions (256 and 512). Below, we take LIBERO-Goal at 256 resolution as an example.
First, filter out the no-operation actions, following OpenVLA.
cd worldvla/libero_util
python regenerate_libero_dataset_filter_no_op.py \
--libero_task_suite libero_goal \
--libero_raw_data_dir ../processed_data/Libero/libero_goal \
--libero_target_dir ../processed_data/libero_goal_no_noops_t_256 \
--image_resolution 256
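The script drops demonstration steps whose actions are effectively idle, following OpenVLA's no-op filtering. A minimal sketch of the idea is below; it assumes 7-D actions (6 delta-pose dimensions plus a gripper dimension), and the exact thresholds and fields used by `regenerate_libero_dataset_filter_no_op.py` may differ.

```python
# Sketch of no-op filtering (assumption: 7-D actions = 6 delta-pose dims + 1 gripper dim).
import numpy as np

def filter_no_op_steps(actions, eps=1e-3):
    """Keep indices whose delta-pose is non-negligible or whose gripper command changes."""
    keep = []
    prev_gripper = actions[0][-1]
    for i, a in enumerate(actions):
        moved = np.linalg.norm(a[:6]) > eps
        gripper_changed = a[-1] != prev_gripper
        if moved or gripper_changed:
            keep.append(i)
        prev_gripper = a[-1]
    return keep
```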
Then, save all images and actions.
python regenerate_libero_dataset_save_img_action.py \
--libero_task_suite libero_goal \
--raw_data_dir ../processed_data/libero_goal_no_noops_t_256 \
--save_dir ../processed_data/libero_goal_img_action_256
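After this step, each trajectory directory holds per-step frames under `imgs/` and per-step actions under `action/` (the paths mirror the conversation examples below). A quick sanity check, assuming the 7-D LIBERO action format:

```python
# Quick sanity check of the saved data (paths follow the examples below; the action shape is an assumption).
import numpy as np
from PIL import Image

traj = "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0"
img = Image.open(f"{traj}/imgs/image_0.png")
act = np.load(f"{traj}/action/action_1.npy")
print(img.size, act.shape)   # e.g. (256, 256) and (7,)
```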
Next, generate the conversation data for the Chameleon model. The action model conversations are in the following format:
{
  "conversations": [
    {
      "from": "human",
      "value": "What action should the robot take to open the middle drawer of the cabinet?<|image|><|image|>"
    },
    {
      "from": "gpt",
      "value": "<|action|><|action|><|action|><|action|><|action|>"
    }
  ],
  "image": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs/image_0.png",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs/image_1.png"
  ],
  "action": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_1.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_2.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_3.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_4.npy",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_5.npy"
  ]
}
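As a reference, here is a minimal sketch of how one such action-model sample could be assembled from a trajectory directory, using 2 history images and 5 action chunks to match the `--his 2 --len_action 5` options used below. The repo's `action_model_conv_generation.py` is the authoritative implementation; the indexing here simply mirrors the example above.

```python
# Sketch: build one action-model conversation entry (2 history images, 5 future actions).
def make_action_sample(traj_dir, instruction, t, his=2, len_action=5):
    images = [f"{traj_dir}/imgs/image_{t + i}.png" for i in range(his)]
    actions = [f"{traj_dir}/action/action_{t + his - 1 + i}.npy" for i in range(len_action)]
    return {
        "conversations": [
            {"from": "human",
             "value": f"What action should the robot take to {instruction}?" + "<|image|>" * his},
            {"from": "gpt", "value": "<|action|>" * len_action},
        ],
        "image": images,
        "action": actions,
    }
```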
The world model conversations are in the following format:
{
  "conversations": [
    {
      "from": "human",
      "value": "Generate the next image based on the provided sequence of historical images and corresponding actions.<|image|><|action|>"
    },
    {
      "from": "gpt",
      "value": "<|image|>"
    }
  ],
  "image": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_5/imgs/image_0.png",
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_5/imgs/image_1.png"
  ],
  "action": [
    "../processed_data/libero_goal_img_action_256/open_the_middle_drawer_of_the_cabinet/trj_5/action/action_0.npy"
  ]
}
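A world-model sample is built analogously: `--his` historical frames and their actions as input, with the next frame as the target. A minimal sketch following the example above (his = 1):

```python
# Sketch: build one world-model conversation entry (1 history frame + action -> next frame).
def make_world_sample(traj_dir, t, his=1):
    return {
        "conversations": [
            {"from": "human",
             "value": "Generate the next image based on the provided sequence of historical "
                      "images and corresponding actions." + "<|image|><|action|>" * his},
            {"from": "gpt", "value": "<|image|>"},
        ],
        "image": [f"{traj_dir}/imgs/image_{t + i}.png" for i in range(his + 1)],  # last one is the target frame
        "action": [f"{traj_dir}/action/action_{t + i}.npy" for i in range(his)],
    }
```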
To validate world model performance, we split the whole LIBERO dataset into train / val_ind / val_ood JSON files.
cd worldvla/data
python action_model_conv_generation.py \
--base_dir ../processed_data/libero_goal_img_action_256 \
--his 2 \
--len_action 5 \
--task_name goal \
--resolution 256 \
--output_dir ../processed_data/convs
python world_model_conv_generation.py \
--base_dir ../processed_data/libero_goal_img_action_256 \
--his 1 \
--task_name goal \
--resolution 256 \
--output_dir ../processed_data/convs
Finally, tokenize all the conversations and save the resulting token files.
cd worldvla/data
python pretoken.py --task goal --resolution 256
./concate_record.sh
python concate_action_world_model_data.py --task goal --resolution 256
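`pretoken.py` converts each conversation into a discrete token sequence using the Chameleon tokenizer and saves it for training. For illustration only, the sketch below shows one common way to discretize continuous actions into token bins (uniform binning, as in OpenVLA); this is an assumption about what the `<|action|>` placeholders expand to, not necessarily the repo's exact scheme.

```python
# Illustration only: uniform binning of continuous actions into discrete token ids.
# This is an assumed scheme (OpenVLA-style); the repo's pretoken.py may tokenize differently.
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    """Map each action dimension to an integer bin in [0, n_bins - 1]."""
    norm = (np.asarray(action) - low) / (high - low)          # normalize to [0, 1]
    bins = np.clip((norm * n_bins).astype(int), 0, n_bins - 1)
    return bins.tolist()
```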
Set the correct data path in the config files in `worldvla/configs/libero_256_all` and `worldvla/exps_512_all`.
Download the Chameleon tokenizer and starting-point weights, and put them under `worldvla/ckpts/chameleon/tokenizer` and `worldvla/ckpts/starting_point`, respectively.
Now you can start training with the provided training scripts:
# Libero goal, 256 resolution
cd worldvla/exps_256_all
bash 7B_ts_his_2_img_only_goal_ck_5_1a2i_all.sh
# Libero goal, 512 resolution
cd worldvla/exps_512_all
bash 7B_ts_his_2_img_only_goal_ck_5_1a2i_all.sh
Set the `--resume_path` in `worldvla/exps_256_all/eval_libero_7B_2_action_all_epochs_img_only_ck_5_1a2i_goal.sh` to the model path. You can download our trained models from the Model Zoo or train your own.
# Libero goal, 256 resolution
cd worldvla/exps_256_all
bash eval_libero_7B_2_action_all_epochs_img_only_ck_5_1a2i_goal.sh
# Libero goal, 512 resolution
cd worldvla/exps_512_all
bash eval_libero_7B_2_action_all_epochs_img_only_ck_5_1a2i_goal.sh
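Under the hood, evaluation rolls out the trained policy in the LIBERO simulator and reports the success rate over all tasks and initial states of a suite. A rough sketch of such a rollout loop with the LIBERO API is below; `policy` is a hypothetical stand-in for WorldVLA action-model inference (checkpoint loading, prompting, and action decoding are handled by the repo's eval scripts).

```python
# Rough rollout sketch using the LIBERO API; `policy` is a hypothetical stand-in.
import os
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

def policy(image, instruction):            # hypothetical: replace with WorldVLA inference
    raise NotImplementedError("load the trained WorldVLA action model here")

task_suite = benchmark.get_benchmark_dict()["libero_goal"]()
successes, trials = 0, 0
for task_id in range(task_suite.n_tasks):
    task = task_suite.get_task(task_id)
    bddl = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
    env = OffScreenRenderEnv(bddl_file_name=bddl, camera_heights=256, camera_widths=256)
    for init_state in task_suite.get_task_init_states(task_id):
        env.reset()
        obs = env.set_init_state(init_state)
        done = False
        for _ in range(300):                # max steps per episode (sketch value)
            action = policy(obs["agentview_image"], task.language)
            obs, reward, done, info = env.step(action)
            if done:
                break
        successes += int(done)
        trials += 1
    env.close()
print(f"Success rate: {100.0 * successes / trials:.1f}%")
```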
- Release the code of the action model on the LIBERO benchmark.
- Release the code of the world model on the LIBERO dataset.
- Release the code of the real-world experiment.
All assets and code are under the Apache 2.0 license unless specified otherwise.
If you find the project helpful for your research, please consider citing our paper:
@article{cen2025worldvla,
title={WorldVLA: Towards Autoregressive Action World Model},
author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
journal={arXiv preprint arXiv:2506.21539},
year={2025}
}
This project builds upon Lumina-mGPT, Chameleon, and OpenVLA. We thank these teams for their open-source contributions.