
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for Visuomotor Policy Learning


📰 News

  • Mar. 11th, 2025: Added a script for training DiVLA.
  • Feb. 24th, 2025: We released our Stage 1 pretrained ScaleDP-H!!!
  • Feb. 17th, 2025: DexVLA is out! The paper can be found here. The project website can be found here.

Install

  1. Clone this repository and navigate to the diffusion-vla folder:
git clone https://github.com/juruobenruo/dexvla.git

Install Packages

conda create -n dexvla python=3.10 -y
conda activate dexvla
pip install --upgrade pip
pip install -r requirements.txt
cd policy_heads
pip install -e .

For training acceleration, please install flash-attn:

pip install flash-attn --no-build-isolation
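
To confirm that the wheel built correctly against your torch/CUDA setup, a quick import check like the following can help (a sketch only; it is not part of the repo):

# Optional sanity check that flash-attn imports cleanly against the installed torch/CUDA build.
try:
    import flash_attn
    print('flash-attn', flash_attn.__version__, 'is available')
except ImportError:
    print('flash-attn not installed; attention will fall back to the default implementation')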

Data Preparation

We provide example data here. You can download it and run the whole pipeline quickly.

  1. Our data format is the same as ACT's, so you need to convert your data into HDF5 format. You can refer to the function "generate_h5" in data_preprocess_scripts/rlds_to_h5py.py, which converts data from RLDS format to HDF5 format. The expected structure is shown below, followed by a minimal writer sketch.
# h5 data structure
root
  |-action (100,10)
  |-language_raw (1,)
  |-substep_reasonings (100,)
  |-observations
      |-images # multi-view
          |-left (100,480,640,3)
          |-right (100,480,640,3)
          |-wrist (100,480,640,3)
      |-joint_positions (100,7)
      |-qpos (100,7)
      |-qvel (100,7)
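
A minimal writer sketch for this layout is shown below, assuming one episode with 100 timesteps and using h5py; the file name, instruction text, and zero-filled arrays are placeholders.

# Sketch: write one episode into the HDF5 layout shown above (placeholder values).
import h5py
import numpy as np

T = 100                                       # number of timesteps in this episode
str_dt = h5py.string_dtype(encoding='utf-8')  # variable-length UTF-8 strings

with h5py.File('episode_0.hdf5', 'w') as root:                       # hypothetical output file
    root.create_dataset('action', data=np.zeros((T, 10), dtype=np.float32))
    root.create_dataset('language_raw',
                        data=np.array(['pick up the cup'], dtype=object), dtype=str_dt)
    root.create_dataset('substep_reasonings',
                        data=np.array(['reach toward the cup'] * T, dtype=object), dtype=str_dt)
    obs = root.create_group('observations')
    images = obs.create_group('images')
    for cam in ('left', 'right', 'wrist'):                           # multi-view cameras
        images.create_dataset(cam, data=np.zeros((T, 480, 640, 3), dtype=np.uint8))
    obs.create_dataset('joint_positions', data=np.zeros((T, 7), dtype=np.float32))
    obs.create_dataset('qpos', data=np.zeros((T, 7), dtype=np.float32))
    obs.create_dataset('qvel', data=np.zeros((T, 7), dtype=np.float32))
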
  2. You have to add one entry to constants.py to specify the path of your data, as follows (a quick consistency check is sketched after the snippet).
    'example_task_name': { # for local debug
        'dataset_dir': [
            '/path/to/task1', # define the path of the dataset
        ],
        'episode_len': 1000,  
        'camera_names': ['left', 'right', 'wrist'] # keys corresponding to the h5 data structure above
    }
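
The snippet below is a quick consistency check for the entry above; the paths and the .hdf5 file pattern are assumptions, not the repo's loader code.

# Sketch: verify that every configured camera exists in the first episode of each dataset_dir.
import glob
import os
import h5py

dataset_dirs = ['/path/to/task1']               # same paths as in the constants.py entry
camera_names = ['left', 'right', 'wrist']

for d in dataset_dirs:
    assert os.path.isdir(d), f'missing dataset dir: {d}'
    episodes = sorted(glob.glob(os.path.join(d, '*.hdf5')))          # assumed episode file pattern
    with h5py.File(episodes[0], 'r') as f:
        for cam in camera_names:
            assert cam in f['observations/images'], f'camera {cam} missing in {episodes[0]}'
print('constants.py entry is consistent with the episode data')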

🤗Download Pretrained Weights

Download official Qwen2_VL weights

We construct the VLM backbone by integrating Qwen2-VL-2B, a powerful and efficient model, into our framework. Qwen2-VL-2B serves as the core of our architecture, providing robust capabilities for vision-language tasks. We use the off-the-shelf Qwen2-VL model without any post-training on the VLM itself. You can download the official weights from this link:

Model | Link
Qwen2-VL (~2B) | huggingface

❗❗ After downloading the standard weights, you have to replace the official "config.json" with our "docs/config.json" designed for VLA.
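
If you prefer to script the download and the config swap, a minimal sketch is below; the Hugging Face repo id and local paths are assumptions (check the link above for the exact 2B variant).

# Sketch: fetch the official Qwen2-VL weights and overwrite config.json with docs/config.json.
import shutil
from huggingface_hub import snapshot_download

vlm_dir = snapshot_download(repo_id='Qwen/Qwen2-VL-2B-Instruct',     # assumed weight variant
                            local_dir='weights/Qwen2-VL-2B-Instruct')
shutil.copy('docs/config.json', f'{vlm_dir}/config.json')            # replace the official config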

Download our pretrained ScaleDP-H weights (Stage 1)

We released the pretrained ScaleDP-H weights obtained after Stage 1. You can download them and directly fine-tune on your own data in Stage 2.

Model | Link
ScaleDP-H (~1B) | huggingface
ScaleDP-L (~400M) | huggingface

🦾Train

The training scripts are "scripts/stage2_train.sh" and "scripts/stage3_train.sh". You need to change the following parameters:

  1. OUTPUT: the save directory for training, whose name must include the keyword "qwen2". If LoRA training is used, the name must also include "lora" (e.g., "qwen2_lora").
  2. task_name: the tasks used for training, which should correspond to "your_task_name" in aloha_scripts/constant.py.
  3. model_name_or_path: path to the pretrained VLM weights.

Other hyperparameters such as "batch_size" and "save_steps" can be customized according to your computational resources. Start training with the following commands:

Train Stage 2: training on a large number of tasks. The following hyperparameters must be set:

  1. load_pretrain_dit: True
  2. DIT_PRETRAIN: path to the pretrained policy head (ScaleDP).
  3. MNOP: path to the official Qwen2_vl weights (VLM backbone).
./scripts/train_dexvla_stage2.sh 

Train Stage 3: post-training on target dexterous tasks. The following hyperparameter must be set:

  1. MNOP: path to the DexVLA weights trained in Stage 2.
./scripts/train_dexvla_stage3.sh 

Evaluation

❗❗ Make sure your trained checkpoint dir contains two files: "preprocessor_config.json" and "chat_template.json". If not, please copy them from the downloaded Qwen2_vl weights or this link.

You can refer to our evaluation script smart_eval_agilex.py to evaluate your DexVLA.

⚠️ Trouble Shooting

1."TypeError: _batch_encode_plus() got an unexpected keyword argument 'images'".

Copy "preprocessor_config.json" and "chat_template.json" into your own trained DexVLA dir. And must be put in target "checkpoint-XXXX" dir.

Traceback (most recent call last):
  File "/media/rl/HDD/projects/open_dexvla_preview/train_vla.py", line 320, in <module>
    main(all_config=config, model_config=model_config)
  File "/media/rl/HDD/projects/open_dexvla_preview/train_vla.py", line 282, in main
    train_dataset, val_dataset, stats, sampler_params = load_data(dataset_dir, name_filter, camera_names, all_config['training_args'].per_device_train_batch_size,
  File "/media/rl/HDD/projects/open_dexvla_preview/data_utils/utils.py", line 337, in load_data
    train_dataset = EpisodicDataset(dataset_path_list, camera_names, norm_stats, train_episode_ids, train_episode_len, chunk_size, policy_class, robot=robot, llava_pythia_process=llava_pythia_process, data_args=config['data_args'])
  File "/media/rl/HDD/projects/open_dexvla_preview/data_utils/utils.py", line 43, in __init__
    a=self.__getitem__(0) # initialize self.is_sim and self.transformations
  File "/media/rl/HDD/projects/open_dexvla_preview/data_utils/utils.py", line 191, in __getitem__
    return self.llava_pythia_process.forward_process(sample, use_reasoning=self.data_args.use_reasoning)
  File "/media/rl/HDD/projects/open_dexvla_preview/qwen2_vla/utils/robot_data_processor.py", line 87, in forward_process
    model_inputs = self.multimodal_processor(
  File "/home/rl/miniconda3/envs/opendexvla/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3016, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/rl/miniconda3/envs/opendexvla/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3126, in _call_one
    return self.encode_plus(
  File "/home/rl/miniconda3/envs/opendexvla/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3202, in encode_plus
    return self._encode_plus(
  File "/home/rl/miniconda3/envs/opendexvla/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 603, in _encode_plus
    batched_output = self._batch_encode_plus(
TypeError: _batch_encode_plus() got an unexpected keyword argument 'images'
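
A minimal sketch of the fix, assuming the Qwen2-VL weights were downloaded to a local directory (both paths are hypothetical):

# Sketch: copy the two processor files into the checkpoint dir that will be evaluated.
import shutil

src = 'weights/Qwen2-VL-2B-Instruct'            # downloaded Qwen2_vl weights (hypothetical path)
dst = 'OUTPUT/qwen2_dexvla/checkpoint-60000'    # your trained checkpoint dir (hypothetical path)
for name in ('preprocessor_config.json', 'chat_template.json'):
    shutil.copy(f'{src}/{name}', f'{dst}/{name}')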

2. CUDA OOM.

For OOM problems, we provide three ways to save CUDA memory; you can use any one solution or all of them together. Below we list the training speed and GPU memory for all three solutions, plus a baseline with none of them. Notably, all results are measured on a single A6000 (46G) with batch_size 2.

❗ Notably, DeepSpeed may take more GPU memory on a single GPU with ZeRO-2 optimization.

Script | DeepSpeed offload | LoRA VLM | Smaller ScaleDP | Training speed | CUDA memory
local_debug_deepspeed.sh | ✔️ | - | - | 6.56 s/iter | 20-29G
local_debug_python.sh | - | ✔️ | - | 1.09 s/iter | 24G
local_debug_python.sh | - | - | ✔️ | 1.01 s/iter | 33G
local_debug_python.sh | - | - | - | 1.1 s/iter | 38G

Deepspeed offload

DeepSpeed allows offloading the optimizer state to the CPU, which saves a lot of CUDA memory. You can enable the offload by adding the following part to scripts/zero2.json. Please make sure your GCC version is > 9.

    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        //##Adding this part to zero2.json###
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
        //###################################
    },
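
If you would rather not edit the JSON by hand, the same block can be added programmatically; a sketch, assuming scripts/zero2.json is plain JSON without comments:

# Sketch: add the optimizer-offload block to the existing ZeRO-2 config.
import json

path = 'scripts/zero2.json'
with open(path) as f:
    cfg = json.load(f)

cfg['zero_optimization']['offload_optimizer'] = {'device': 'cpu', 'pin_memory': True}

with open(path, 'w') as f:
    json.dump(cfg, f, indent=4)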

LoRA Finetune

Our scripts facilitate LoRA (Low-Rank Adaptation) fine-tuning of the Vision-Language Model (VLM) backbone. This approach is effective in reducing GPU memory usage. Meanwhile, the policy head continues to undergo full parameter training.

To enable LoRA, you can set the following hyperparameters within the training scripts:

  ...
  --lora_enable True \
  ...
  --freeze_vision_tower True \
  --freeze_backbone True \
  ...

Notice: after LoRA fine-tuning, you need to process the checkpoint files as follows:

cd /path/to/finetuned/dir/checkpoint-xxxx
python ./zero_to_fp32.py ./ ./non_lora_trainables.bin

For evaluation, you have to specify the following arguments in evaluate/smart_eval_agilex.py:

"model_base": None, # path to base model
"enable_lora": True, 

Smaller ScaleDP

Our DexVLA consists of two parts: the VLM backbone and the ScaleDP policy head. In our paper, we utilize a 1B-sized ScaleDP. To save memory, we recommend that users employ a smaller one, such as the 410M-sized ScaleDP.

Set the following hyperparameter:

  ...
  --policy_head_size "ScaleDP_L" \
  ...

3. Action values are NaN during inference, which happens at the last denoising step in "policy_heads/models/transformer_diffusion/modeling_dit_diffusion.py".

This is a precision overflow problem in "DDIMScheduler" from diffusers.schedulers.scheduling_ddim. The easiest fix is to add one line in "scheduling_ddim.py":

        ###other code###
        else:
            raise NotImplementedError(f"{beta_schedule} does is not implemented for {self.__class__}")
        
        ### newly added ###############################
        self.betas = self.betas.to(dtype=torch.float32)
        ###############################################
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        ###other code###

4. Robot performs random actions during evaluation.

This was a bug in evaluation that does not affect the training process. Sorry about that; we have fixed it in 23e29e1.

Diffusion-VLA

Our DexVLA is built on Diffusion-VLA (DiVLA), which can be found here. The paper can be found in Citation. You can train Diffusion-VLA with "./scripts/train_divla.sh". The main differences are as follows:

  1. DiVLA utilizes a UNet-based diffusion policy as the policy head of the VLA.
  2. DiVLA does not use the three-stage training recipe.

ScaleDP

DexVLA utilizes ScaleDP as its diffusion policy head; the main structure of ScaleDP can be found here. The paper can be found in Citation. The code can be found in this dir; there are only two files, one for the configuration and the other for the model structure.

Acknowledgement

We build our project based on:

  • LLaVA: an amazing open-source project for vision-language assistants
  • act-plus-plus: an amazing open-source project for robotic visuomotor learning
  • Miphi: an amazing open-source project for tiny vision-language models

Citation

# DexVLA
@article{wen2025dexvla,
  title={DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control},
  author={Wen, Junjie and Zhu, Yichen and Li, Jinming and Tang, Zhibin and Shen, Chaomin and Feng, Feifei},
  journal={arXiv preprint arXiv:2502.05855},
  year={2025}
}

# Diffusion-VLA
@article{wen2024diffusion,
  title={Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression},
  author={Wen, Junjie and Zhu, Minjie and Zhu, Yichen and Tang, Zhibin and Li, Jinming and Zhou, Zhongyi and Li, Chengmeng and Liu, Xiaoyu and Peng, Yaxin and Shen, Chaomin and others},
  journal={arXiv preprint arXiv:2412.03293},
  year={2024}
}

# ScaleDP
@article{zhu2024scaling,
  title={Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation},
  author={Zhu, Minjie and Zhu, Yichen and Li, Jinming and Wen, Junjie and Xu, Zhiyuan and Liu, Ning and Cheng, Ran and Shen, Chaomin and Peng, Yaxin and Feng, Feifei and others},
  journal={arXiv preprint arXiv:2409.14411},
  year={2024}
}
