IRMVLab/MMTwin

MMTwin: Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction


Junyi Ma1, Wentao Bao2, Jingyi Xu1, Guanzhong Sun3, Xieyuanli Chen4, Hesheng Wang1

1 Shanghai Jiao Tong University 2 Meta Reality Labs 3 China University of Mining and Technology 4 National University of Defense Technology

🟢 Green: past waypoints | 🔴 Red: predicted waypoints | 🔵 Blue: GT waypoints

This work has been accepted to IROS 2025 🎉 We are updating the tutorials; your patience is appreciated :)

TODO

  • Release the paper :bowtie:
  • Release our self-collected CABH Benchmark for fast HTP evaluation 😎
  • Release the code and pretrained models on EgoPAT3D
  • Release the code and pretrained models on our CABH benchmark

Suggested Data Structure

📁 We strongly recommend following the default data structure for fast reproduction.
```
/data/HTPdata/
|-- EgoPAT3D-postproc
|   |-- odometry
|   |-- trajectory_repair
|   |-- video_clips_hand
|   |-- pointcloud_bathroomCabinet_1  # for demo
|   |-- glip_feats
|   |-- motion_feats
|   |-- egopat_voxel_filtered
|-- CABH-benchmark
    |-- redcup
    |   |-- hand_data_for_pipeline_mask_redcup
    |   |-- glip_feats_redcup
    |   |-- motion_feats_redcup
    |   |-- train_split.txt
    |   |-- test_split.txt
    |-- redapple
    |-- box
```

Benchmark Evaluation

EgoPAT3D

0. Data Preprocessing (optional)

🔧 We provide some scripts to process the raw data manually.
  • 0.1 Camera Egomotion Generation

Please refer to the config file preprocess/CamEgoGen/ceg.yml.

```shell
cd preprocess/CamEgoGen
python generate_homography_offline.py
```
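For intuition, camera egomotion between consecutive frames is represented as a 3x3 homography that warps pixel coordinates from one frame into the next. A minimal NumPy sketch of applying such a homography (the function name and values are illustrative, not the repo's API):

```python
import numpy as np

def apply_homography(H, pts):
    """Warp (N, 2) pixel coordinates by a 3x3 homography using
    homogeneous coordinates, then de-homogenize."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 3)
    warped = pts_h @ H.T
    return warped[:, :2] / warped[:, 2:3]

# A pure-translation homography: every pixel shifts by (+5, -3),
# mimicking how small camera egomotion moves image content.
H = np.array([[1.0, 0.0,  5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0,  1.0]])
pts = np.array([[10.0, 20.0], [0.0, 0.0]])
print(apply_homography(H, pts))  # (10, 20) -> (15, 17); (0, 0) -> (5, -3)
```

In practice the script estimates these matrices from feature matches between frames and saves them offline as the motion features.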
  • 0.2 Vision-Language Feature Extraction

Please clone the original GLIP repo and merge it into VLExtraction:

```shell
cd preprocess
git clone https://github.com/microsoft/GLIP
rsync -a --progress GLIP/ VLExtraction/
cd VLExtraction
```

Then install the requirements of GLIP and modify its source code to collect vision-language fusion features as follows:

1. In maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py:

```python
def forward(self,
    ...
    return result
->  return result, fused_visual_features
```

2. In maskrcnn_benchmark/engine/predictor_glip.py:

```python
def compute_prediction(self, original_image,
    ...
    predictions = self.model(image_list, ...
->  predictions, visual_features = self.model(image_list, ...
```

3. Also in maskrcnn_benchmark/engine/predictor_glip.py:

```python
def run_on_web_image(self,
    ...
    predictions = self.compute_prediction(original_image, ...
->  predictions, visual_features = self.compute_prediction(original_image, ...
```

After modifying the params in preprocess/VLExtraction/vle.yml, you can use this script to generate GLIP features for all the videos in EgoPAT3D-DT:

```shell
python generate_homography_offline.py
```
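Conceptually, the offline extractor loops over the frames of each clip, runs the modified GLIP predictor, and caches the fused vision-language features to disk. A hedged sketch of that caching pattern (the extractor callable and file layout here are stand-ins, not the repo's actual code):

```python
import numpy as np
from pathlib import Path

def cache_clip_features(frames, extract_fn, out_path):
    """Run a per-frame feature extractor over a clip and save the
    stacked features as a single .npy file."""
    feats = np.stack([extract_fn(f) for f in frames])  # (T, feat_dim)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    np.save(out_path, feats)
    return feats.shape

# Stand-in extractor; with the modifications above this would instead be:
#   predictions, visual_features = glip_demo.run_on_web_image(frame, caption, thresh)
dummy_extract = lambda frame: frame.mean(axis=(0, 1))  # pooled per-frame "feature"
frames = [np.random.rand(8, 8, 3) for _ in range(4)]
print(cache_clip_features(frames, dummy_extract, "/tmp/glip_feats/clip0.npy"))  # (4, 3)
```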

Alternatively, please download the features we have produced.

  • 0.3 Point Cloud Aggregation

We transform sequential point clouds into a unified reference frame for voxelization. Here is a demo to aggregate them. Please refer to the config file preprocess/PC2Voxel/p2v.yml.

```shell
cd preprocess/PC2Voxel
python generate_occupancy_offline.py
```

This is just a demo for aggregating depth points. You can also use the point clouds processed with arm masks (step 0.4) as inputs. Note that our main code can do this automatically and save the results to the required voxel files.

Alternatively, you can download the required voxel files we have produced.
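The core idea is to transform each frame's point cloud into one reference frame with its camera pose and then mark which cells of a voxel grid are occupied. A minimal NumPy sketch under simplified assumptions (axis-aligned grid over a unit cube; the function and grid parameters are illustrative, not the repo's implementation):

```python
import numpy as np

def aggregate_and_voxelize(clouds, poses, grid_size=32, extent=1.0):
    """Transform each (N, 3) cloud into a shared reference frame with its
    4x4 pose, then mark occupied cells of a grid covering [0, extent)^3."""
    merged = []
    for pts, T in zip(clouds, poses):
        homog = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 4)
        merged.append((homog @ T.T)[:, :3])               # apply pose
    merged = np.vstack(merged)
    idx = np.floor(merged / extent * grid_size).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < grid_size), axis=1)]  # clip outliers
    occ = np.zeros((grid_size,) * 3, dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

# Two one-point clouds: identity pose, and a pose shifting by +0.5 in x.
c0 = np.array([[0.1, 0.1, 0.1]])
c1 = np.array([[0.1, 0.1, 0.1]])
T0 = np.eye(4)
T1 = np.eye(4); T1[0, 3] = 0.5
occ = aggregate_and_voxelize([c0, c1], [T0, T1])
print(occ.sum())  # 2 occupied voxels at distinct cells
```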

  • 0.4 Arm Filtering for Clean Global Context

We use MobileSAM to efficiently filter out arm points for a clean 3D global context. Please install the environment according to this repo. Our repo has integrated the MobileSAM repo; you can download the weights here and put them in the weights folder. Remember to modify the params in preprocess/MobileSAM/ms.yml.

```shell
cd preprocess/MobileSAM
python demo_arm_pc_filter.py
```

To loop over all the data in EgoPAT3D-DT, please run

```shell
cd preprocess/MobileSAM
python loop_arm_pc_filter_egopat3d.py
```
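The filtering step itself amounts to dropping 3D points whose image-plane projections land inside the predicted arm mask. A hedged sketch of that idea (the (u, v, z) point layout and function are assumptions for illustration, not the repo's code):

```python
import numpy as np

def filter_arm_points(points_uvz, arm_mask):
    """Drop points whose pixel projection (u, v) falls inside the
    segmentation mask of the arm, keeping only scene-context points.
    points_uvz: (N, 3) array of (u, v, depth); arm_mask: (H, W) bool."""
    u = points_uvz[:, 0].astype(int)
    v = points_uvz[:, 1].astype(int)
    keep = ~arm_mask[v, u]          # mask is indexed (row, col) = (v, u)
    return points_uvz[keep]

mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True                   # one "arm" pixel
pts = np.array([[1, 1, 0.5],        # projects onto the arm -> removed
                [2, 3, 0.8]])       # scene point -> kept
print(len(filter_arm_points(pts, mask)))  # 1
```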

⬇️ You can directly download all our preprocessed files as follows:

| Description | Link | Config |
| --- | --- | --- |
| EgoPAT3D-DT from USST | EgoPAT3D-postproc | `EgoPAT3D-postproc` for `options/expopts.py` |
| GLIP features | glip_feats | `glip_feats_path` for `options/expopts.py` |
| camera egomotion | motion_feats | `motion_feats_path` for `options/expopts.py` |
| occupancy voxel grids | egopat_voxel_filtered | `voxel_path` for `options/expopts.py` |
| MobileSAM weights | mobile_sam.pt | `weights` for `preprocess/MobileSAM` |
| point cloud example from raw EgoPAT3D | pointcloud_bathroomCabinet_1 | `examples` for `preprocess/PC2Voxel` |

1. Test MMTwin on EgoPAT3D-DT

First, set the GPU number and the checkpoint file in options/traineval_config.yml. Then you can evaluate the performance on EgoPAT3D-DT by

```shell
bash val_traj.sh
```

Note that you can modify the configurations of models and experiments in options/expopts.py. For instance, to test MMTwin on seen scenes in 2D space, please set test_novel=False and test_space="2d".
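For reference, those two settings would look like this in options/expopts.py (an illustrative fragment; the real file contains more options around them):

```python
# options/expopts.py: evaluate MMTwin on seen scenes in 2D space
test_novel = False   # False: seen scenes; True: unseen (novel) scenes
test_space = "2d"    # "2d" or "3d" evaluation space
```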

We have released the pretrained MMTwin models at this link. Feel free to download them and put them in ./mmtwin_weights/.

2. Train MMTwin on EgoPAT3D-DT

We noticed that the performance gain from MHSA is marginal, so we omit it in this repo to improve computational efficiency. More versions will be released soon. To train MMTwin from scratch, simply run

```shell
bash train.sh
```

CABH Benchmark

We collected multiple egocentric videos capturing human hands performing simple object manipulation tasks. This benchmark enables rapid validation of the potential of human hand trajectory prediction models for downstream manipulation applications.

Past trajectories are shown in green, and MMTwin's predicted future trajectories are displayed in red. The direct detection results from the visual grounding model on future frames are visualized in blue. As can be seen, MMTwin achieves performance comparable to the visual grounding model even though the visual grounding model can "look into the future".

⬇️ Feel free to download the raw/preprocessed data of CABH benchmark.

| Task | Description | Link (raw) | Link (preprocessed) | Link (GLIP feats) | Link (motion info) | Link (train/test splits) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | place the cup on the coaster | hand_data_red_cup | hand_data_for_pipeline_mask_redcup | glip_feats_redcup | motion_feats_redcup | train_split/test_split |
| 2 | put the apple on the plate | hand_data_red_apple | hand_data_for_pipeline_mask_redapple | glip_feats_redapple | motion_feats_redapple | train_split/test_split |
| 3 | place the box on the shelf | hand_data_box | hand_data_for_pipeline_mask_box | glip_feats_box | motion_feats_box | train_split/test_split |
  • Link (raw): Raw RGB and depth images from a head-mounted RealSense D435i.
  • Link (preprocessed): Preprocessed data for HTP. Please refer to read_cabh.ipynb for more details.
  • Link (GLIP feats): Vision-language features extracted by GLIP, a powerful visual grounding model.
  • Link (motion info): Camera egomotion homography.
  • Link (train/test splits): The dataset splits used for training and evaluation.

The implementation of MMTwin for our CABH benchmark will be released soon.

🤝 If our work is helpful to your research, we would appreciate a citation to our paper:

```bibtex
@misc{ma2025mmtwin,
  title={Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction},
  author={Junyi Ma and Wentao Bao and Jingyi Xu and Guanzhong Sun and Xieyuanli Chen and Hesheng Wang},
  year={2025},
  eprint={2504.07375},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07375}
}
```

Prior Works

We have released Diff-IP2D, a basic 2D HOI prediction approach. Its open-source code shows how to implement 2D hand trajectory prediction on the EPIC-KITCHENS dataset. Feel free to try it!

Acknowledgements

We gratefully acknowledge the inspiring work of DiffuSeq, hoi-forecast, USST, mamba.py and other valuable contributions from the community.
