Junyi Ma1, Wentao Bao2, Jingyi Xu1, Guanzhong Sun3, Xieyuanli Chen4, Hesheng Wang1
1 Shanghai Jiao Tong University 2 Meta Reality Labs 3 China University of Mining and Technology 4 National University of Defense Technology
🟢 Green: past waypoints | 🔴 Red: predicted waypoints | 🔵 Blue: GT waypoints
This work has been accepted by IROS 2025 🎉 We are updating the tutorials. Your patience is appreciated :)
- Release the paper
- Release our self-collected CABH Benchmark for fast HTP evaluation 😎
- Release the code and pretrained models on EgoPAT3D
- Release the code and pretrained models on our CABH benchmark
📁 We strongly recommend following the default data structure for fast reproduction. [Click to expand]
    /data/HTPdata/
    |-- EgoPAT3D-postproc
    |   |-- odometry
    |   |-- trajectory_repair
    |   |-- video_clips_hand
    |   |-- pointcloud_bathroomCabinet_1   # for demo
    |   |-- glip_feats
    |   |-- motion_feats
    |   |-- egopat_voxel_filtered
    |-- CABH-benchmark
    |   |-- redcup
    |   |   |-- hand_data_for_pipeline_mask_redcup
    |   |   |-- glip_feats_redcup
    |   |   |-- motion_feats_redcup
    |   |   |-- train_split.txt
    |   |   |-- test_split.txt
    |   |-- redapple
    |   |-- box
🔧 We provide some scripts to process raw data manually. [Click to expand]
- 0.1 Camera Egomotion Generation

Please refer to the config file `preprocess/CamEgoGen/ceg.yml`, then run:

    cd preprocess/CamEgoGen
    python generate_homography_offline.py
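These homographies later serve as the camera egomotion features. As a minimal illustration of how a 3x3 homography acts on 2D image points (the `warp_points` helper below is illustrative, not part of the repo):

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous (N, 3)
    warped = pts_h @ H.T                              # projective transform
    return warped[:, :2] / warped[:, 2:3]             # de-homogenize

# A pure-translation homography shifts every point by (5, -3):
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0, 1.0]])
print(warp_points(H, np.array([[10.0, 20.0]])))  # -> [[15. 17.]]
```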
- 0.2 Vision-Language Feature Extraction

Please clone the original GLIP repo and merge it into `VLExtraction`:

    cd preprocess
    git clone https://github.com/microsoft/GLIP
    rsync -a --progress GLIP/ VLExtraction/
    cd VLExtraction
Then install the requirements of GLIP and modify its source code to collect vision-language fusion features as follows:

1. `maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py`

        def forward(self,
            ...
            return result
        ->  return result, fused_visual_features

2. `maskrcnn_benchmark/engine/predictor_glip.py`

        def compute_prediction(self, original_image,
            ...
            predictions = self.model(image_list, ...
        ->  predictions, visual_features = self.model(image_list, ...

3. `maskrcnn_benchmark/engine/predictor_glip.py`

        def run_on_web_image(self,
            ...
            predictions = self.compute_prediction(original_image, ...
        ->  predictions, visual_features = self.compute_prediction(original_image, ...
After modifying the params in `preprocess/VLExtraction/vle.yml`, you can use this script to generate GLIP features for all the videos in EgoPAT3D-DT:

    python generate_homography_offline.py

Alternatively, you can download the features we have produced.
- 0.3 Point Cloud Aggregation

We transform sequential point clouds into a unified reference frame for voxelization. Here is a demo to aggregate them; please refer to the config file `preprocess/PC2Voxel/p2v.yml`:

    cd preprocess/PC2Voxel
    python generate_occupancy_offline.py
This is just a demo to aggregate depth points. You can also use the point clouds processed with arm masks (0.4) as inputs. Notably, our main code can do this automatically and save the results to the required voxel files.
Alternatively, you can download the required voxel files we have produced.
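For intuition, the aggregation-plus-voxelization step can be sketched in plain NumPy, assuming per-frame camera-to-reference 4x4 poses and a uniform voxel size (both are assumptions for illustration, not the repo's exact conventions):

```python
import numpy as np

def aggregate_and_voxelize(clouds, poses, voxel_size=0.05):
    """Transform per-frame point clouds into a shared frame, then voxelize.

    clouds: list of (N_i, 3) arrays, each in its own camera frame
    poses:  list of 4x4 camera-to-reference transforms
    Returns the unique occupied voxel indices as an (M, 3) int array.
    """
    merged = []
    for pts, T in zip(clouds, poses):
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
        merged.append((pts_h @ T.T)[:, :3])               # into reference frame
    merged = np.vstack(merged)
    voxels = np.floor(merged / voxel_size).astype(np.int64)
    return np.unique(voxels, axis=0)                      # deduplicate occupied cells
```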
- 0.4 Arm Filtering for Clean Global Context

We use MobileSAM to efficiently filter out arm point clouds for a clean 3D global context. Please install the environment according to this repo. Our repo has accommodated the MobileSAM repo; you can download the weights here and put them in the `weights` folder. Remember to modify the params in `preprocess/MobileSAM/ms.yml`:

    cd preprocess/MobileSAM
    python demo_arm_pc_filter.py
If you want to loop over all the data in EgoPAT3D-DT, please run:

    cd preprocess/MobileSAM
    python loop_arm_pc_filter_egopat3d.py
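Conceptually, arm filtering amounts to masking out arm pixels before back-projecting depth to 3D 
(in the repo the mask comes from MobileSAM; the pinhole intrinsics `fx`, `fy`, `cx`, `cy` below are illustrative parameters, not the repo's interface):

```python
import numpy as np

def filter_arm_points(depth, arm_mask, fx, fy, cx, cy):
    """Back-project a depth map to 3D, keeping only pixels NOT on the arm.

    depth:    (H, W) metric depth
    arm_mask: (H, W) bool, True where the arm/hand was segmented
    Returns an (N, 3) array of clean global-context points.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    keep = (~arm_mask) & (depth > 0)          # drop arm pixels and invalid depth
    z = depth[keep]
    x = (u[keep] - cx) * z / fx               # pinhole back-projection
    y = (v[keep] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```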
⬇️ You can directly download all our preprocessed files as follows:
| Description | Link | Config |
|---|---|---|
| EgoPAT3D-DT from USST | EgoPAT3D-postproc | `EgoPAT3D-postproc` for `options/expopts.py` |
| GLIP features | glip_feats | `glip_feats_path` for `options/expopts.py` |
| camera egomotion | motion_feats | `motion_feats_path` for `options/expopts.py` |
| occupancy voxel grids | egopat_voxel_filtered | `voxel_path` for `options/expopts.py` |
| MobileSAM weights | mobile_sam.pt | `weights` for `preprocess/MobileSAM` |
| point cloud example from raw EgoPAT3D | pointcloud_bathroomCabinet_1 | `examples` for `preprocess/PC2Voxel` |
First, set the GPU number and the checkpoint file in `options/traineval_config.yml`. Then you can evaluate the performance on EgoPAT3D-DT by

    bash val_traj.sh

Note that you can modify the configurations of models and experiments in `options/expopts.py`. For instance, to test MMTwin on seen scenes in 2D space, please set `test_novel=False` and `test_space="2d"`.
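HTP results on EgoPAT3D-DT are typically summarized as displacement errors between predicted and ground-truth waypoints; a generic sketch of ADE/FDE (not the repo's exact metric code):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement errors for (T, D) trajectories.

    ADE averages the per-waypoint Euclidean error over the horizon;
    FDE is the error at the final predicted waypoint.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-waypoint Euclidean error
    return dists.mean(), dists[-1]
```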
We have released the pretrained MMTwin models at this link. Feel free to download them and put them in `./mmtwin_weights/`.
We noticed that the performance gain from MHSA is marginal, so we omit it in this repo to improve computational efficiency. More versions will be released soon. To train MMTwin from scratch, simply run

    bash train.sh
We collected multiple egocentric videos capturing human hands performing simple object manipulation tasks. This benchmark enables rapid validation of the potential of hand trajectory prediction models for downstream manipulation applications.

Past trajectories are shown in green, and MMTwin's predicted future trajectories are displayed in red. The direct detection results from the visual grounding model on future frames are visualized in blue. As can be seen, MMTwin achieves performance comparable to the visual grounding model, even though the visual grounding model can "look into the future".
⬇️ Feel free to download the raw/preprocessed data of CABH benchmark.
| Task | Description | Link (raw) | Link (preprocessed) | Link (GLIP feats) | Link (motion info) | Link (train/test splits) |
|---|---|---|---|---|---|---|
| 1 | place the cup on the coaster | hand_data_red_cup | hand_data_for_pipeline_mask_redcup | glip_feats_redcup | motion_feats_redcup | train_split/test_split |
| 2 | put the apple on the plate | hand_data_red_apple | hand_data_for_pipeline_mask_redapple | glip_feats_redapple | motion_feats_redapple | train_split/test_split |
| 3 | place the box on the shelf | hand_data_box | hand_data_for_pipeline_mask_box | glip_feats_box | motion_feats_box | train_split/test_split |
- Link (raw): Raw RGB and depth images from a head-mounted RealSense D435i.
- Link (preprocessed): Preprocessed data for HTP. Please refer to `read_cabh.ipynb` for more details.
- Link (GLIP feats): Vision-language features extracted by GLIP, a powerful visual grounding model.
- Link (motion info): Camera egomotion represented as homography matrices.
- Link (train/test splits): The dataset splits used for training and evaluation.
The implementation of MMTwin for our CABH benchmark will be released soon.
🤝 If our work is helpful to your research, we would appreciate a citation to our paper:
    @misc{ma2025mmtwin,
        title={Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction},
        author={Junyi Ma and Wentao Bao and Jingyi Xu and Guanzhong Sun and Xieyuanli Chen and Hesheng Wang},
        year={2025},
        eprint={2504.07375},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2504.07375},
    }
We have also released Diff-IP2D, a basic 2D HOI prediction approach. Its open-source code shows how to implement 2D hand trajectory prediction on the EPIC-KITCHENS dataset. Feel free to try it!
We gratefully acknowledge the inspiring work of DiffuSeq, hoi-forecast, USST, mamba.py and other valuable contributions from the community.