Junyi Ma1, Wentao Bao2, Jingyi Xu1, Guanzhong Sun3, Xieyuanli Chen4, Hesheng Wang1
1 Shanghai Jiao Tong University 2 Meta Reality Labs 3 China University of Mining and Technology 4 National University of Defense Technology
🟢 Green: past waypoints | 🔴 Red: predicted waypoints | 🔵 Blue: GT waypoints
This work has been accepted by IROS 2025 🎉 We are updating the tutorials. Your patience is appreciated :)
- Release the paper
- Release our self-collected CABH Benchmark for fast HTP evaluation 😎
- Release the code and pretrained models on EgoPAT3D
- Release the code and pretrained models on our CABH benchmark
📁 We strongly recommend following the default data structure for fast reproduction. [Click to expand]
    /data/HTPdata/
    |-- EgoPAT3D-postproc
    |   |-- odometry
    |   |-- trajectory_repair
    |   |-- video_clips_hand
    |   |-- pointcloud_bathroomCabinet_1   # for demo
    |   |-- glip_feats
    |   |-- motion_feats
    |   |-- egopat_voxel_filtered
    |-- CABH-benchmark
    |   |-- redcup
    |   |   |-- hand_data_for_pipeline_mask_redcup
    |   |   |-- glip_feats_redcup
    |   |   |-- motion_feats_redcup
    |   |   |-- train_split.txt
    |   |   |-- test_split.txt
    |   |-- redapple
    |   |-- box
🔧 We provide some scripts to process raw data manually. [Click to expand]
- 0.1 Camera Egomotion Generation

Please refer to the config file `preprocess/CamEgoGen/ceg.yml`, then run:

    cd preprocess/CamEgoGen
    python generate_homography_offline.py
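These homographies later serve as the camera egomotion features. As a minimal illustration of how a 3x3 homography acts on 2D image points (the `warp_points` helper below is illustrative, not part of the repo):

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous (N, 3)
    warped = pts_h @ H.T                              # projective transform
    return warped[:, :2] / warped[:, 2:3]             # de-homogenize

# A pure-translation homography shifts every point by (5, -3):
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0, 1.0]])
print(warp_points(H, np.array([[10.0, 20.0]])))  # -> [[15. 17.]]
```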
- 0.2 Vision-Language Feature Extraction

Please clone the original GLIP repo and merge it into `VLExtraction`:

    cd preprocess
    git clone https://github.com/microsoft/GLIP
    rsync -a --progress GLIP/ VLExtraction/
    cd VLExtraction
Then install the requirements of GLIP and modify its source code to collect vision-language fusion features as follows:

1. `maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py`

        def forward(self,
            ...
            return result
        ->  return result, fused_visual_features

2. `maskrcnn_benchmark/engine/predictor_glip.py`

        def compute_prediction(self, original_image,
            ...
            predictions = self.model(image_list, ...
        ->  predictions, visual_features = self.model(image_list, ...

3. `maskrcnn_benchmark/engine/predictor_glip.py`

        def run_on_web_image(self,
            ...
            predictions = self.compute_prediction(original_image, ...
        ->  predictions, visual_features = self.compute_prediction(original_image, ...
After modifying the params in `preprocess/VLExtraction/vle.yml`, you can use this script to generate GLIP features for all the videos in EgoPAT3D-DT:

    python generate_homography_offline.py

Alternatively, you can download the features we have produced.
- 0.3 Point Cloud Aggregation

We transform sequential point clouds into a unified reference frame for voxelization. Here is a demo to aggregate them; please refer to the config file `preprocess/PC2Voxel/p2v.yml`:

    cd preprocess/PC2Voxel
    python generate_occupancy_offline.py
This is just a demo to aggregate depth points. You can also use the point clouds processed with arm masks (0.4) as inputs. Notably, our main code can do this automatically and save the results to the required voxel files.
Alternatively, you can download the required voxel files we have produced.
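For intuition, the aggregation-plus-voxelization step can be sketched in plain NumPy, assuming per-frame camera-to-reference 4x4 poses and a uniform voxel size (both are assumptions for illustration, not the repo's exact conventions):

```python
import numpy as np

def aggregate_and_voxelize(clouds, poses, voxel_size=0.05):
    """Transform per-frame point clouds into a shared frame, then voxelize.

    clouds: list of (N_i, 3) arrays, each in its own camera frame
    poses:  list of 4x4 camera-to-reference transforms
    Returns the unique occupied voxel indices as an (M, 3) int array.
    """
    merged = []
    for pts, T in zip(clouds, poses):
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
        merged.append((pts_h @ T.T)[:, :3])               # into reference frame
    merged = np.vstack(merged)
    voxels = np.floor(merged / voxel_size).astype(np.int64)
    return np.unique(voxels, axis=0)                      # deduplicate occupied cells
```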
- 0.4 Arm Filtering for Clean Global Context

We use MobileSAM to efficiently filter out arm point clouds for a clean 3D global context. Please install the environment according to this repo. Our repo has accommodated the MobileSAM repo; you can download the weights here and put them in the `weights` folder. Remember to modify the params in `preprocess/MobileSAM/ms.yml`:

    cd preprocess/MobileSAM
    python demo_arm_pc_filter.py
If you want to loop over all the data in EgoPAT3D-DT, please run:

    cd preprocess/MobileSAM
    python loop_arm_pc_filter_egopat3d.py
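Conceptually, arm filtering amounts to masking out arm pixels before back-projecting depth to 3D 
(in the repo the mask comes from MobileSAM; the pinhole intrinsics `fx`, `fy`, `cx`, `cy` below are illustrative parameters, not the repo's interface):

```python
import numpy as np

def filter_arm_points(depth, arm_mask, fx, fy, cx, cy):
    """Back-project a depth map to 3D, keeping only pixels NOT on the arm.

    depth:    (H, W) metric depth
    arm_mask: (H, W) bool, True where the arm/hand was segmented
    Returns an (N, 3) array of clean global-context points.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    keep = (~arm_mask) & (depth > 0)          # drop arm pixels and invalid depth
    z = depth[keep]
    x = (u[keep] - cx) * z / fx               # pinhole back-projection
    y = (v[keep] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```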
⬇️ You can directly download all our preprocessed files as follows:
| Description | Link | Config |
|---|---|---|
| EgoPAT3D-DT from USST | EgoPAT3D-postproc | `EgoPAT3D-postproc` for `options/expopts.py` |
| GLIP features | glip_feats | `glip_feats_path` for `options/expopts.py` |
| camera egomotion | motion_feats | `motion_feats_path` for `options/expopts.py` |
| occupancy voxel grids | egopat_voxel_filtered | `voxel_path` for `options/expopts.py` |
| MobileSAM weights | mobile_sam.pt | `weights` for `preprocess/MobileSAM` |
| point cloud example from raw EgoPAT3D | pointcloud_bathroomCabinet_1 | `examples` for `preprocess/PC2Voxel` |
First, set the GPU number and the checkpoint file in `options/traineval_config.yml`. Then you can evaluate the performance on EgoPAT3D-DT by

    bash val_traj.sh

Note that you can modify the configurations of models and experiments in `options/expopts.py`. For instance, to test MMTwin on seen scenes in 2D space, please set `test_novel=False` and `test_space="2d"`.
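HTP results on EgoPAT3D-DT are typically summarized as displacement errors between predicted and ground-truth waypoints; a generic sketch of ADE/FDE (not the repo's exact metric code):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement errors for (T, D) trajectories.

    ADE averages the per-waypoint Euclidean error over the horizon;
    FDE is the error at the final predicted waypoint.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-waypoint Euclidean error
    return dists.mean(), dists[-1]
```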
We have released the pretrained MMTwin models at this link. Feel free to download them and put them in `./mmtwin_weights/`.
We noticed that the performance gain from MHSA is marginal, so we omit it in this repo to improve computational efficiency. More versions will be released soon. To train MMTwin from scratch, simply run

    bash train.sh
We collected multiple egocentric videos capturing human hands performing simple object manipulation tasks. This benchmark enables rapid validation of the potential of hand trajectory prediction models for downstream manipulation applications.

Past trajectories are shown in green, and MMTwin's predicted future trajectories are displayed in red. The direct detection results from the visual grounding model on future frames are visualized in blue. As can be seen, MMTwin achieves performance comparable to the visual grounding model, even though the visual grounding model can "look into the future".
⬇️ Feel free to download the raw/preprocessed data of CABH benchmark.
| Task | Description | Link (raw) | Link (preprocessed) | Link (GLIP feats) | Link (motion info) | Link (train/test splits) |
|---|---|---|---|---|---|---|
| 1 | place the cup on the coaster | hand_data_red_cup | hand_data_for_pipeline_mask_redcup | glip_feats_redcup | motion_feats_redcup | train_split/test_split |
| 2 | put the apple on the plate | hand_data_red_apple | hand_data_for_pipeline_mask_redapple | glip_feats_redapple | motion_feats_redapple | train_split/test_split |
| 3 | place the box on the shelf | hand_data_box | hand_data_for_pipeline_mask_box | glip_feats_box | motion_feats_box | train_split/test_split |
- Link (raw): Raw RGB and depth images from a head-mounted RealSense D435i.
- Link (preprocessed): Preprocessed data for HTP. Please refer to `read_cabh.ipynb` for more details.
- Link (GLIP feats): Vision-language features extracted by GLIP, a powerful visual grounding model.
- Link (motion info): Camera egomotion represented as homography matrices.
- Link (train/test splits): The dataset splits used for training and evaluation.
The implementation of MMTwin for our CABH benchmark will be released soon.
🤝 If our work is helpful to your research, we would appreciate a citation to our paper:
    @misc{ma2025mmtwin,
        title={Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction},
        author={Junyi Ma and Wentao Bao and Jingyi Xu and Guanzhong Sun and Xieyuanli Chen and Hesheng Wang},
        year={2025},
        eprint={2504.07375},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2504.07375},
    }
We have also released Diff-IP2D, a basic 2D HOI prediction approach. Its open-source code shows how to implement 2D hand trajectory prediction on the EPIC-KITCHENS dataset. Feel free to try it!
We gratefully acknowledge the inspiring work of DiffuSeq, hoi-forecast, USST, mamba.py and other valuable contributions from the community.