This is the official repository of the ICRA 2025 paper "Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion".
Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting and estimating the depth of transparent objects remain challenging due to complex optical properties. Existing methods primarily delve into only one task using extra inputs or specialized sensors, neglecting the valuable interactions among tasks and the subsequent refinement process, leading to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects, with only a single image input. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. In addition, drawing inspiration from human perception of objects, we further incorporate an iterative strategy, which progressively refines initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input.
We have tested the code on Ubuntu 20.04 with an NVIDIA GeForce RTX 4090, Python 3.8, and CUDA 11.1. The code may work on other systems.
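To compare a local machine against the tested setup, a small check like the following may help. It is only a sketch and not part of the repository: the function name `describe_environment` is hypothetical, and it assumes PyTorch is among the pip dependencies (it reports `torch_installed: False` if PyTorch has not been installed yet).

```python
import importlib.util
import platform
import sys


def describe_environment():
    """Collect basic facts about the local OS, Python, and CUDA setup."""
    info = {
        "os": platform.platform(),
        "python": "{}.{}".format(*sys.version_info[:2]),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda": None,       # CUDA version PyTorch was built with, if any
        "gpu_available": False,   # whether PyTorch can see a CUDA device
    }
    if info["torch_installed"]:
        # Assumption: PyTorch is one of the packages in requirements.txt.
        import torch

        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
        info["gpu_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it inside the virtual environment after installing the dependencies; a mismatch with the versions listed above does not necessarily mean the code will fail.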
- Set up a virtual environment
python3 -m venv modest
source modest/bin/activate
- Install pip dependencies
pip install -r requirements.txt
- Download the datasets
The synthetic dataset Syn-TODD for transparent object perception can be downloaded from this repository.
The real-world dataset ClearPose can be downloaded from this repository.
- Download the model weights
We provide our pre-trained model weights for the Syn-TODD dataset here.
Weights for the real-world ClearPose dataset are available here.
- Modify the configuration file
Modify the parameters in config/config.json. Specify the dataset type, all paths, batch size, and so on. Configure the wandb part if you want to visualize the running process.
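Since the configuration holds several paths, it can be useful to verify them before starting a long run. The sketch below is not part of the repository and makes no assumption about the key names in config/config.json: it walks the whole JSON tree and reports every string value that looks like a path (contains a path separator) but does not exist on disk. The helper names `find_missing_paths` and `check_config` are hypothetical.

```python
import json
from pathlib import Path


def find_missing_paths(node, prefix=""):
    """Yield (key, value) for path-like string values that do not exist on disk.

    Heuristic: a string counts as path-like if it contains "/" or "\\".
    URLs and output locations that are created later will also be reported.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from find_missing_paths(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_missing_paths(value, f"{prefix}[{index}]")
    elif isinstance(node, str) and ("/" in node or "\\" in node):
        if not Path(node).expanduser().exists():
            yield prefix, node


def check_config(config_path="config/config.json"):
    """Load the JSON config and return a list of (key, missing_path) pairs."""
    with open(config_path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    return list(find_missing_paths(config))


# Example usage (run from the repository root):
#     for key, value in check_config():
#         print(f"missing: {key} -> {value}")
```

An empty result only means that every path-like value exists; it does not validate the dataset type, batch size, or wandb settings.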
To train the model on Syn-TODD or ClearPose, simply run:
python train.py
To evaluate the model on the test set, run:
python test.py
To run inference, specify the input image path in inference.py and run:
python inference.py
Our code is largely built upon DPT. We thank the authors for their well-organized open-source code and their great contributions to the community.
If you find MODEST useful in your research or applications, please consider citing it:
@article{liu2025monocular,
title={Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion},
author={Liu, Jiangyuan and Ma, Hongxuan and Guo, Yuxin and Zhao, Yuhao and Zhang, Chi and Sui, Wei and Zou, Wei},
journal={arXiv preprint arXiv:2502.14616},
year={2025}
}