BridgeDepth

Official implementation of the paper:

BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment, ICCV 2025 (Highlight)
Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu

Abstract

Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: it reduces zero-shot generalization error by >40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, our approach enables robust 3D perception that transcends modality-specific limitations.

Example of reconstructions

TL;DR: A unified framework combines monocular and stereo depth estimation through iterative bidirectional alignment of latent representations, achieving state-of-the-art results and addressing ambiguities in stereo vision.

Get Started

Installation

  1. Clone BridgeDepth
git clone https://github.com/aeolusguan/BridgeDepth
cd BridgeDepth
  2. Create the environment; we recommend conda.
conda create -n bridgedepth python=3.10
conda activate bridgedepth
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126  # use the correct version of cuda for your system
pip install -r requirement.txt
# Optional, but recommended (~30% faster)
pip install xformers==0.0.30 --index-url https://download.pytorch.org/whl/cu126  # use the correct version of cuda for your system
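
Optionally, you can sanity-check the environment before moving on. The snippet below is a generic PyTorch check, not part of this repository:

import torch

print(torch.__version__)          # expect 2.7.0 for the pinned install above
print(torch.cuda.is_available())  # should print True on a machine with a working CUDA setup
print(torch.version.cuda)         # CUDA version the wheel was built against, e.g. 12.6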

Checkpoints

We provide several pre-trained models:

| Model name | Benchmark | Training resolutions | Stereo encoder | Training config |
| --- | --- | --- | --- | --- |
| sf.pth | Scene Flow | 368x784 | BasicEncoder | default.py |
| l_sf.pth | Scene Flow | 368x784 | ConvNext-Tiny | l_train.yaml |
| kitti.pth | KITTI 2012/2015 | 304x1152 | ConvNext-Tiny | kitti_mix_train.yaml |
| eth3d_pretrain.pth, eth3d.pth | ETH3D | 384x512 | ConvNext-Tiny | eth3d_pretrain.yaml, eth3d.yaml |
| middlebury_pretrain.pth, middlebury.pth | Middlebury | 384x512, 512x768 | ConvNext-Tiny | middlebury_pretrain.yaml, middlebury.yaml |
| rvc_pretrain.pth, rvc.pth | Robust Vision Challenge | 384x768, 384x768 | ConvNext-Tiny | rvc_pretrain.yaml, rvc.yaml |

Run demo

python demo.py --model_name rvc_pretrain  # also try with [rvc | eth3d_pretrain | middlebury_pretrain]

You should see the output disparity visualization

Point cloud output (without denoising)
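
For reference, back-projecting a disparity map to a point cloud only requires the standard stereo relation Z = f * B / d. The sketch below is illustrative rather than the repo's demo code; the calibration values (fx, baseline, cx, cy) are placeholders you must replace with your own rig's parameters.

import numpy as np

def disparity_to_points(disp: np.ndarray, fx: float, baseline: float,
                        cx: float, cy: float) -> np.ndarray:
    """disp: HxW disparity in pixels; returns Nx3 points in the left camera frame."""
    h, w = disp.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    valid = disp > 0                                 # skip invalid / zero-disparity pixels
    z = fx * baseline / disp[valid]                  # depth from disparity
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fx                     # assumes fx == fy (square pixels)
    return np.stack([x, y, z], axis=-1)

# Example with placeholder calibration:
# points = disparity_to_points(disp, fx=1000.0, baseline=0.12,
#                              cx=disp.shape[1] / 2, cy=disp.shape[0] / 2)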

Inference

To test on your own stereo image pairs, placed at $left_directory and $right_directory respectively, run:

python infer.py --input $left_directory $right_directory --output $output_directory --from-pretrained rvc_pretrain # also try with [rvc | eth3d_pretrain | middlebury_pretrain]

Tips:

  • For in-the-wild deployment, we generally recommend the rvc_pretrain.pth checkpoint. You are also encouraged to try the other models to find the best fit (middlebury_pretrain.pth, eth3d_pretrain.pth, or rvc.pth may suit your data better).
  • For high-resolution images (>720p), we strongly suggest running at a smaller scale, e.g., downsampled to 720p; this is not only faster but usually also more accurate (see the sketch below).
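
As an illustration of the second tip, the sketch below downsamples stereo pairs so that the shorter side is at most 720 px before calling infer.py. It is not part of this repository; the directory names and the .png extension are placeholders.

from pathlib import Path
from PIL import Image

def downsample_pairs(src_dir: str, dst_dir: str, max_short_side: int = 720) -> None:
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.png")):   # adjust the extension to your data
        img = Image.open(img_path)
        scale = max_short_side / min(img.size)              # img.size = (width, height)
        if scale < 1.0:                                     # only shrink, never upscale
            new_size = (round(img.width * scale), round(img.height * scale))
            img = img.resize(new_size, Image.BILINEAR)
        img.save(dst / img_path.name)

# Example with placeholder directories, followed by inference on the smaller copies:
# downsample_pairs("raw/left", "small/left")
# downsample_pairs("raw/right", "small/right")
# python infer.py --input small/left small/right --output out --from-pretrained rvc_pretrain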

Datasets

To train/evaluate BridgeDepth, you first need to prepare datasets following this guide.

Evaluation

To evaluate on the Scene Flow test set, run

python main.py --num-gpus 4 --eval-only --from-pretrained sf  # adjust the number of GPUs as needed
# or
python main.py --num-gpus 4 --eval-only --from-pretrained l_sf

For zero-shot generalization evaluation, run

python main.py --num-gpus 4 --eval-only --config-file configs/zero_shot_evaluation.yaml --from-pretrained sf

For submission to KITTI 2012/2015, ETH3D, and Middlebury online test sets, you can run:

python infer.py --dataset-name kitti_2015 --from-pretrained kitti  # produce kitti_2015_submission in current working directory
python infer.py --dataset-name kitti_2012 --from-pretrained kitti  # produce kitti_2012_submission in current working directory
python infer.py --dataset-name eth3d --output eth3d_submission --from-pretrained eth3d  # try with --from-pretrained rvc for _RVC submission
python infer.py --dataset-name middlebury_H --output middlebury_submission --from-pretrained middlebury # try with --from-pretrained rvc for _RVC submission

Training

First, download the Depth Anything V2 (DAv2) model:

mkdir checkpoints; cd checkpoints
wget https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth
cd ..

Train on SceneFlow

python main.py --num-gpus 4 --checkpoint-dir checkpoints/sf
python main.py --num-gpus 4 --config-file configs/L_train.yaml --checkpoint-dir checkpoints/l_sf

Finetune for Benchmarks

# KITTI
python main.py --num-gpus 4 --config-file configs/kitti_mix_train.yaml --checkpoint-dir checkpoints/kitti SOLVER.RESUME checkpoints/l_sf/step_300000.pth
# ETH3D
python main.py --num-gpus 4 --config-file configs/eth3d_pretrain.yaml --checkpoint-dir checkpoints/eth3d_pretrain SOLVER.RESUME checkpoints/l_sf/step_300000.pth
python main.py --num-gpus 4 --config-file configs/eth3d.yaml --checkpoint-dir checkpoints/eth3d SOLVER.RESUME checkpoints/eth3d_pretrain/step_300000.pth
# Middlebury
python main.py --num-gpus 4 --config-file configs/middlebury_pretrain.yaml --checkpoint-dir checkpoints/middlebury_pretrain SOLVER.RESUME checkpoints/l_sf/step_300000.pth
python main.py --num-gpus 4 --config-file configs/middlebury.yaml --checkpoint-dir checkpoints/middlebury SOLVER.RESUME checkpoints/middlebury_pretrain/step_200000.pth
# RVC
python main.py --num-gpus 4 --config-file configs/rvc_pretrain.yaml --checkpoint-dir checkpoints/rvc_pretrain SOLVER.RESUME checkpoints/l_sf/step_300000.pth
python main.py --num-gpus 4 --config-file configs/rvc.yaml --checkpoint-dir checkpoints/rvc SOLVER.RESUME checkpoints/rvc_pretrain/step_200000.pth

We support monitoring the training process with TensorBoard. You can start a TensorBoard session with

tensorboard --logdir checkpoints

and then access http://localhost:6006 in your browser.

BibTeX

@article{guan2025bridgedepth,
  title={BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment},
  author={Guan, Tongfan and Guo, Jiaxin and Wang, Chen and Liu, Yun-Hui},
  journal={arXiv preprint arXiv:2508.04611},
  year={2025}
}

Acknowledgement

Thanks to the authors of DepthAnything V2, NMRF, DEFOM-Stereo, and FoundationStereo for their code releases. Finally, thanks to the ICCV reviewers and area chairs for their appreciation of this work and their constructive feedback.
