NVIDIA-ISAAC-ROS/isaac_ros_dnn_stereo_depth

Isaac ROS DNN Stereo Depth

NVIDIA-accelerated, deep learned stereo disparity estimation

image

Webinar Available

Learn how to use this package by watching our on-demand webinar: Using ML Models in ROS 2 to Robustly Estimate Distance to Obstacles


Overview

Deep neural network (DNN)-based stereo models have become essential for depth estimation because they overcome many of the fundamental limitations of classical, geometry-based stereo algorithms.

Traditional stereo matching relies on explicitly finding pixel correspondences between the left and right images using handcrafted features. While effective in well-textured, ideal conditions, these approaches often fail in “ill-posed” regions: areas with reflections, specular highlights, texture-less surfaces, repetitive patterns, or occlusions. Even minor camera calibration errors can degrade the result. In such cases, classical algorithms may produce incomplete or inaccurate depth maps, or be forced to discard information entirely, especially when context-dependent filtering is not possible.
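To make the failure mode concrete, here is a minimal sketch of classical correspondence search along a single rectified scanline, using a sum-of-absolute-differences (SAD) cost and a winner-take-all choice. It is an illustration only and is not the SGM implementation compared later in this README; the function names `matching_costs` and `match_scanline` and the sample pixel values are assumptions made for this example.

```python
def matching_costs(left, right, x, max_disp, half_win=1):
    """SAD cost of each candidate disparity d = 0..max_disp for left column x."""
    window = left[x - half_win : x + half_win + 1]
    costs = []
    for d in range(max_disp + 1):
        xr = x - d  # a point at column x in the left image appears at x - d in the right image
        if xr - half_win < 0:
            break  # candidate window would fall off the image
        candidate = right[xr - half_win : xr + half_win + 1]
        costs.append(sum(abs(a - b) for a, b in zip(window, candidate)))
    return costs


def match_scanline(left, right, max_disp, half_win=1):
    """Winner-take-all disparity per pixel; None where the window does not fit."""
    disparities = []
    for x in range(len(left)):
        if x - half_win < 0 or x + half_win >= len(left):
            disparities.append(None)
            continue
        costs = matching_costs(left, right, x, max_disp, half_win)
        disparities.append(costs.index(min(costs)))  # first minimum wins
    return disparities


# Well-textured scanline: the left view is the right view shifted by 2 pixels.
right = [10, 50, 20, 80, 30, 90, 40, 60, 70, 15]
left = [0, 0] + right[:-2]
textured = match_scanline(left, right, max_disp=3)

# Texture-less scanline: every candidate disparity has the same cost.
flat_costs = matching_costs([100] * 10, [100] * 10, x=5, max_disp=3)
```

On the textured scanline the minimum cost is unique wherever the full search range is available, so the shift is recovered. On the flat scanline all candidate costs are identical, so the chosen disparity is arbitrary. This is the ambiguity that handcrafted matching cannot resolve without additional context.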

DNN-based stereo methods learn rich, hierarchical feature representations and context-aware matching costs directly from data. These models leverage semantic understanding and global scene context to infer depth, even in challenging environments where traditional correspondence measures break down. Through training, DNNs can implicitly account for real-world imperfections such as:

  • calibration errors
  • exposure differences
  • hardware noise

Training also improves a DNN’s ability to recognize and handle difficult regions such as reflections or transparent surfaces. The result is more robust, accurate, and dense depth predictions.

These advances are critical for robotics and autonomous systems, enabling applications where both speed and accuracy of depth perception are essential, such as:

  • precise robotic arm manipulation
  • reliable obstacle avoidance and navigation
  • robust target tracking in dynamic or cluttered environments

In these challenging conditions, DNN-based stereo methods outperform classical techniques, making them the preferred choice for modern depth perception tasks.

The figure above illustrates this by comparing the output of a classical stereo algorithm, SGM, with the output of two DNN-based methods, ESS and FoundationStereo.

SGM produces a noisy, error-prone disparity map, while ESS and FoundationStereo produce much smoother and more accurate disparity maps. A closer look shows that FoundationStereo produces the most accurate map: it gives smoother estimates for the plant in the distance and the railings on the left. Overall, FoundationStereo is more accurate than ESS, and ESS is more accurate than SGM.

DNN-based stereo systems begin by passing the left and right images through shared convolutional backbones to extract multi-scale feature maps that encode both texture and semantic information. These feature maps are then compared across candidate disparities by constructing a learnable cost volume, which represents the matching likelihood of each pixel at each disparity. Successive 3D convolution (or 2D convolution plus aggregation) stages then regularize and refine this cost volume, combining strong local cues, such as edges and textures, with global scene context, such as object shapes and layout priors, to resolve ambiguities. Finally, a soft-argmax or classification layer converts the refined cost volume into a dense disparity map. This is often followed by lightweight refinement modules that enforce sub-pixel accuracy and respect learned priors (for example, smoothness within objects and sharp transitions at boundaries), yielding a coherent estimate in scenarios where classical algorithms falter.
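The cost-volume and soft-argmax steps described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the ESS or FoundationStereo implementation: the per-pixel features are hand-made vectors standing in for backbone outputs, the cost volume is a dot-product correlation, and the 3D-convolution refinement stage is skipped entirely. The names `correlation_volume`, `soft_argmax`, and `one_hot` are hypothetical helpers introduced for this example.

```python
import math


def correlation_volume(left_feats, right_feats, max_disp):
    """volume[x][d] = dot(left_feats[x], right_feats[x - d]); -inf where x - d < 0."""
    volume = []
    for x, f_left in enumerate(left_feats):
        row = []
        for d in range(max_disp + 1):
            if x - d < 0:
                row.append(float("-inf"))  # candidate lies outside the right image
            else:
                row.append(sum(a * b for a, b in zip(f_left, right_feats[x - d])))
        volume.append(row)
    return volume


def soft_argmax(scores, temperature=1.0):
    """Expected disparity under softmax(scores / temperature)."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


def one_hot(i, n, scale=4.0):
    """Stand-in for a learned feature vector: distinct per pixel."""
    return [scale if j == i else 0.0 for j in range(n)]


# Right-image features for one scanline; the left view is shifted by 1 pixel.
right_feats = [one_hot(i, 5) for i in range(5)]
left_feats = [one_hot(0, 5)] + right_feats[:-1]

volume = correlation_volume(left_feats, right_feats, max_disp=2)
disparity = [soft_argmax(row) for row in volume]

# When two neighboring disparities score equally, the estimate falls between them.
between = soft_argmax([0.0, 2.0, 2.0])
```

Because soft-argmax takes a probability-weighted average instead of a hard maximum, it is differentiable and can return fractional disparities, which is one way such networks reach sub-pixel accuracy.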

Isaac ROS NITROS Acceleration

This package is powered by NVIDIA Isaac Transport for ROS (NITROS), which leverages type adaptation and negotiation to optimize message formats and dramatically accelerate communication between participating nodes.

Performance

| Sample Graph                       | Input Size | AGX Thor                  | x86_64 w/ RTX 5090       |
|------------------------------------|------------|---------------------------|--------------------------|
| DNN Stereo Disparity Node (Full)   | 576p       | 178 fps, 22 ms @ 30 Hz    | 350 fps, 5.6 ms @ 30 Hz  |
| DNN Stereo Disparity Node (Light)  | 288p       | 350 fps, 9.4 ms @ 30 Hz   | 350 fps, 5.0 ms @ 30 Hz  |
| DNN Stereo Disparity Graph (Full)  | 576p       | 73.6 fps, 29 ms @ 30 Hz   | 348 fps, 8.5 ms @ 30 Hz  |
| DNN Stereo Disparity Graph (Light) | 288p       | 219 fps, 17 ms @ 30 Hz    | 350 fps, 7.3 ms @ 30 Hz  |


Documentation

Please visit the Isaac ROS Documentation to learn how to use this repository.


Packages

Latest

Update 2025-10-24: Added FoundationStereo package