Sili Chen · Hengkai Guo† · Shengnan Zhu · Feihu Zhang
Zilong Huang · Jiashi Feng · Bingyi Kang†
ByteDance
†Corresponding author
This work presents Video Depth Anything, built upon Depth Anything V2, which can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Compared with diffusion-based models, it offers faster inference, fewer parameters, and more accurate, temporally consistent depth.
- 2025-07-03: 🚀🚀🚀 Release an experimental version of training-free streaming video depth estimation.
- 2025-07-03: Release our implementation of training loss.
- 2025-04-25: 🌟🌟🌟 Release metric depth model based on Video-Depth-Anything-Large.
- 2025-04-05: Our paper has been accepted for a highlight presentation at CVPR 2025 (13.5% of the accepted papers).
- 2025-03-11: Add full dataset inference and evaluation scripts.
- 2025-02-08: Enable autocast inference. Support grayscale video, NPZ and EXR output formats.
- 2025-01-21: Paper, project page, code, models, and demo are all released.
- 2025-02-08: 🚀🚀🚀 Inference speed and memory usage improvements:

Model | Latency FP32 (ms) | Latency FP16 (ms) | GPU VRAM FP32 (GB) | GPU VRAM FP16 (GB) |
---|---|---|---|---|
Video-Depth-Anything-V2-Small | 9.1 | 7.5 | 7.3 | 6.8 |
Video-Depth-Anything-V2-Large | 67 | 14 | 26.7 | 23.6 |

The latency and GPU VRAM results are obtained on a single A100 GPU with an input of shape 1 × 32 × 518 × 518.
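The FP16 numbers above rely on autocast inference (enabled in the 2025-02-08 update). As a rough illustration of how such timings can be reproduced, below is a minimal sketch using standard PyTorch CUDA-event timing. It is not the repository's benchmark script: the model is a placeholder module, and the channel dimension of the input is an assumption.

```python
import torch

# Placeholder standing in for the actual video depth model (any nn.Module works here).
model = torch.nn.Identity().cuda().eval()
# 1 clip x 32 frames x 3 x 518 x 518 (the channel dimension is assumed).
frames = torch.randn(1, 32, 3, 518, 518).cuda()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    for _ in range(5):          # warm-up iterations before timing
        model(frames)
    torch.cuda.synchronize()
    start.record()
    model(frames)
    end.record()
    torch.cuda.synchronize()

print(f"latency: {start.elapsed_time(end):.1f} ms")
```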
We provide models at two scales (Small and Large), plus a metric-depth variant of the Large model, for robust and consistent video depth estimation:
Model | Params | Checkpoint |
---|---|---|
Video-Depth-Anything-V2-Small | 28.4M | Download |
Video-Depth-Anything-V2-Large | 381.8M | Download |
Video-Depth-Anything-V2-Large-Metric | 381.8M | Download |
```bash
git clone https://github.com/DepthAnything/Video-Depth-Anything
cd Video-Depth-Anything
pip install -r requirements.txt
```
Download the checkpoints listed here and put them under the `checkpoints` directory, or fetch them with:

```bash
bash get_weights.sh
```
```bash
python3 run.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs --encoder vitl
```
Options:
- `--input_video`: path of the input video
- `--output_dir`: path to save the output results
- `--input_size` (optional): by default, we use input size `518` for model inference.
- `--max_res` (optional): by default, we use maximum resolution `1280` for model inference.
- `--encoder` (optional): `vits` for Video-Depth-Anything-V2-Small, `vitl` for Video-Depth-Anything-V2-Large.
- `--max_len` (optional): maximum length of the input video; `-1` means no limit.
- `--target_fps` (optional): target FPS of the input video; `-1` means the original FPS.
- `--fp32` (optional): use `fp32` precision for inference. By default, we use `fp16`.
- `--grayscale` (optional): save the grayscale depth map without applying a color palette.
- `--save_npz` (optional): save the depth map in `npz` format (see the loading sketch after this list).
- `--save_exr` (optional): save the depth map in `exr` format.
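If you export depth with `--save_npz`, the maps can be read back with NumPy. Below is a minimal sketch; the file name is hypothetical, so inspect your own `--output_dir` for the exact naming convention and array keys written by `run.py`.

```python
import numpy as np

# Hypothetical output file name -- check the actual files in --output_dir.
data = np.load("./outputs/davis_rollercoaster_depth.npz")
print(data.files)              # list the arrays actually stored in the archive
depth = data[data.files[0]]    # e.g. a (num_frames, H, W) array of per-frame depth
print(depth.shape, depth.dtype, depth.min(), depth.max())
```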
We implement an experimental streaming mode without any additional training. In detail, we cache the hidden states of the temporal attention layers for each frame and send only a single frame into the video depth model during inference, reusing these past hidden states in the temporal attentions. The pipeline is adapted to match the original offline inference setting as closely as possible. Due to the inevitable gap between training and testing, we observe a performance drop between the streaming model and the offline model (e.g., the d1 on ScanNet drops from 0.926 to 0.836). Fine-tuning the model in the streaming mode would greatly improve performance; we leave this for future work.
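To make the caching idea above concrete, here is a conceptual PyTorch sketch of a temporal attention layer that processes one frame at a time while attending to a fixed window of cached past frames. It is an illustration only, not the implementation in `run_streaming.py`; the cached keys and values stand in for the stored hidden states, and shapes and module structure are simplified placeholders.

```python
import torch

class CachedTemporalAttention(torch.nn.Module):
    def __init__(self, dim: int, max_cache: int = 32):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.max_cache = max_cache          # fixed temporal window of past frames
        self.k_cache, self.v_cache = [], [] # per-frame keys/values from previous steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim) -- hidden states of the *current* frame only.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        ks = torch.stack(self.k_cache + [k], dim=0)                    # (T, tokens, dim)
        vs = torch.stack(self.v_cache + [v], dim=0)                    # (T, tokens, dim)
        scores = (ks * q.unsqueeze(0)).sum(-1) / k.shape[-1] ** 0.5    # (T, tokens)
        weights = torch.softmax(scores, dim=0)       # each spatial token attends over time
        out = (weights.unsqueeze(-1) * vs).sum(dim=0)                  # (tokens, dim)
        # Update the cache, discarding frames beyond the window.
        self.k_cache = (self.k_cache + [k.detach()])[-self.max_cache:]
        self.v_cache = (self.v_cache + [v.detach()])[-self.max_cache:]
        return self.proj(out)

layer = CachedTemporalAttention(dim=64)
for frame_tokens in torch.randn(4, 100, 64):  # feed 4 frames one at a time
    features = layer(frame_tokens)            # each call reuses the cached history
print(features.shape)                         # torch.Size([100, 64])
```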
To run the streaming model:
```bash
python3 run_streaming.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs_streaming --encoder vitl
```
Options:
- `--input_video`: path of the input video
- `--output_dir`: path to save the output results
- `--input_size` (optional): by default, we use input size `518` for model inference.
- `--max_res` (optional): by default, we use maximum resolution `1280` for model inference.
- `--encoder` (optional): `vits` for Video-Depth-Anything-V2-Small, `vitl` for Video-Depth-Anything-V2-Large.
- `--max_len` (optional): maximum length of the input video; `-1` means no limit.
- `--target_fps` (optional): target FPS of the input video; `-1` means the original FPS.
- `--fp32` (optional): use `fp32` precision for inference. By default, we use `fp16`.
- `--grayscale` (optional): save the grayscale depth map without applying a color palette.
Our training loss is in the `loss/` directory. Please see `loss/test_loss.py` for usage.
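For orientation, here is a generic illustration of an affine (scale-and-shift) invariant depth loss of the kind used throughout the Depth Anything family. It is not the code in `loss/` (which additionally handles temporal consistency across frames); see `loss/test_loss.py` for the actual interface.

```python
import torch

def ssi_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N,) flattened depth/disparity values for one frame."""
    # Solve for the scale and shift that best align pred to target (least squares),
    # then penalize the remaining residual.
    ones = torch.ones_like(pred)
    A = torch.stack([pred, ones], dim=1)                        # (N, 2)
    sol = torch.linalg.lstsq(A, target.unsqueeze(1)).solution   # (2, 1): scale, shift
    aligned = (A @ sol).squeeze(1)                              # pred after affine alignment
    return (aligned - target).abs().mean()

pred = torch.rand(518 * 518)
target = 2.0 * pred + 0.5 + 0.01 * torch.randn_like(pred)  # affine transform of pred plus noise
print(ssi_loss(pred, target))                              # small: only the noise remains
```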
Please refer to Metric Depth.
Please refer to Benchmark.
If you find this project useful, please consider citing:
```bibtex
@article{video_depth_anything,
  title={Video Depth Anything: Consistent Depth Estimation for Super-Long Videos},
  author={Chen, Sili and Guo, Hengkai and Zhu, Shengnan and Zhang, Feihu and Huang, Zilong and Feng, Jiashi and Kang, Bingyi},
  journal={arXiv:2501.12375},
  year={2025}
}
```
The Video-Depth-Anything-Small model is under the Apache-2.0 license. The Video-Depth-Anything-Large model is under the CC-BY-NC-4.0 license. For business cooperation, please send an email to Hengkai Guo at guohengkaighk@gmail.com.