- [2025.10.11] Docker support is now available, featuring a pre-configured environment with NVIDIA GPU-accelerated FFmpeg.
- [2025.09.29] Depth data for the SpatialVID-HQ dataset is now officially available.
- [2025.09.24] Raw metadata access is now available via a gated HuggingFace dataset to better support community research!
- [2025.09.24] The instructions for camera control have been enhanced and updated.
- [2025.09.18] The SpatialVID dataset is now available on both HuggingFace and ModelScope.
- [2025.09.14] We have also uploaded the SpatialVID-HQ dataset to ModelScope, offering more diverse download options.
- [2025.09.11] Our paper, code, and SpatialVID-HQ dataset are released!
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw videos, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
This section describes how to set up the environment manually. For a simpler, containerized setup, please refer to the Docker Setup and Usage section.
- Necessary packages

  ```bash
  git clone --recursive https://github.com/NJU-3DV/SpatialVID.git
  cd SpatialVID
  conda create -n SpatialVID python=3.10.13
  conda activate SpatialVID
  pip install -r requirements/requirements.txt
  ```
- Package needed for scoring

  ```bash
  pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
  pip install -r requirements/requirements_scoring.txt
  ```

  Ignore the warnings about the `nvidia-nccl-cu12` and `numpy` versions; they are not a problem.

  For FFmpeg, please refer to `INSTALL.md` for detailed installation instructions. After installation, replace the `FFMPEG_PATH` variable in `scoring/motion/inference.py` and `utils/cut.py` with the actual path to your ffmpeg executable (the default is `/usr/local/bin/ffmpeg`). A quick way to check your ffmpeg path and video codecs is sketched after this list.

  [Optional] If your videos use the AV1 codec instead of H.264, you need to install ffmpeg (already included in our requirements script) and then run the following so that the conda-provided OpenCV supports AV1:

  ```bash
  pip uninstall opencv-python
  conda install -c conda-forge opencv==4.11.0
  ```
- Package needed for annotation

  ```bash
  pip install -r requirements/requirements_annotation.txt
  ```

  Compile the extensions for the camera tracking module:

  ```bash
  cd camera_pose_annotation/base
  python setup.py install
  ```
- [Optional] Package needed for visualization

  ```bash
  pip install plotly
  pip install -e viser
  ```
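As referenced in the scoring step above, here is a minimal sketch for locating your ffmpeg executable (the value to use for `FFMPEG_PATH`) and for checking whether a video is encoded with AV1 or H.264; `input.mp4` is a placeholder filename.

```bash
# Locate the ffmpeg executable; use this path as FFMPEG_PATH in
# scoring/motion/inference.py and utils/cut.py
which ffmpeg

# Inspect the video codec of a sample clip (placeholder filename);
# "av1" in the output means the optional OpenCV reinstall above is needed
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name \
  -of default=noprint_wrappers=1:nokey=1 input.mp4
```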
Download the model weights used in our experiments:

```bash
bash scripts/download_checkpoints.sh
```

Or you can manually download the model weights from the following links and place them in the appropriate directories.
| Model | File Name | URL |
|---|---|---|
| Aesthetic Predictor | aesthetic | π |
| MegaSAM | megasam_final | π |
| RAFT | raft-things | π |
| Depth Anything | Depth-Anything-V2-Large | π |
| UniDepth | unidepth-v2-vitl14 | π |
| SAM | sam2.1-hiera-large | π |
The whole pipeline is illustrated in the figure below:
- Scoring

  ```bash
  bash scripts/scoring.sh
  ```

  Inside the `scoring.sh` script, you need to set the following variables:
  - `ROOT_VIDEO`: the directory containing the input video files.
  - `OUTPUT_DIR`: the directory where the output files will be saved.
- Annotation

  ```bash
  bash scripts/annotation.sh
  ```

  Inside the `annotation.sh` script, you need to set the following variables:
  - `CSV`: the CSV file generated by the scoring script, default is `$OUTPUT_DIR/results.csv`.
  - `OUTPUT_DIR`: the directory where the output files will be saved.
- Caption

  ```bash
  bash scripts/caption.sh
  ```

  Inside the `caption.sh` script, you need to set the following variables:
  - `CSV`: the CSV file generated by the annotation script, default is `$OUTPUT_DIR/results.csv`.
  - `SRC_DIR`: the annotation output directory, default is the same as the `OUTPUT_DIR` of the annotation step.
  - `OUTPUT_DIR`: the directory where the output files will be saved.
  - The API keys for the LLM models used in the captioning step. You can replace them with your own API keys.

  A combined run order for the scoring, annotation, and caption stages is sketched after this list.
- Visualization

  - You can visualize the `poses.npy` in the `reconstruction` folder of each annotated clip using the `visualize_pose.py` script.
  - You can visualize the final annotation result (`sgd_cvd_hr.npz`) using the `visualize_megasam.py` script.

    Note that if you want to visualize any clip in our dataset, you first need to use the `pack_clip_assets.py` script to pack the depth, RGB frames, intrinsics, extrinsics, etc. of that clip into a single npz file. You can then run the visualization script on it.
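As referenced above, here is a minimal sketch of the overall run order for the three processing stages. It assumes you have already edited the variables inside each script as described; the comments only restate which inputs each stage expects.

```bash
# Run the pipeline stages in order (edit the variables inside each script first)
bash scripts/scoring.sh      # needs ROOT_VIDEO and OUTPUT_DIR; writes results.csv to OUTPUT_DIR
bash scripts/annotation.sh   # needs CSV (the scoring results.csv) and OUTPUT_DIR
bash scripts/caption.sh      # needs CSV, SRC_DIR (annotation output), OUTPUT_DIR, and LLM API keys
```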
We provide a Dockerfile to create a fully configured environment that includes all dependencies, including a custom-built FFmpeg with NVIDIA acceleration. This is the recommended way to ensure reproducibility and avoid environment-related issues.
Before you begin, ensure your system environment is similar to the configuration below. Version matching is crucial for a successful compilation. The GPU needs to support HEVC; refer to the NVIDIA NVDEC Support Matrix.
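For reference, the standard queries below report the GPU model, driver version, and Docker version on the host; HEVC/NVDEC decode support still has to be checked against the NVIDIA NVDEC Support Matrix.

```bash
# Query the GPU model and driver version
# (look this GPU up in the NVDEC Support Matrix for HEVC decode support)
nvidia-smi --query-gpu=name,driver_version --format=csv

# Report the installed Docker Engine version
docker --version
```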
Before building and running the Docker container, your host machine must be configured to support GPU access for Docker.
- NVIDIA Drivers: Ensure you have the latest NVIDIA drivers installed. You can verify this by running `nvidia-smi`.

- Docker Engine: Install Docker on your system. Follow the official instructions at docs.docker.com/engine/install/.

- NVIDIA Container Toolkit: This toolkit allows Docker containers to access the host's NVIDIA GPU and is required to run containers with GPU support. Install it using the following commands (for Debian/Ubuntu); a quick sanity check is sketched after this list.

  ```bash
  # Add the GPG key
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

  # Add the repository
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

  # Update package lists and install the toolkit
  # (set NVIDIA_CONTAINER_TOOLKIT_VERSION to the desired release,
  # or drop the version pins to install the latest)
  sudo apt-get update
  sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

  # Configure Docker to use the NVIDIA runtime
  sudo nvidia-ctk runtime configure --runtime=docker

  # Restart the Docker daemon to apply the changes
  sudo systemctl restart docker
  ```

  For other operating systems, please refer to the official NVIDIA documentation.
- Docker Image Pre-pulls [optional]: To accelerate the build process, we provide a script to pre-pull the necessary Docker images from a mirror registry.

To build the GPU image with the provided script, run:

```bash
bash scripts/build_gpu_docker.sh
```
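As a quick sanity check of the prerequisites (referenced in the NVIDIA Container Toolkit step above), you can confirm that Docker can reach the GPU. This is the standard test from NVIDIA's documentation; the plain `ubuntu` image suffices because the NVIDIA runtime mounts `nvidia-smi` into the container.

```bash
# Verify that Docker can access the GPU through the NVIDIA runtime
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```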
You can also build and run the image using standard Docker commands from the root of the repository.
- Build the GPU image:

  ```bash
  docker build -f Dockerfile.cuda \
    --build-arg NUM_JOBS=8 \
    -t spatialvid-gpu .
  ```

- Run the container:

  ```bash
  docker run --gpus all --rm -it \
    -v $(pwd):/workspace \
    -w /workspace \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,video,utility \
    spatialvid-gpu bash
  ```
- Verify the environment (inside the container): Once inside the container, you can verify that FFmpeg and PyTorch are correctly installed and can access the GPU.

  ```bash
  # Check the custom FFmpeg build
  /usr/local/bin/ffmpeg -version

  # Check PyTorch and CUDA availability
  python3 -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.version.cuda}, GPU Available: {torch.cuda.is_available()}')"
  ```

  After verification, you can run the pipeline scripts inside the container; a data-mounting example follows this list.
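As a minimal sketch of running a pipeline stage inside the container, the command below mounts a host data directory next to the repository. `/data/raw_videos` is a placeholder path, and `scoring.sh` must still be edited as described in the usage steps above.

```bash
# Hypothetical one-shot run of the scoring stage inside the container
# (/data/raw_videos is a placeholder host directory; adjust to your data)
docker run --gpus all --rm -it \
  -v $(pwd):/workspace \
  -v /data/raw_videos:/data/raw_videos \
  -w /workspace \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,video,utility \
  spatialvid-gpu bash scripts/scoring.sh
```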
Apart from downloading the dataset with terminal commands, we provide scripts to download the SpatialVID / SpatialVID-HQ dataset from HuggingFace. Please refer to the `download_SpatialVID.py` script for more details.
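For reference, a minimal alternative is the Hugging Face CLI. The repository id below is an assumption based on the dataset name, so check the dataset page for the exact id; gated data also requires `huggingface-cli login` first.

```bash
# Assumed dataset id; verify it on the HuggingFace dataset page
pip install -U "huggingface_hub[cli]"
huggingface-cli download SpatialVID/SpatialVID-HQ \
  --repo-type dataset \
  --local-dir ./SpatialVID-HQ
```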
We also provide a script to download the raw videos from YouTube; refer to the `download_YouTube.py` script for more details.
Thanks to the developers and contributors of the following open-source repositories, whose invaluable work has greatly inspired our project:
- Open-Sora: An initiative dedicated to efficiently producing high-quality video.
- MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos.
- Depth Anything V2: A model for monocular depth estimation.
- UniDepthV2: A model for universal monocular metric depth estimation.
- SAM2: A model towards solving promptable visual segmentation in images and videos.
- Viser: A library for interactive 3D visualization in Python.
Our repository is licensed under the Apache 2.0 License. However, if you use MegaSaM or other components in your work, please follow their respective licenses.
@misc{wang2025spatialvidlargescalevideodataset,
title={SpatialVID: A Large-Scale Video Dataset with Spatial Annotations},
author={Jiahao Wang and Yufeng Yuan and Rujie Zheng and Youtian Lin and Jian Gao and Lin-Zhuo Chen and Yajie Bao and Yi Zhang and Chang Zeng and Yanxi Zhou and Xiaoxiao Long and Hao Zhu and Zhaoxiang Zhang and Xun Cao and Yao Yao},
year={2025},
eprint={2509.09676},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.09676},
}
