SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Jiahao Wang¹*  Yufeng Yuan¹*  Rujie Zheng¹*  Youtian Lin¹  Jian Gao¹  Lin-Zhuo Chen¹
Yajie Bao¹  Yi Zhang¹  Chang Zeng¹  Yanxi Zhou¹  Xiaoxiao Long¹  Hao Zhu¹
Zhaoxiang Zhang²  Xun Cao¹  Yao Yao¹†
¹Nanjing University  ²Institute of Automation, Chinese Academy of Sciences

🎉 NEWS

  • [2025.10.11] 🐳 Docker support is now available, featuring a pre-configured environment with NVIDIA GPU-accelerated FFmpeg.
  • [2025.09.29] 🚀 Depth data for the SpatialVID-HQ dataset is now officially available.
  • [2025.09.24] 🤗 Raw metadata is now accessible via a gated HuggingFace dataset to better support community research!
  • [2025.09.24] 🔭 Enhanced instructions for better camera control have been released.
  • [2025.09.18] 🎆 The SpatialVID dataset is now available on both HuggingFace and ModelScope.
  • [2025.09.14] 📢 We have also uploaded the SpatialVID-HQ dataset to ModelScope, offering more diverse download options.
  • [2025.09.11] 🔥 Our paper, code, and the SpatialVID-HQ dataset are released!

Abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw videos, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.

Preparation

This section describes how to set up the environment manually. For a simpler, containerized setup, please refer to the Docker Setup and Usage section.

Environment

  1. Necessary packages

    git clone --recursive https://github.com/NJU-3DV/SpatialVID.git
    cd SpatialVID
    conda create -n SpatialVID python=3.10.13
    conda activate SpatialVID
    pip install -r requirements/requirements.txt
  2. Package needed for scoring

    pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
    pip install -r requirements/requirements_scoring.txt

    Ignore the warnings about the nvidia-nccl-cu12 and numpy versions; they are not a problem.

    For FFmpeg, please refer to INSTALL.md for detailed installation instructions. After installation, set the FFMPEG_PATH variable in scoring/motion/inference.py and utils/cut.py to the actual path of your ffmpeg executable (the default is /usr/local/bin/ffmpeg); see the sketch after this list.

    [Optional] If your videos use the AV1 codec instead of H.264, you need FFmpeg installed (already handled by our requirements script); then run the following so that the conda environment's OpenCV supports AV1:

    pip uninstall opencv-python
    conda install -c conda-forge opencv==4.11.0
  3. Package needed for annotation

    pip install -r requirements/requirements_annotation.txt

    Compile the extensions for the camera tracking module:

    cd camera_pose_annotation/base
    python setup.py install
  4. [Optional] Package needed for visualization

    pip install plotly
    pip install -e viser
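As noted in step 2, FFMPEG_PATH must point at your actual ffmpeg binary. The snippet below is a minimal sketch of one way to do that from the shell; it assumes FFMPEG_PATH is a plain string assignment in both files, and the path shown is only an example, so substitute the output of which ffmpeg on your machine.

# Find your ffmpeg binary (e.g. /usr/local/bin/ffmpeg)
which ffmpeg

# Rewrite the FFMPEG_PATH assignment in both scripts to that path.
# Assumes FFMPEG_PATH is defined as a simple string assignment; the path
# below is only an example, so use the output of "which ffmpeg" instead.
sed -i 's|^FFMPEG_PATH = .*|FFMPEG_PATH = "/usr/local/bin/ffmpeg"|' \
  scoring/motion/inference.py utils/cut.py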

Model Weight

Download the model weights used in our experiments:

bash scripts/download_checkpoints.sh

Or you can manually download the model weights from the following links and place them in the appropriate directories; an illustrative command follows the table.

| Model               | File Name               | URL |
| ------------------- | ----------------------- | --- |
| Aesthetic Predictor | aesthetic               | 🔗  |
| MegaSAM             | megasam_final           | 🔗  |
| RAFT                | raft-things             | 🔗  |
| Depth Anything      | Depth-Anything-V2-Large | 🔗  |
| UniDepth            | unidepth-v2-vitl14      | 🔗  |
| SAM                 | sam2.1-hiera-large      | 🔗  |
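If you download the weights by hand, a pattern like the following can be used. The checkpoints/ directory and the file name are illustrative assumptions, not the pipeline's required layout; check scripts/download_checkpoints.sh for the exact paths and file names it expects, and substitute the real URL from the table above.

# Illustrative manual download (directory and file name are assumptions;
# see scripts/download_checkpoints.sh for the actual layout)
mkdir -p checkpoints
wget -O checkpoints/raft-things.pth "<URL from the table above>"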

Quick Start

The whole pipeline is illustrated in the figure below:

  1. Scoring

    bash scripts/scoring.sh

    Inside the scoring.sh script, you need to set the following variables:

    • ROOT_VIDEO is the directory containing the input video files.
    • OUTPUT_DIR is the directory where the output files will be saved.
  2. Annotation

    bash scripts/annotation.sh

    Inside the annotation.sh script, you need to set the following variables:

    • CSV is the CSV file generated by the scoring script; the default is $OUTPUT_DIR/results.csv.
    • OUTPUT_DIR is the directory where the output files will be saved.
  3. Caption

    bash scripts/caption.sh

    Inside the caption.sh script, you need to set the following variables:

    • CSV is the CSV file generated by the annotation script; the default is $OUTPUT_DIR/results.csv.
    • SRC_DIR is the annotation output directory; by default it is the same as the OUTPUT_DIR used in the annotation step.
    • OUTPUT_DIR is the directory where the output files will be saved.
    • The API keys for the LLMs used in the captioning step; replace them with your own API keys.
  4. Visualization

    • You can visualize the poses.npy in the reconstruction folder of each annotated clip using the visualize_pose.py script.
    • You can visualize the final annotation result (sgd_cvd_hr.npz) using the visualize_megasam.py script.

    Note that to visualize any clip in our dataset, you first need to use the pack_clip_assets.py script to pack that clip's depth, RGB frames, intrinsics, extrinsics, etc. into a single npz file; you can then run the visualization script on it. A sketch of the end-to-end flow follows this list.
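As a rough sketch of how the stages chain together, the commands below summarize the data flow. The paths in the comments are placeholders, the variables are set by editing the corresponding scripts rather than exported on the command line, and the visualization arguments are illustrative, so check each script for its actual options.

# 1) Scoring: set ROOT_VIDEO and OUTPUT_DIR inside scripts/scoring.sh
#    (e.g. ROOT_VIDEO=/data/raw_videos, OUTPUT_DIR=/data/pipeline_out), then run:
bash scripts/scoring.sh

# 2) Annotation: set CSV to the scoring results (default $OUTPUT_DIR/results.csv)
#    and OUTPUT_DIR inside scripts/annotation.sh, then run:
bash scripts/annotation.sh

# 3) Caption: set CSV, SRC_DIR, OUTPUT_DIR, and your LLM API keys inside scripts/caption.sh:
bash scripts/caption.sh

# 4) Visualization of a dataset clip: pack its assets into one npz, then visualize it
#    (arguments omitted here; check the scripts for their actual options)
python pack_clip_assets.py ...
python visualize_megasam.py ...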

Docker Setup and Usage

We provide a Dockerfile to create a fully configured environment that includes all dependencies, including a custom-built FFmpeg with NVIDIA acceleration. This is the recommended way to ensure reproducibility and avoid environment-related issues.

Before you begin, ensure your system environment is compatible with the prerequisites below; version matching is crucial for successfully compiling the custom FFmpeg. Your GPU must support HEVC decoding; refer to the NVIDIA NVDEC Support Matrix.

Prerequisites: Setting up the Host Environment

Before building and running the Docker container, your host machine must be configured to support GPU access for Docker.

  1. NVIDIA Drivers: Ensure you have the latest NVIDIA drivers installed. You can verify this by running nvidia-smi.

  2. Docker Engine: Install Docker on your system. Follow the official instructions at docs.docker.com/engine/install/.

  3. NVIDIA Container Toolkit: This toolkit allows Docker containers to access the host's NVIDIA GPU and is required to run containers with GPU support. Install it using the following commands (for Debian/Ubuntu):

    # Add the GPG key
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    
    # Add the repository
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    
    # Update package lists and install the toolkit
    # (set NVIDIA_CONTAINER_TOOLKIT_VERSION to the desired release, or drop the version pins)
    sudo apt-get update
    sudo apt-get install -y \
      nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
    
    # Configure Docker to use the NVIDIA runtime
    sudo nvidia-ctk runtime configure --runtime=docker
    
    # Restart the Docker daemon to apply the changes
    sudo systemctl restart docker

    For other operating systems, please refer to the official NVIDIA documentation.

  4. Docker Image Pre-pulls [optional]: To accelerate the build process, we provide a script to pre-pull necessary Docker images from a mirror registry.

    bash scripts/build_gpu_docker.sh

Build and Run the Container

You can also build and run the image using standard Docker commands from the root of the repository.

  1. Build the GPU image:

    docker build -f Dockerfile.cuda \
      --build-arg NUM_JOBS=8 \
      -t spatialvid-gpu .
  2. Run the container:

    docker run --gpus all --rm -it \
      -v $(pwd):/workspace \
      -w /workspace \
      -e NVIDIA_DRIVER_CAPABILITIES=compute,video,utility \
      spatialvid-gpu bash
  3. Verify the environment (inside the container): Once inside the container, you can verify that FFmpeg and PyTorch are correctly installed and can access the GPU.

    # Check the custom FFmpeg build
    /usr/local/bin/ffmpeg -version
    
    # Check PyTorch and CUDA availability
    python3 -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.version.cuda}, GPU Available: {torch.cuda.is_available()}')"

Dataset Download

Apart from downloading the dataset using terminal commands, we provide scripts to download the SpatialVID/SpatialVID-HQ dataset from HuggingFace. Please refer to the download_SpatialVID.py script for more details.
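As an example of a terminal-based download, the HuggingFace CLI can fetch the dataset directly. The repository id below is an assumption based on the dataset name; check the HuggingFace dataset page for the exact id, and note that gated access may require logging in first.

# Log in first if the dataset is gated (token from huggingface.co/settings/tokens)
huggingface-cli login

# Assumed dataset id; verify it on the HuggingFace dataset page
huggingface-cli download SpatialVID/SpatialVID-HQ \
  --repo-type dataset \
  --local-dir ./SpatialVID-HQ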

We also provide a script to download the raw videos from YouTube; refer to the download_YouTube.py script for more details.

References

Thanks to the developers and contributors of the following open-source repositories, whose invaluable work has greatly inspired our project:

  • Open-Sora: An initiative dedicated to efficiently producing high-quality video.
  • MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos.
  • Depth Anything V2: A model for monocular depth estimation.
  • UniDepthV2: A model for universal monocular metric depth estimation.
  • SAM2: A model towards solving promptable visual segmentation in images and videos.
  • Viser: A library for interactive 3D visualization in Python.

Our repository is licensed under the Apache 2.0 License. However, if you use MegaSaM or other components in your work, please follow their respective licenses.

Citation

@misc{wang2025spatialvidlargescalevideodataset,
      title={SpatialVID: A Large-Scale Video Dataset with Spatial Annotations}, 
      author={Jiahao Wang and Yufeng Yuan and Rujie Zheng and Youtian Lin and Jian Gao and Lin-Zhuo Chen and Yajie Bao and Yi Zhang and Chang Zeng and Yanxi Zhou and Xiaoxiao Long and Hao Zhu and Zhaoxiang Zhang and Xun Cao and Yao Yao},
      year={2025},
      eprint={2509.09676},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.09676}, 
}