SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Jiahao Wang¹*  Yufeng Yuan¹*  Rujie Zheng¹*  Youtian Lin¹  Jian Gao¹  Lin-Zhuo Chen¹
Yajie Bao¹  Yi Zhang¹  Chang Zeng¹  Yanxi Zhou¹  Xiaoxiao Long¹  Hao Zhu¹
Zhaoxiang Zhang²  Xun Cao¹  Yao Yao¹†
¹Nanjing University  ²Institute of Automation, Chinese Academy of Sciences

🎉 NEWS

  • [2025.10.11] 🐳 Docker support is now available, featuring a pre-configured environment with NVIDIA GPU-accelerated FFmpeg.
  • [2025.09.29] 🚀 Depth data for the SpatialVID-HQ dataset is now officially available.
  • [2025.09.24] 🤗 Raw metadata is now accessible via a gated HuggingFace dataset to better support community research!
  • [2025.09.24] 🔭 Enhanced instructions for better camera control have been released.
  • [2025.09.18] 🎆 The SpatialVID dataset is now available on both HuggingFace and ModelScope.
  • [2025.09.14] 📢 We have also uploaded the SpatialVID-HQ dataset to ModelScope, offering more diverse download options.
  • [2025.09.11] 🔥 Our paper, code, and the SpatialVID-HQ dataset are released!

Abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw videos, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.

Preparation

This section describes how to set up the environment manually. For a simpler, containerized setup, please refer to the Docker Setup and Usage section.

Environment

  1. Necessary packages

    git clone --recursive https://github.com/NJU-3DV/SpatialVID.git
    cd SpatialVID
    conda create -n SpatialVID python=3.10.13
    conda activate SpatialVID
    pip install -r requirements/requirements.txt
  2. Package needed for scoring

    pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
    pip install -r requirements/requirements_scoring.txt

    Ignore the warnings about the nvidia-nccl-cu12 and numpy versions; they are not a problem.

    For FFmpeg, please refer to INSTALL.md for detailed installation instructions. After installation, set the FFMPEG_PATH variable in scoring/motion/inference.py and utils/cut.py to the actual path of your ffmpeg executable (the default is /usr/local/bin/ffmpeg); see the sketch after this list.

    [Optional] If your videos use the AV1 codec instead of H.264, you need FFmpeg installed (already handled by our requirements script); then run the following so that the conda environment's OpenCV supports AV1:

    pip uninstall opencv-python
    conda install -c conda-forge opencv==4.11.0
  3. Package needed for annotation

    pip install -r requirements/requirements_annotation.txt

    Compile the extensions for the camera tracking module:

    cd camera_pose_annotation/base
    python setup.py install
  4. [Optional] Package needed for visualization

    pip install plotly
    pip install -e viser
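As noted in step 2, FFMPEG_PATH must point at your actual ffmpeg binary. The snippet below is a minimal sketch of one way to do that from the shell; it assumes FFMPEG_PATH is a plain string assignment in both files, and the path shown is only an example, so substitute the output of which ffmpeg on your machine.

# Find your ffmpeg binary (e.g. /usr/local/bin/ffmpeg)
which ffmpeg

# Rewrite the FFMPEG_PATH assignment in both scripts to that path.
# Assumes FFMPEG_PATH is defined as a simple string assignment; the path
# below is only an example, so use the output of "which ffmpeg" instead.
sed -i 's|^FFMPEG_PATH = .*|FFMPEG_PATH = "/usr/local/bin/ffmpeg"|' \
  scoring/motion/inference.py utils/cut.py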

Model Weight

Download the model weights used in our experiments:

bash scripts/download_checkpoints.sh

Or you can manually download the model weights from the following links and place them in the appropriate directories; an illustrative command follows the table.

| Model               | File Name               | URL |
| ------------------- | ----------------------- | --- |
| Aesthetic Predictor | aesthetic               | 🔗  |
| MegaSAM             | megasam_final           | 🔗  |
| RAFT                | raft-things             | 🔗  |
| Depth Anything      | Depth-Anything-V2-Large | 🔗  |
| UniDepth            | unidepth-v2-vitl14      | 🔗  |
| SAM                 | sam2.1-hiera-large      | 🔗  |
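If you download the weights by hand, a pattern like the following can be used. The checkpoints/ directory and the file name are illustrative assumptions, not the pipeline's required layout; check scripts/download_checkpoints.sh for the exact paths and file names it expects, and substitute the real URL from the table above.

# Illustrative manual download (directory and file name are assumptions;
# see scripts/download_checkpoints.sh for the actual layout)
mkdir -p checkpoints
wget -O checkpoints/raft-things.pth "<URL from the table above>"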

Quick Start

The whole pipeline is illustrated in the figure below:

  1. Scoring

    bash scripts/scoring.sh

    Inside the scoring.sh script, you need to set the following variables:

    • ROOT_VIDEO is the directory containing the input video files.
    • OUTPUT_DIR is the directory where the output files will be saved.
  2. Annotation

    bash scripts/annotation.sh

    Inside the annotation.sh script, you need to set the following variables:

    • CSV is the CSV file generated by the scoring script; the default is $OUTPUT_DIR/results.csv.
    • OUTPUT_DIR is the directory where the output files will be saved.
  3. Caption

    bash scripts/caption.sh

    Inside the caption.sh script, you need to set the following variables:

    • CSV is the CSV file generated by the annotation script; the default is $OUTPUT_DIR/results.csv.
    • SRC_DIR is the annotation output directory; by default it is the same as the OUTPUT_DIR used in the annotation step.
    • OUTPUT_DIR is the directory where the output files will be saved.
    • The API keys for the LLMs used in the captioning step; replace them with your own API keys.
  4. Visualization

    • You can visualize the poses.npy in the reconstruction folder of each annotated clip using the visualize_pose.py script.
    • You can visualize the final annotation result (sgd_cvd_hr.npz) using the visualize_megasam.py script.

    Note that to visualize any clip in our dataset, you first need to use the pack_clip_assets.py script to pack that clip's depth, RGB frames, intrinsics, extrinsics, etc. into a single npz file; you can then run the visualization script on it. A sketch of the end-to-end flow follows this list.
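As a rough sketch of how the stages chain together, the commands below summarize the data flow. The paths in the comments are placeholders, the variables are set by editing the corresponding scripts rather than exported on the command line, and the visualization arguments are illustrative, so check each script for its actual options.

# 1) Scoring: set ROOT_VIDEO and OUTPUT_DIR inside scripts/scoring.sh
#    (e.g. ROOT_VIDEO=/data/raw_videos, OUTPUT_DIR=/data/pipeline_out), then run:
bash scripts/scoring.sh

# 2) Annotation: set CSV to the scoring results (default $OUTPUT_DIR/results.csv)
#    and OUTPUT_DIR inside scripts/annotation.sh, then run:
bash scripts/annotation.sh

# 3) Caption: set CSV, SRC_DIR, OUTPUT_DIR, and your LLM API keys inside scripts/caption.sh:
bash scripts/caption.sh

# 4) Visualization of a dataset clip: pack its assets into one npz, then visualize it
#    (arguments omitted here; check the scripts for their actual options)
python pack_clip_assets.py ...
python visualize_megasam.py ...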

Docker Setup and Usage

We provide a Dockerfile to create a fully configured environment that includes all dependencies, including a custom-built FFmpeg with NVIDIA acceleration. This is the recommended way to ensure reproducibility and avoid environment-related issues.

Before you begin, ensure your system environment is compatible with the prerequisites below; version matching is crucial for successfully compiling the custom FFmpeg. Your GPU must support HEVC decoding; refer to the NVIDIA NVDEC Support Matrix.

Prerequisites: Setting up the Host Environment

Before building and running the Docker container, your host machine must be configured to support GPU access for Docker.

  1. NVIDIA Drivers: Ensure you have the latest NVIDIA drivers installed. You can verify this by running nvidia-smi.

  2. Docker Engine: Install Docker on your system. Follow the official instructions at docs.docker.com/engine/install/.

  3. NVIDIA Container Toolkit: This toolkit allows Docker containers to access the host's NVIDIA GPU and is required to run containers with GPU support. Install it using the following commands (for Debian/Ubuntu):

    # Add the GPG key
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    
    # Add the repository
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    
    # Update package lists and install the toolkit
    # (set NVIDIA_CONTAINER_TOOLKIT_VERSION to the desired release, or drop the version pins)
    sudo apt-get update
    sudo apt-get install -y \
      nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
    
    # Configure Docker to use the NVIDIA runtime
    sudo nvidia-ctk runtime configure --runtime=docker
    
    # Restart the Docker daemon to apply the changes
    sudo systemctl restart docker

    For other operating systems, please refer to the official NVIDIA documentation.

  4. Docker Image Pre-pulls [optional]: To accelerate the build process, we provide a script to pre-pull necessary Docker images from a mirror registry.

    bash scripts/build_gpu_docker.sh

Build and Run the Container

You can also build and run the image using standard Docker commands from the root of the repository.

  1. Build the GPU image:

    docker build -f Dockerfile.cuda \
      --build-arg NUM_JOBS=8 \
      -t spatialvid-gpu .
  2. Run the container:

    docker run --gpus all --rm -it \
      -v $(pwd):/workspace \
      -w /workspace \
      -e NVIDIA_DRIVER_CAPABILITIES=compute,video,utility \
      spatialvid-gpu bash
  3. Verify the environment (inside the container): Once inside the container, you can verify that FFmpeg and PyTorch are correctly installed and can access the GPU.

    # Check the custom FFmpeg build
    /usr/local/bin/ffmpeg -version
    
    # Check PyTorch and CUDA availability
    python3 -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.version.cuda}, GPU Available: {torch.cuda.is_available()}')"

Dataset Download

Apart from downloading the dataset using terminal commands, we provide scripts to download the SpatialVID/SpatialVID-HQ dataset from HuggingFace. Please refer to the download_SpatialVID.py script for more details.
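As an example of a terminal-based download, the HuggingFace CLI can fetch the dataset directly. The repository id below is an assumption based on the dataset name; check the HuggingFace dataset page for the exact id, and note that gated access may require logging in first.

# Log in first if the dataset is gated (token from huggingface.co/settings/tokens)
huggingface-cli login

# Assumed dataset id; verify it on the HuggingFace dataset page
huggingface-cli download SpatialVID/SpatialVID-HQ \
  --repo-type dataset \
  --local-dir ./SpatialVID-HQ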

We also provide a script to download the raw videos from YouTube; refer to the download_YouTube.py script for more details.

References

Thanks to the developers and contributors of the following open-source repositories, whose invaluable work has greatly inspired our project:

  • Open-Sora: An initiative dedicated to efficiently producing high-quality video.
  • MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos.
  • Depth Anything V2: A model for monocular depth estimation.
  • UniDepthV2: A model for universal monocular metric depth estimation.
  • SAM2: A model towards solving promptable visual segmentation in images and videos.
  • Viser: A library for interactive 3D visualization in Python.

Our repository is licensed under the Apache 2.0 License. However, if you use MegaSaM or other components in your work, please follow their respective licenses.

Citation

@misc{wang2025spatialvidlargescalevideodataset,
      title={SpatialVID: A Large-Scale Video Dataset with Spatial Annotations}, 
      author={Jiahao Wang and Yufeng Yuan and Rujie Zheng and Youtian Lin and Jian Gao and Lin-Zhuo Chen and Yajie Bao and Yi Zhang and Chang Zeng and Yanxi Zhou and Xiaoxiao Long and Hao Zhu and Zhaoxiang Zhang and Xun Cao and Yao Yao},
      year={2025},
      eprint={2509.09676},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.09676}, 
}