VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Hanyang Wang*, Fangfu Liu*, Jiawei Chi, Yueqi Duan
*Equal Contribution.
Tsinghua University

CVPR 2025 Highlight 🔥

arXiv | Home Page

VideoScene is a one-step video diffusion model that bridges the gap from video to 3D.

[Teaser video: teaser_video.mp4]

Building on ReconX, VideoScene advances it into a turbo version that generates 3D scenes in a single step.

Installation

To get started, clone this project, create a conda virtual environment using Python 3.10+, and install the requirements:

  1. Clone VideoScene.
git clone https://github.com/hanyang-21/VideoScene
cd VideoScene
  2. Create the environment; here we show an example using conda.
conda create -y -n videoscene python=3.10
conda activate videoscene
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
  3. Optional: compile the CUDA kernels for RoPE (as in CroCo v2).
# NoPoSplat relies on RoPE positional embeddings, for which you can compile CUDA kernels for a faster runtime.
cd src/model/encoder/backbone/croco/curope/
python setup.py build_ext --inplace
cd ../../../../../..
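
A quick sanity check that the PyTorch install matches your CUDA setup (plain PyTorch calls, nothing repo-specific):

# should report the 2.1.2 build and True if the GPU build is working
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"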

Acquiring Datasets

RealEstate10K and ACID

Our VideoScene uses the same training datasets as pixelSplat. Below we quote pixelSplat's detailed instructions on getting datasets.

pixelSplat was trained using versions of the RealEstate10k and ACID datasets that were split into ~100 MB chunks for use on server cluster file systems. Small subsets of the Real Estate 10k and ACID datasets in this format can be found here. To use them, simply unzip them into a newly created datasets folder in the project root directory.

If you would like to convert downloaded versions of the Real Estate 10k and ACID datasets to our format, you can use the scripts here. Reach out to us (pixelSplat) if you want the full versions of our processed datasets, which are about 500 GB and 160 GB for Real Estate 10k and ACID respectively.
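
For concreteness, here is a minimal sketch of the unzip step; the archive names are assumptions, so substitute whatever the subset download actually provides:

# run from the project root; archive names below are placeholders
mkdir -p datasets
unzip re10k_subset.zip -d datasets
unzip acid_subset.zip -d datasets
# expected result: scene chunks under datasets/re10k/... and datasets/acid/...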

Downloading Checkpoints

  • download our pretrained weights (VideoScene/checkpoints/model.safetensors and VideoScene/checkpoints/prompt_embeds.pt), and save them to checkpoints.

  • for customized image inputs, get the NoPoSplat pretrained models, and save them to checkpoints/noposplat.

  • for the RealEstate10K dataset, get the MVSplat pretrained models, and save them to checkpoints/mvsplat (see the layout sketch below).
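
After these downloads, the checkpoints folder should look roughly as follows; the four file names are the ones referenced by the commands in this README:

checkpoints/
├── model.safetensors
├── prompt_embeds.pt
├── noposplat/
│   └── mixRe10kDl3dv_512x512.ckpt
└── mvsplat/
    └── re10k.ckpt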

Running the Code

Gradio Demo

In this demo, you can run VideoScene on your machine to generate a video with unposed input views.

  • select an image pair that depicts the same scene and hit "RUN" to generate a video of the scene.
python -m noposplat.src.app \
    checkpointing.load=checkpoints/noposplat/mixRe10kDl3dv_512x512.ckpt \
    test.video=checkpoints/model.safetensors

# alternatively, run "bash demo.sh"
  • the generated video will be stored under outputs/gradio

Inference

To generate videos on the RealEstate10K dataset, we use an MVSplat pretrained model:

  • run the following:
# re10k
python -m mvsplat.src.main +experiment=re10k \
checkpointing.load=checkpoints/mvsplat/re10k.ckpt \
mode=test \
dataset/view_sampler=evaluation \
dataset.view_sampler.index_path=mvsplat/assets/evaluation_index_re10k_video.json \
test.save_video=true \
test.save_image=false \
test.compute_scores=false \
test.video=checkpoints/model.safetensors

# alternatively, run "bash inference.sh"
  • the generated video will be stored under outputs/test

BibTeX

@misc{wang2025videoscenedistillingvideodiffusion,
      title={VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step}, 
      author={Hanyang Wang and Fangfu Liu and Jiawei Chi and Yueqi Duan},
      year={2025},
      eprint={2504.01956},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.01956}, 
}

Acknowledgements

This project builds on several fantastic repos: ReconX, MVSplat, NoPoSplat, CogVideo, and CogvideX-Interpolation. Many thanks to these projects for their excellent contributions!
