Hanyang Wang*,
Fangfu Liu*,
Jiawei Chi,
Yueqi Duan
*Equal Contribution.
Tsinghua University
(Teaser video: teaser_video.mp4)
Building on ReconX, VideoScene is a turbo-version follow-up: it distills a video diffusion model to generate 3D scenes in one step.
To get started, clone this project, create a conda virtual environment using Python 3.10+, and install the requirements:
- Clone VideoScene.
git clone https://github.com/hanyang-21/VideoScene
cd VideoScene
- Create the environment. Here we show an example using conda (an optional sanity check for the install is shown after this list).
conda create -y -n videoscene python=3.10
conda activate videoscene
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
- Optional: compile the CUDA kernels for RoPE (as in CroCo v2).
# NoPoSplat relies on RoPE positional embeddings, for which you can compile CUDA kernels for faster runtime.
cd src/model/encoder/backbone/croco/curope/
python setup.py build_ext --inplace
cd ../../../../../..
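After installation, a quick optional sanity check (not part of the official setup) is to confirm that PyTorch was built with CUDA support and can see your GPU:

```bash
# Optional sanity check: print the installed torch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```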
Our VideoScene uses the same training datasets as pixelSplat. Below we quote pixelSplat's detailed instructions on getting datasets.
pixelSplat was trained using versions of the RealEstate10k and ACID datasets that were split into ~100 MB chunks for use on server cluster file systems. Small subsets of the Real Estate 10k and ACID datasets in this format can be found here. To use them, simply unzip them into a newly created `datasets` folder in the project root directory.
If you would like to convert downloaded versions of the Real Estate 10k and ACID datasets to our format, you can use the scripts here. Reach out to us (pixelSplat) if you want the full versions of our processed datasets, which are about 500 GB and 160 GB for Real Estate 10k and ACID respectively.
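For reference, after unzipping the subsets (or converting the full datasets), the data should end up in a layout roughly like the one below. The directory names follow the pixelSplat chunk convention; the exact chunk file names will differ, and this sketch assumes the RealEstate10k subset:

```
datasets/
└── re10k/
    ├── train/
    │   ├── 000000.torch
    │   ├── ...
    │   └── index.json
    └── test/
        ├── 000000.torch
        ├── ...
        └── index.json
```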
- Download our pretrained weights (`VideoScene/checkpoints/model.safetensors` and `VideoScene/checkpoints/prompt_embeds.pt`) and save them to `checkpoints`.
- For customized image inputs, get the NoPoSplat pretrained models and save them to `checkpoints/noposplat`.
- For RealEstate10K datasets, get the MVSplat pretrained models and save them to `checkpoints/mvsplat`.
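For reference, after downloading everything, the `checkpoints` folder should look roughly like this (file names are taken from the commands below; you only need the parts required by the demo you plan to run):

```
checkpoints/
├── model.safetensors
├── prompt_embeds.pt
├── noposplat/
│   └── mixRe10kDl3dv_512x512.ckpt
└── mvsplat/
    └── re10k.ckpt
```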
In this demo, you can run VideoScene on your machine to generate a video from unposed input views.
- Select an image pair that depicts the same scene and hit "RUN" to generate a video of the scene.
python -m noposplat.src.app \
checkpointing.load=checkpoints/noposplat/mixRe10kDl3dv_512x512.ckpt \
test.video=checkpoints/model.safetensors
# or simply run: bash demo.sh
- The generated video will be stored under `outputs/gradio`.
To generate videos on the RealEstate10K dataset, we use an MVSplat pretrained model.
- Run the following:
# re10k
python -m mvsplat.src.main +experiment=re10k \
checkpointing.load=checkpoints/mvsplat/re10k.ckpt \
mode=test \
dataset/view_sampler=evaluation \
dataset.view_sampler.index_path=mvsplat/assets/evaluation_index_re10k_video.json \
test.save_video=true \
test.save_image=false \
test.compute_scores=false \
test.video=checkpoints/model.safetensors
# or simply run: bash inference.sh
- The generated videos will be stored under `outputs/test`.
If you find this work helpful, please consider citing:

@misc{wang2025videoscenedistillingvideodiffusion,
title={VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step},
author={Hanyang Wang and Fangfu Liu and Jiawei Chi and Yueqi Duan},
year={2025},
eprint={2504.01956},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.01956},
}
This project is built upon several fantastic repos: ReconX, MVSplat, NoPoSplat, CogVideo, and CogvideX-Interpolation. Many thanks to these projects for their excellent contributions!