Hanyang Wang*,
Fangfu Liu*,
Jiawei Chi,
Yueqi Duan
*Equal Contribution.
Tsinghua University
(Teaser video: teaser_video.mp4)
Building on ReconX, VideoScene is a turbo-version follow-up: it distills a video diffusion model to generate 3D scenes in one step.
To get started, clone this project, create a conda virtual environment using Python 3.10+, and install the requirements:
- Clone VideoScene.
git clone https://github.com/hanyang-21/VideoScene
cd VideoScene
- Create the environment. Here we show an example using conda (an optional sanity check for the install is shown after this list).
conda create -y -n videoscene python=3.10
conda activate videoscene
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
- Optional: compile the CUDA kernels for RoPE (as in CroCo v2).
# NoPoSplat relies on RoPE positional embeddings, for which you can compile CUDA kernels for faster runtime.
cd src/model/encoder/backbone/croco/curope/
python setup.py build_ext --inplace
cd ../../../../../..
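After installation, a quick optional sanity check (not part of the official setup) is to confirm that PyTorch was built with CUDA support and can see your GPU:

```bash
# Optional sanity check: print the installed torch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```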
Our VideoScene uses the same training datasets as pixelSplat. Below we quote pixelSplat's detailed instructions on getting datasets.
pixelSplat was trained using versions of the RealEstate10k and ACID datasets that were split into ~100 MB chunks for use on server cluster file systems. Small subsets of the Real Estate 10k and ACID datasets in this format can be found here. To use them, simply unzip them into a newly created `datasets` folder in the project root directory.
If you would like to convert downloaded versions of the Real Estate 10k and ACID datasets to our format, you can use the scripts here. Reach out to us (pixelSplat) if you want the full versions of our processed datasets, which are about 500 GB and 160 GB for Real Estate 10k and ACID respectively.
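For reference, after unzipping the subsets (or converting the full datasets), the data should end up in a layout roughly like the one below. The directory names follow the pixelSplat chunk convention; the exact chunk file names will differ, and this sketch assumes the RealEstate10k subset:

```
datasets/
└── re10k/
    ├── train/
    │   ├── 000000.torch
    │   ├── ...
    │   └── index.json
    └── test/
        ├── 000000.torch
        ├── ...
        └── index.json
```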
- Download our pretrained weights (`VideoScene/checkpoints/model.safetensors` and `VideoScene/checkpoints/prompt_embeds.pt`) and save them to `checkpoints`.
- For customized image inputs, get the NoPoSplat pretrained models and save them to `checkpoints/noposplat`.
- For RealEstate10K datasets, get the MVSplat pretrained models and save them to `checkpoints/mvsplat`.
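For reference, after downloading everything, the `checkpoints` folder should look roughly like this (file names are taken from the commands below; you only need the parts required by the demo you plan to run):

```
checkpoints/
├── model.safetensors
├── prompt_embeds.pt
├── noposplat/
│   └── mixRe10kDl3dv_512x512.ckpt
└── mvsplat/
    └── re10k.ckpt
```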
In this demo, you can run VideoScene on your machine to generate a video from unposed input views.
- Select an image pair that depicts the same scene and hit "RUN" to generate a video of the scene.
python -m noposplat.src.app \
checkpointing.load=checkpoints/noposplat/mixRe10kDl3dv_512x512.ckpt \
test.video=checkpoints/model.safetensors
# or simply run: bash demo.sh
- The generated video will be stored under `outputs/gradio`.
To generate videos on the RealEstate10K dataset, we use an MVSplat pretrained model.
- Run the following:
# re10k
python -m mvsplat.src.main +experiment=re10k \
checkpointing.load=checkpoints/mvsplat/re10k.ckpt \
mode=test \
dataset/view_sampler=evaluation \
dataset.view_sampler.index_path=mvsplat/assets/evaluation_index_re10k_video.json \
test.save_video=true \
test.save_image=false \
test.compute_scores=false \
test.video=checkpoints/model.safetensors
# or simply run: bash inference.sh
- The generated videos will be stored under `outputs/test`.
If you find this work helpful, please consider citing:

@misc{wang2025videoscenedistillingvideodiffusion,
title={VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step},
author={Hanyang Wang and Fangfu Liu and Jiawei Chi and Yueqi Duan},
year={2025},
eprint={2504.01956},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.01956},
}
This project is built upon several fantastic repos: ReconX, MVSplat, NoPoSplat, CogVideo, and CogvideX-Interpolation. Many thanks to these projects for their excellent contributions!