We propose Sketch3DVE, a sketch-based, 3D-aware video editing method that enables detailed local manipulation of videos with significant viewpoint changes. Please check our project page and paper for more information.
Result gallery (Input Video | Edited Image | Generated Video): see the project page.
- [2025.08.03]: 🔥🔥 Released code and model weights.
- [2025.08.05]: Launched the project page and updated the arXiv preprint.
| Model | Resolution | GPU Mem. & Inference Time (A100, DDIM, 50 steps) | Checkpoint |
|---|---|---|---|
| Sketch3DVE | 720x480 | ~27 GB & 53 s | Hugging Face |
Our method is built on the pretrained CogVideoX-2b model, with an additional sketch-conditioned control network added for editing.
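For intuition, here is a minimal conceptual sketch of such a conditioning branch, assuming a ControlNet-style, zero-initialized residual design; the module name, channel sizes, and layers are our own placeholders, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class SketchControlBranch(nn.Module):
    """ControlNet-style branch: encodes a sketch map and adds it as a
    zero-initialized residual to a frozen backbone feature map."""

    def __init__(self, cond_channels: int = 3, feat_channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.SiLU(),
        )
        # Zero-initialized projection: at the start of training the branch
        # contributes nothing, so generation matches the base model exactly.
        self.zero_proj = nn.Conv2d(feat_channels, feat_channels, 1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, backbone_feat: torch.Tensor, sketch: torch.Tensor) -> torch.Tensor:
        return backbone_feat + self.zero_proj(self.encoder(sketch))
```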
Currently, Sketch3DVE supports generating videos of up to 49 frames at a resolution of 720x480. For editing, the input video is assumed to have 49 frames at 720x480.
Inference time can be reduced by using fewer DDIM steps, as sketched below.
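As an illustration of that trade-off, here is a minimal sketch using the stock diffusers CogVideoX-2b text-to-video pipeline (a stand-in for the full Sketch3DVE editing pipeline; the prompt and file name are placeholders):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

# Fewer denoising steps -> faster inference, usually at some cost in quality.
frames = pipe(
    prompt="a sandy beach with gentle waves",  # placeholder prompt
    num_frames=49,
    num_inference_steps=25,  # e.g. 25 instead of 50
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```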
We tested the code with CUDA 11.8 and Python 3.10, so we recommend using the same environment.
```bash
conda create -n sketch3dve python=3.10
conda activate sketch3dve
pip install -r requirements.txt
conda install https://anaconda.org/pytorch3d/pytorch3d/0.7.8/download/linux-64/pytorch3d-0.7.8-py310_cu118_pyt240.tar.bz2
```
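A quick sanity check (our own sketch, not part of the repo) that the key packages resolved to the expected versions:

```python
# Expects diffusers 0.30.1, pytorch3d 0.7.8, and a CUDA 11.8 build of PyTorch.
import torch
import diffusers
import pytorch3d

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("diffusers:", diffusers.__version__)
print("pytorch3d:", pytorch3d.__version__)
print("GPU available:", torch.cuda.is_available())
```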
Notably, `diffusers==0.30.1` is required.
Download the pretrained Dust3R model [Download Link], the DepthAnythingV2 model [hugging face], the LLaVA model [hugging face], and the pretrained CogVideoX-2b video generation model [hugging face]. Then point `--dust3r_model_path`, `--depthanything_model_path`, `--Llava_model_path`, `--basemodel_ckpt_path`, and `--controlnet_ckpt_path` (see the download links above) in `examples/xxx/test.sh` to the corresponding paths.
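Optionally, you can verify those paths before running with a small script like this (our own sketch; the flag names match `test.sh`, but the paths are placeholders you should replace):

```python
from pathlib import Path

# Map each test.sh flag to the path you set for it (placeholder paths below).
checkpoints = {
    "--dust3r_model_path": "checkpoints/dust3r.pth",
    "--depthanything_model_path": "checkpoints/depth_anything_v2.pth",
    "--Llava_model_path": "checkpoints/llava",
    "--basemodel_ckpt_path": "checkpoints/CogVideoX-2b",
    "--controlnet_ckpt_path": "checkpoints/sketch3dve_controlnet",
}
for flag, path in checkpoints.items():
    status = "OK" if Path(path).exists() else "MISSING"
    print(f"{status:8s} {flag} -> {path}")
```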
Edit the example videos:

```bash
cd examples/beach
sh test.sh
```
Please consider citing our paper if you find our code useful:
```bibtex
@inproceedings{liu2025sketch3dve,
  author    = {Liu, Feng-Lin and Li, Shi-Yang and Cao, Yan-Pei and Fu, Hongbo and Gao, Lin},
  title     = {Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing},
  year      = {2025},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  articleno = {152},
  numpages  = {12},
  keywords  = {Sketch-based interaction, video generation, video editing, video diffusion models},
  series    = {SIGGRAPH Conference Papers '25}
}
```
We thank the projects CogVideoX, ControlNet, Dust3R, and DepthAnythingV2. Our code introduction is adapted from the ViewCrafter template.
Our framework achieves interesting sketch-based 3D-aware video editing, but due to the variability of the generative video prior, success is not guaranteed on every input. Trying different random seeds can help produce the best results; see the sketch below.
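A minimal seed-sweep sketch, again using the stock diffusers CogVideoX-2b pipeline as a stand-in for the repo's own scripts (prompt and output names are placeholders):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

# Sample with several seeds and keep the most plausible result.
for seed in (0, 7, 42):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    frames = pipe(
        prompt="a sandy beach with gentle waves",  # placeholder prompt
        num_frames=49,
        num_inference_steps=50,
        generator=generator,
    ).frames[0]
    export_to_video(frames, f"result_seed{seed}.mp4", fps=8)
```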
This project strives to have a positive impact on the domain of AI-driven video generation. Users are free to create videos with this tool, but they are expected to comply with local laws and to use it responsibly. The developers assume no responsibility for potential misuse by users.