Official implementation of AnchorSync:
We introduce AnchorSync, a diffusion-based framework for long video editing that explicitly tackles long-term consistency and short-term continuity in a unified architecture. Our approach decouples editing into two stages: (1) anchor frame editing, where a sparse set of representative frames is jointly edited through a progressive denoising process; to ensure global coherence, we inject a trainable bidirectional attention module into a diffusion model to capture pairwise structural dependencies between distant frames, and apply Plug-and-Play (PnP) inversion and feature injection for controllable editing; and (2) intermediate frame interpolation, where a video diffusion model equipped with a newly trained multimodal ControlNet guides generation with both optical flow and edge maps, yielding temporally smooth, structure-aware transitions between anchor frames.
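To make the idea of bidirectional attention across anchor frames concrete, here is a minimal NumPy sketch (names, shapes, and the single-head layout are illustrative assumptions, not the actual implementation): tokens from all anchor frames are flattened into one joint sequence, so every token can attend to tokens of every other, possibly distant, frame.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_bidirectional_attention(frames, d=16, seed=0):
    """frames: (F, T, d) -- F anchor frames with T tokens each.
    Flattens all frames into one sequence so attention is computed
    over all pairwise (frame, token) combinations."""
    F, T, _ = frames.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    x = frames.reshape(F * T, d)            # joint sequence over all anchors
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))    # (F*T, F*T): pairwise, bidirectional
    return (attn @ v).reshape(F, T, d)

frames = np.random.default_rng(1).standard_normal((4, 8, 16))  # 4 anchor frames
out = joint_bidirectional_attention(frames)
print(out.shape)  # (4, 8, 16)
```

In the real model this attention operates on diffusion features during denoising; the sketch only shows why joint attention couples distant frames.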
conda create -n anchorsync python=3.10
conda activate anchorsync
python3 -m pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118
python3 -m pip install -r requirements.txt --no-deps
python3 -m pip install xformers==0.0.25 --index-url https://download.pytorch.org/whl/cu118
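After installing, a quick sanity check that the key packages resolve (a small helper of our own, not part of the repo; it only reports availability and, when torch is present, whether CUDA is usable):

```python
import importlib.util

def installed(name):
    """True if the module can be found without importing it."""
    return importlib.util.find_spec(name) is not None

status = {m: installed(m) for m in ("torch", "torchvision", "torchaudio", "xformers")}
for mod, ok in status.items():
    print(f"{mod:12s} {'found' if ok else 'MISSING'}")

if status["torch"]:
    import torch
    print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
```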
Download at least 10,000 videos from the Panda-70M dataset to a local directory and point the video_folder parameter at it.
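A small helper to verify how many clips actually landed in video_folder before training (our own sketch; adjust the extension filter to whatever format your downloader produces):

```python
from pathlib import Path

def count_videos(video_folder, exts=(".mp4", ".mkv", ".webm")):
    """Recursively count video files under video_folder."""
    root = Path(video_folder)
    return sum(1 for p in root.rglob("*") if p.suffix.lower() in exts)

# Example:
# n = count_videos("/path/to/panda70m")
# print(n, "clips found;", "OK" if n >= 10_000 else "need more for training")
```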
Download stable-diffusion-v1-5, the Canny ControlNet for SD 1.5, and stable-video-diffusion-img2vid-xt, then update the corresponding checkpoint paths.
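These base checkpoints are available on the Hugging Face Hub; the snippet below just assembles `huggingface-cli download` commands for the usual repo ids (the repo ids and local destinations are our assumptions; substitute your own mirrors or paths):

```python
# Assumed Hugging Face repo ids for the three base checkpoints.
CHECKPOINTS = {
    "runwayml/stable-diffusion-v1-5": "checkpoints/stable-diffusion-v1-5",
    "lllyasviel/control_v11p_sd15_canny": "checkpoints/controlnet-canny-sd15",
    "stabilityai/stable-video-diffusion-img2vid-xt": "checkpoints/svd-img2vid-xt",
}

commands = [
    f"huggingface-cli download {repo} --local-dir {dest}"
    for repo, dest in CHECKPOINTS.items()
]
for cmd in commands:
    print(cmd)
```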
First, train the joint frame diffusion model (stage 1, anchor frame editing):
bash train_models/train_scripts/train_joint_frame_lora.sh
Second, train the multimodal ControlNet for SVD (stage 2, intermediate frame interpolation):
bash train_models/train_scripts/train_controlnet_canny+flow.sh
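The multimodal ControlNet conditions on both an edge map and optical flow; one plausible way to pack them into a single conditioning tensor is channel-wise concatenation. This NumPy sketch shows that assumed layout, not the exact training format:

```python
import numpy as np

def pack_conditions(edge, flow):
    """edge: (H, W) Canny map in [0, 1]; flow: (H, W, 2) optical flow (dx, dy).
    Returns a (3, H, W) channels-first conditioning tensor:
    channel 0 = edges, channels 1-2 = flow components."""
    assert edge.shape == flow.shape[:2], "edge and flow must share spatial size"
    cond = np.concatenate([edge[None], np.moveaxis(flow, -1, 0)], axis=0)
    return cond.astype(np.float32)

edge = np.zeros((64, 64))
flow = np.zeros((64, 64, 2))
print(pack_conditions(edge, flow).shape)  # (3, 64, 64)
```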
If you prefer not to train, the pretrained checkpoints can be downloaded from the provided link: place the joint frame LoRA in {joint_lora_path} and the multimodal ControlNet in {multimodal_controlnet_path}.
Put your input videos in data/, named "{case_name}.mp4".
Step 1: invert the anchor frames.
python run_models/run_inference_joint_frame_video_fusion_guidance_inversion.py --case_name "mountain-new" --invert_prompt "Vast Mountain Landscape under Clear Blue Sky" --joint_lora_dir "output_dir/joint_frame_lora"
Step 2: edit the anchor frames with the target prompt.
python run_models/run_inference_joint_frame_video_fusion_guidance_forward.py --case_name "mountain-new" --invert_prompt "Vast Mountain Landscape under Clear Blue Sky" --prompt "Chinese Ink Wash Painting of Mountain Landscape under Clear Sky" --joint_lora_dir "output_dir/joint_frame_lora"
Step 3: interpolate the intermediate frames with the multimodal ControlNet.
python run_models/run_inference_trans_controlnet_canny_flow_video_fusion_guidance_pnp.py --case_name "mountain-new" --prompt "Chinese Ink Wash Painting of Mountain Landscape under Clear Sky" --multimodal_controlnet_path "output_dir/multimodal-controlnet"
The same three steps for a second example:
python run_models/run_inference_joint_frame_video_fusion_guidance_inversion.py --case_name "forest-2" --invert_prompt "A forest path in morning sunlight with green trees and long shadows" --joint_lora_dir "output_dir/joint_frame_lora"
python run_models/run_inference_joint_frame_video_fusion_guidance_forward.py --case_name "forest-2" --invert_prompt "A forest path in morning sunlight with green trees and long shadows" --prompt "A forest path covered in snow during a winter sunrise" --joint_lora_dir "output_dir/joint_frame_lora"
python run_models/run_inference_trans_controlnet_canny_flow_video_fusion_guidance_pnp.py --case_name "forest-2" --prompt "A forest path covered in snow during a winter sunrise" --multimodal_controlnet_path "output_dir/multimodal-controlnet"
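The inference commands above follow one pattern per case (inversion, then forward editing, then ControlNet interpolation); a small dry-run helper of our own that assembles them for any case, assuming the script paths and flags shown above:

```python
def build_pipeline_commands(case_name, invert_prompt, prompt,
                            joint_lora_dir="output_dir/joint_frame_lora",
                            controlnet_path="output_dir/multimodal-controlnet"):
    """Return the three inference commands, in order, as argv lists."""
    base = "run_models/run_inference"
    return [
        ["python", f"{base}_joint_frame_video_fusion_guidance_inversion.py",
         "--case_name", case_name, "--invert_prompt", invert_prompt,
         "--joint_lora_dir", joint_lora_dir],
        ["python", f"{base}_joint_frame_video_fusion_guidance_forward.py",
         "--case_name", case_name, "--invert_prompt", invert_prompt,
         "--prompt", prompt, "--joint_lora_dir", joint_lora_dir],
        ["python", f"{base}_trans_controlnet_canny_flow_video_fusion_guidance_pnp.py",
         "--case_name", case_name, "--prompt", prompt,
         "--multimodal_controlnet_path", controlnet_path],
    ]

cmds = build_pipeline_commands(
    "mountain-new",
    "Vast Mountain Landscape under Clear Blue Sky",
    "Chinese Ink Wash Painting of Mountain Landscape under Clear Sky",
)
for c in cmds:
    print(" ".join(c))
# To actually run them: subprocess.run(c, check=True) for each c, in order.
```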
This codebase is built upon Stable Diffusion, ControlNet, and Stable Video Diffusion. We thank the authors for their great work and open-source repositories!