Mingtao Guo1 Guanyu Xing2 Yanci Zhang3 Yanli Liu1,3
1 National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China
2 School of Cyber Science and Engineering, Sichuan University, Chengdu, China
3 College of Computer Science, Sichuan University, Chengdu, China
To replicate the main results (as shown in Fig. 2), run inference with the following source image and driving video pairs:
- resources/source1.png -- resources/driving1.mp4
- resources/source2.png -- resources/driving2.mp4
- resources/source3.png -- resources/driving3.mp4
- resources/source4.png -- resources/driving4.mp4
- resources/source5.png -- resources/driving5.mp4
You may modify the source image and driving video paths in inference.py to test with your own inputs (see the sketch below).
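For reference, the input paths inside inference.py might be configured as in the minimal sketch below; the variable names used here are hypothetical, so check the actual script for the exact names.

# Hypothetical illustration of the input paths edited in inference.py;
# the actual variable names in the script may differ.
source_image_path = "resources/source1.png"    # replace with your own source portrait
driving_video_path = "resources/driving1.mp4"  # replace with your own driving video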
Hardware Requirements
- GPU: NVIDIA RTX 4090 or equivalent
- VRAM: At least 12 GB recommended
- Inference Time: Approximately 4 minutes per 100-frame video on an RTX 4090
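Before running inference, you can optionally confirm that a suitable GPU is visible. The snippet below is a minimal PyTorch sanity check (not part of the repository) that reports the detected GPU and its total VRAM:

# Optional GPU/VRAM sanity check (not part of the repository).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 12:
        print("Warning: less than the recommended 12 GB of VRAM.")
else:
    print("No CUDA-capable GPU detected.")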
We will make the following available:
- Model inference code
- Model checkpoint
- Training code
- Clone this repo locally:
git clone https://github.com/MingtaoGuo/Face-Reenactment-Video-Diffusion
cd Face-Reenactment-Video-Diffusion
- Create and activate a conda environment:
conda create -n frvd python=3.8
conda activate frvd
- Install packages for inference:
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
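To confirm that the environment is set up correctly, a quick check like the sketch below (optional, not part of the repository) verifies the installed PyTorch build and CUDA availability:

# Optional environment check (not part of the repository).
import torch, torchvision

print("torch:", torch.__version__)              # expected: 2.2.2+cu121
print("torchvision:", torchvision.__version__)  # expected: 0.17.2+cu121
print("CUDA available:", torch.cuda.is_available())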
- Download the pretrained weights:
mkdir pretrained_weights
mkdir pretrained_weights/checkpoint-30000-14frames
mkdir pretrained_weights/facecropper
mkdir pretrained_weights/liveportrait
git-lfs install
git clone https://huggingface.co/MartinGuo/Face-Reenactment-Video-Diffusion
mv Face-Reenactment-Video-Diffusion/head_embedder.pth pretrained_weights/checkpoint-30000-14frames
mv Face-Reenactment-Video-Diffusion/warping_feature_mapper.pth pretrained_weights/checkpoint-30000-14frames
mv Face-Reenactment-Video-Diffusion/insightface pretrained_weights/facecropper
mv Face-Reenactment-Video-Diffusion/landmark.onnx pretrained_weights/facecropper
mv Face-Reenactment-Video-Diffusion/appearance_feature_extractor.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/motion_extractor.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/spade_generator.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/warping_module.pth pretrained_weights/liveportrait
git clone https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
mv stable-video-diffusion-img2vid-xt pretrained_weights
git clone https://huggingface.co/stabilityai/sd-vae-ft-mse
mv sd-vae-ft-mse pretrained_weights/stable-video-diffusion-img2vid-xt
The weights will be saved in the ./pretrained_weights directory. Please note that the download process may take a significant amount of time. Once completed, the weights should be arranged in the following structure:
./pretrained_weights/
|-- checkpoint-30000-14frames
|   |-- warping_feature_mapper.pth
|   |-- head_embedder.pth
|-- facecropper
|   |-- insightface
|   |-- landmark.onnx
|-- liveportrait
|   |-- appearance_feature_extractor.pth
|   |-- motion_extractor.pth
|   |-- spade_generator.pth
|   |-- warping_module.pth
|-- stable-video-diffusion-img2vid-xt
|   |-- sd-vae-ft-mse
|   |   |-- config.json
|   |   |-- diffusion_pytorch_model.bin
|   |-- feature_extractor
|   |   |-- preprocessor_config.json
|   |-- scheduler
|   |   |-- scheduler_config.json
|   |-- model_index.json
|   |-- unet
|   |   |-- config.json
|   |   |-- diffusion_pytorch_model.safetensors
|   |   |-- diffusion_pytorch_model.fp16.safetensors
|   |-- image_encoder
|   |   |-- config.json
|   |   |-- model.safetensors
|   |   |-- model.fp16.safetensors
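To verify that the weights ended up in the layout shown above, a small check script like the sketch below (not part of the repository) can be run from the repository root:

# Verify the pretrained_weights layout described above (not part of the repository).
from pathlib import Path

root = Path("pretrained_weights")
expected = [
    "checkpoint-30000-14frames/head_embedder.pth",
    "checkpoint-30000-14frames/warping_feature_mapper.pth",
    "facecropper/insightface",
    "facecropper/landmark.onnx",
    "liveportrait/appearance_feature_extractor.pth",
    "liveportrait/motion_extractor.pth",
    "liveportrait/spade_generator.pth",
    "liveportrait/warping_module.pth",
    "stable-video-diffusion-img2vid-xt/model_index.json",
    "stable-video-diffusion-img2vid-xt/sd-vae-ft-mse/config.json",
    "stable-video-diffusion-img2vid-xt/unet/config.json",
    "stable-video-diffusion-img2vid-xt/image_encoder/config.json",
]
missing = [p for p in expected if not (root / p).exists()]
print("All expected weights found." if not missing else f"Missing: {missing}")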
- Run inference:
python inference.py
After running inference.py, you will get the reenactment results.
- Train the model:
python train.py
We first thank the contributors of the StableVideoDiffusion, SVD_Xtend, and MimicMotion repositories for their open research and exploration. Our repository also incorporates code from LivePortrait and InsightFace, and we extend our thanks to them as well.
This project is licensed under the MIT License. See the LICENSE file for details.