antgroup/echomimic_v3

EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

Terminal Technology Department, Alipay, Ant Group.

1: Core Contributor  2: Corresponding Authors

📣 Updates

  • [2025.08.12] 🔥🚀 12 GB of VRAM is all you need to generate video. Please use this GradioUI and check the tutorial from @gluttony-10. Thanks for the contribution.
  • [2025.08.12] 🔥 EchoMimicV3 can run on 16 GB of VRAM using ComfyUI. Thanks @smthemex for the contribution.
  • [2025.08.09] 🔥 We released our models on ModelScope.
  • [2025.08.08] 🔥 We released our code on GitHub and our models on Huggingface.
  • [2025.07.08] 🔥 Our paper is publicly available on arXiv.

🌅 Gallery

teaser_github.mp4
hoi_github.mp4

Chinese Driven Audio

01.mp4
02.mp4
03.mp4
04.mp4

For more demo videos, please refer to the project page.

Quick Start

Environment Setup

  • Tested System Environment: CentOS 7.2 / Ubuntu 22.04, CUDA >= 12.1
  • Tested GPUs: A100 (80 GB) / RTX 4090D (24 GB) / V100 (16 GB)
  • Tested Python Version: 3.10 / 3.11
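
As a quick sanity check before installing, the tested versions above can be compared against the local machine. The sketch below is only illustrative: the helper names `python_is_tested` and `cuda_is_tested` are made up for this example and are not part of the repository, and the CUDA check runs only if PyTorch is already installed.

```python
import sys

# Versions listed as tested in this README.
TESTED_PYTHON = {(3, 10), (3, 11)}
MIN_CUDA = (12, 1)


def python_is_tested(version_info=sys.version_info):
    """Return True if the interpreter's major.minor is in the tested set."""
    return (version_info[0], version_info[1]) in TESTED_PYTHON


def cuda_is_tested(cuda_version):
    """Return True if a CUDA version string such as '12.4' is >= 12.1."""
    if not cuda_version:
        return False
    parts = [int(p) for p in cuda_version.split(".")[:2]]
    parts += [0] * (2 - len(parts))  # treat "12" as 12.0
    return tuple(parts) >= MIN_CUDA


if __name__ == "__main__":
    print("Python tested:", python_is_tested())
    try:
        import torch  # only available once the dependencies are installed

        print("CUDA tested:", cuda_is_tested(torch.version.cuda))
        if torch.cuda.is_available():
            total = torch.cuda.get_device_properties(0).total_memory
            print(f"GPU 0 VRAM: {total / 1024**3:.1f} GiB")
    except ImportError:
        print("PyTorch is not installed yet; skipping the CUDA check.")
```

An untested version is not necessarily unsupported; the check only reports whether the local setup matches what is listed here.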

🛠️Installation for Windows

Please use the one-click installation package to get started quickly with the quantized version.

🛠️Installation for Linux

1. Create a conda environment

conda create -n echomimic_v3 python=3.10
conda activate echomimic_v3

2. Install the other dependencies

pip install -r requirements.txt

🧱Model Preparation

Models | Download Link | Notes
Wan2.1-Fun-V1.1-1.3B-InP | 🤗 Huggingface | Base model
wav2vec2-base | 🤗 Huggingface | Audio encoder
EchoMimicV3-preview | 🤗 Huggingface | Our weights
EchoMimicV3-preview | ModelScope | Our weights

The weights are organized as follows.

./models/
├── Wan2.1-Fun-V1.1-1.3B-InP
├── wav2vec2-base-960h
└── transformer
    └── diffusion_pytorch_model.safetensors
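
Before running inference, it can help to confirm that the downloaded files match the tree above. The following is a minimal sketch, not part of the repository: `missing_model_files` is a hypothetical helper, and it assumes the weights live under `./models` exactly as shown.

```python
from pathlib import Path

# Entries taken from the directory tree above, relative to ./models/.
EXPECTED = [
    "Wan2.1-Fun-V1.1-1.3B-InP",
    "wav2vec2-base-960h",
    "transformer/diffusion_pytorch_model.safetensors",
]


def missing_model_files(models_dir="./models"):
    """Return the expected entries that do not exist under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing under ./models:")
        for rel in missing:
            print("  -", rel)
    else:
        print("Model directory layout looks complete.")
```

The check only tests that each path exists; it does not validate the contents of the base model or audio encoder folders.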

🔑 Quick Inference

python infer.py

For the quantized GradioUI version:

python app_mm.py

Images, audio files, masks, and prompts are provided in datasets/echomimicv3_demos.

Tips

  • Audio CFG: audio_guidance_scale works best between 2 and 3. Increase it for better lip synchronization; decrease it for better visual quality.
  • Text CFG: guidance_scale works best between 3 and 6. Increase it for better prompt following; decrease it for better visual quality.
  • TeaCache: The optimal range for teacache_threshold is between 0 and 0.1.
  • Sampling steps: Use 5 steps for talking head and 15 to 25 steps for talking body.
  • Long video generation: To generate a video longer than 138 frames, use Long Video CFG.
  • Try setting partial_video_length to 81, 65 or smaller to reduce VRAM usage.
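
The ranges in these tips can be collected into a small pre-flight check. This is a sketch under stated assumptions: `audio_guidance_scale`, `guidance_scale`, and `teacache_threshold` are the names used above, while `num_inference_steps`, `mode`, and the `check_settings` helper are hypothetical names chosen for this example and may not match the arguments in `infer.py`.

```python
# Ranges quoted from the tips above. `num_inference_steps` and `mode` are
# hypothetical names used only in this sketch.
RECOMMENDED = {
    "audio_guidance_scale": (2.0, 3.0),
    "guidance_scale": (3.0, 6.0),
    "teacache_threshold": (0.0, 0.1),
}
STEPS = {"talking_head": (5, 5), "talking_body": (15, 25)}
LONG_VIDEO_FRAMES = 138


def check_settings(settings, mode="talking_head", num_frames=None):
    """Return a list of warnings for values outside the recommended ranges."""
    warnings = []
    for name, (lo, hi) in RECOMMENDED.items():
        value = settings.get(name)
        if value is not None and not lo <= value <= hi:
            warnings.append(f"{name}={value} is outside {lo}~{hi}")
    steps = settings.get("num_inference_steps")
    lo, hi = STEPS[mode]
    if steps is not None and not lo <= steps <= hi:
        warnings.append(
            f"num_inference_steps={steps} is outside {lo}~{hi} for {mode}"
        )
    if num_frames is not None and num_frames > LONG_VIDEO_FRAMES:
        warnings.append(
            f"{num_frames} frames exceeds {LONG_VIDEO_FRAMES}; consider Long Video CFG"
        )
    return warnings


if __name__ == "__main__":
    demo = {"audio_guidance_scale": 4.0, "guidance_scale": 4.5}
    for message in check_settings(demo, mode="talking_body", num_frames=200):
        print("warning:", message)
```

The function only warns; values outside these ranges are still allowed, since the tips describe recommended rather than required settings.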

📝 TODO List

Status Milestone
The inference code of EchoMimicV3 is released on GitHub
EchoMimicV3-preview model on HuggingFace
EchoMimicV3-preview model on ModelScope
🚀 ModelScope Space
🚀 Preview version Pretrained models trained on English and Chinese on ModelScope
🚀 720P Pretrained models trained on English and Chinese on HuggingFace
🚀 720P Pretrained models trained on English and Chinese on ModelScope
🚀 The training code of EchoMimicV3 on GitHub

🚀 EchoMimic Series

  • EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation. GitHub
  • EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. GitHub
  • EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning. GitHub

📒 Citation

If you find our work useful for your research, please consider citing the paper:

@misc{meng2025echomimicv3,
  title={EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation},
  author={Rang Meng and Yan Wang and Weipeng Wu and Ruobing Zheng and Yuming Li and Chenguang Ma},
  year={2025},
  eprint={2507.03905},
  archivePrefix={arXiv}
}

📜 License

The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content, and you are free to use it as long as your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations.
