EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
¹Core Contributor  ²Corresponding Authors
- [2025.08.12] 🔥🚀 12G VRAM is All You Need to generate video. Please use this Gradio UI. Check out the tutorial from @gluttony-10. Thanks for the contribution.
- [2025.08.12] 🔥 EchoMimicV3 can run on 16G VRAM using ComfyUI. Thanks to @smthemex for the contribution.
- [2025.08.09] 🔥 We release our models on ModelScope.
- [2025.08.08] 🔥 We release our codes on GitHub and models on Huggingface.
- [2025.07.08] 🔥 Our paper is publicly available on arXiv.
(Demo videos: teaser_github.mp4, hoi_github.mp4, 01.mp4–04.mp4)
For more demo videos, please refer to the project page
- Tested System Environment: CentOS 7.2 / Ubuntu 22.04, CUDA >= 12.1
- Tested GPUs: A100 (80G) / RTX 4090D (24G) / V100 (16G)
- Tested Python Version: 3.10 / 3.11
For the quantized version, please use the one-click installation package to get started quickly.
conda create -n echomimic_v3 python=3.10
conda activate echomimic_v3
pip install -r requirements.txt
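As a quick sanity check for the environment above, the minimal sketch below only assumes that `torch` was installed by `requirements.txt`:

```python
# Minimal environment sanity check (assumes torch is installed via requirements.txt).
import sys
import torch

print(f"Python: {sys.version.split()[0]}")              # expected 3.10 / 3.11
print(f"CUDA available: {torch.cuda.is_available()}")   # expected True with CUDA >= 12.1
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```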
Models | Download Link | Notes |
---|---|---|
Wan2.1-Fun-V1.1-1.3B-InP | 🤗 Huggingface | Base model |
wav2vec2-base | 🤗 Huggingface | Audio encoder |
EchoMimicV3-preview | 🤗 Huggingface | Our weights |
EchoMimicV3-preview | 🤗 ModelScope | Our weights |
The weights are organized as follows:
./models/
├── Wan2.1-Fun-V1.1-1.3B-InP
├── wav2vec2-base-960h
└── transformer
└── diffusion_pytorch_model.safetensors
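As a sketch, the weights can be fetched with `huggingface_hub.snapshot_download`. The repository IDs below (`alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP`, `facebook/wav2vec2-base-960h`, `BadToBest/EchoMimicV3`) and target folders are assumptions, so verify them against the download links in the table and the layout above before running:

```python
# Hypothetical download sketch -- verify repo IDs and target folders before running.
from huggingface_hub import snapshot_download

snapshot_download("alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP",   # base model (assumed repo ID)
                  local_dir="./models/Wan2.1-Fun-V1.1-1.3B-InP")
snapshot_download("facebook/wav2vec2-base-960h",             # audio encoder
                  local_dir="./models/wav2vec2-base-960h")
snapshot_download("BadToBest/EchoMimicV3",                    # EchoMimicV3-preview weights (assumed repo ID)
                  local_dir="./models/transformer")
```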
python infer.py
For the quantized Gradio UI version:
python app_mm.py
Images, audios, masks, and prompts are provided in `datasets/echomimicv3_demos`.
- Audio CFG: `audio_guidance_scale` works optimally between 2~3. Increase the audio CFG value for better lip synchronization, while decreasing it can improve the visual quality.
- Text CFG: `guidance_scale` works optimally between 3~6. Increase the text CFG value for better prompt following, while decreasing it can improve the visual quality.
- TeaCache: The optimal range for `teacache_threshold` is between 0~0.1.
- Sampling steps: 5 steps for talking head, 15~25 steps for talking body.
- Long video generation: If you want to generate a video longer than 138 frames, you can use Long Video CFG.
- Try setting `partial_video_length` to 81, 65, or smaller to reduce VRAM usage (see the parameter sketch after this list).
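A hypothetical sketch of these knobs gathered in one place; the parameter names mirror the tips above, but how `infer.py` actually consumes them should be checked against the script itself:

```python
# Hypothetical tuning values -- names follow the tips above; check infer.py for the real interface.
config = {
    "audio_guidance_scale": 2.5,   # audio CFG, optimal around 2~3
    "guidance_scale": 4.5,         # text CFG, optimal around 3~6
    "teacache_threshold": 0.05,    # TeaCache, optimal range 0~0.1
    "num_inference_steps": 20,     # assumed name: 5 for talking head, 15~25 for talking body
    "partial_video_length": 81,    # 81 / 65 / smaller to reduce VRAM usage
}
```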
Status | Milestone |
---|---|
✅ | The inference code of EchoMimicV3 meets everyone on GitHub |
✅ | EchoMimicV3-preview model on HuggingFace |
✅ | EchoMimicV3-preview model on ModelScope |
🚀 | ModelScope Space |
🚀 | Preview version Pretrained models trained on English and Chinese on ModelScope |
🚀 | 720P Pretrained models trained on English and Chinese on HuggingFace |
🚀 | 720P Pretrained models trained on English and Chinese on ModelScope |
🚀 | The training code of EchoMimicV3 meets everyone on GitHub |
- EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation. GitHub
- EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. GitHub
- EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning. GitHub
If you find our work useful for your research, please consider citing the paper:
@misc{meng2025echomimicv3,
title={EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation},
author={Rang Meng and Yan Wang and Weipeng Wu and Ruobing Zheng and Yuming Li and Chenguang Ma},
year={2025},
eprint={2507.03905},
archivePrefix={arXiv}
}
- Wan2.1: https://github.com/Wan-Video/Wan2.1/
- VideoX-Fun: https://github.com/aigc-apps/VideoX-Fun/
The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content, granting you the freedom to use it while ensuring that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations.