HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Liyang Chen*, Tianxiang Ma*, Jiawei Liu, Bingchuan Li,
Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu§
* Equal contribution, Project lead, § Corresponding author
Tsinghua University | Intelligent Creation Team, ByteDance

✨ Key Features

HuMo is a unified, human-centric video generation framework designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. It supports strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.

  • VideoGen from Text-Image - Customize character appearance, clothing, makeup, props, and scenes using text prompts combined with reference images.
  • VideoGen from Text-Audio - Generate audio-synchronized videos from text and audio inputs alone, removing the need for image references and enabling greater creative freedom.
  • VideoGen from Text-Image-Audio - Achieve the highest level of customization and control by combining text, image, and audio guidance.

📑 Todo List

  • Release Paper
  • Checkpoint of HuMo-17B
  • Inference Code
    • Text-Image Input
    • Text-Audio Input
    • Text-Image-Audio Input
  • Multi-GPU Inference
  • Prompts to Generate the Faceless Thrones Demo
  • Checkpoint of HuMo-1.7B
  • Training Data

⚡️ Quickstart

Installation

conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
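
To confirm the environment is set up correctly, a quick sanity check along these lines should succeed (a minimal sketch; it only verifies that PyTorch sees a CUDA device and that flash_attn imports):

python -c "import torch, flash_attn; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"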

Model Preparation

Model              Download Link    Notes
HuMo-17B           🤗 Huggingface   Supports 480P & 720P
HuMo-1.7B          🤗 Huggingface   To be released soon
Wan-2.1            🤗 Huggingface   VAE & Text encoder
Whisper-large-v3   🤗 Huggingface   Audio encoder
Audio separator    🤗 Huggingface   Removes background noise (optional)

Download models using huggingface-cli:

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
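
After the downloads finish, ./weights should contain one folder per model. A minimal check, assuming the --local-dir paths used above:

for d in Wan2.1-T2V-1.3B HuMo whisper-large-v3 audio_separator; do
  [ -d "./weights/$d" ] && echo "ok: $d" || echo "missing: $d"
done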

Run Multimodal-Condition-to-Video Generation

Our model supports both 480P and 720P resolutions; 720P inference yields noticeably better quality.

Some tips

  • Please prepare your text, reference images, and audio as described in test_case.json.
  • We support multi-GPU inference using FSDP + Sequence Parallel.
  • The model is trained on 97-frame videos at 25 FPS. Generating videos longer than 97 frames may degrade performance; we will provide a new checkpoint for longer generation.

Configure HuMo

HuMo’s behavior and output can be customized by modifying the generate.yaml configuration file.
The following parameters control generation length, video resolution, and how text, image, and audio inputs are balanced:

generation:
  frames: <int>                 # Number of frames for the generated video.
  scale_a: <float>              # Strength of audio guidance. Higher = better audio-motion sync.
  scale_t: <float>              # Strength of text guidance. Higher = better adherence to text prompts.
  mode: "TA"                    # Input mode: "TA" for text+audio; "TIA" for text+image+audio.
  height: 720                   # Video height (e.g., 720 or 480).
  width: 1280                   # Video width (e.g., 1280 or 832).

diffusion:
  timesteps:
    sampling:
      steps: 50                 # Number of denoising steps. Lower (30–40) = faster generation.

1. Text-Audio Input

bash infer_ta.sh

2. Text-Image-Audio Input

bash infer_tia.sh
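
As an end-to-end sketch, switching the generate.yaml shown above from the default 720P text-audio setup to 480P text-image-audio and launching the matching script could look like the following (this assumes generate.yaml sits where your inference script reads it from and uses the key names and default values listed earlier; editing the file by hand works just as well):

# Hypothetical quick edit of the config keys shown above, then run inference.
sed -i 's/mode: "TA"/mode: "TIA"/' generate.yaml
sed -i 's/height: 720/height: 480/' generate.yaml
sed -i 's/width: 1280/width: 832/' generate.yaml
bash infer_tia.sh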

Acknowledgements

Our work builds upon and is greatly inspired by several outstanding open-source projects, including Phantom, SeedVR, MEMO, Hallo3, OpenHumanVid, and Whisper. We sincerely thank the authors and contributors of these projects for generously sharing their excellent code and ideas.

⭐ Citation

If HuMo is helpful, please ⭐ the repo.

If you find this project useful for your research, please consider citing our paper.

BibTeX

@misc{chen2025humo,
      title={HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning}, 
      author={Liyang Chen and Tianxiang Ma and Jiawei Liu and Bingchuan Li and Zhuowei Chen and Lijie Liu and Xu He and Gen Li and Qian He and Zhiyong Wu},
      year={2025},
      eprint={2509.08519},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.08519}, 
}

📧 Contact

If you have any comments or questions regarding this open-source project, please open a new issue or contact Liyang Chen and Tianxiang Ma.
