
PantoMatrix
Generating Face and Body Animation from Speech

PantoMatrix is an open-source research project for generating 3D body and face animation from speech. It works as an API: it takes speech audio as input and outputs body and face motion parameters. You can transfer these motion parameters to other formats, such as iPhone ARKit blendshape weights or Vicon skeleton BVH files.

1. News

Animation Example

We welcome volunteers to contribute and collaborate on related topics; feel free to submit pull requests! This repo has been mainly maintained by haiyangliu1997@gmail.com in their free time since 2022.

  • [2025/01] A demo of how to set up inference and training is available on Colab!
  • [2025/01] The new inference API, visualization API, evaluation API, and training codebase are available!
  • [2024/07] Download SMPLX motion (.npz) files, visualize them with our Blender addon, and retarget them to your avatar!
  • [2024/04] Thanks to @camenduru, a Replicate version of EMAGE is available! You can call EMAGE directly via its API!
  • [2024/03] Thanks to @sunday9999 for speeding up inference video rendering from 1000s to 25s!
  • [2024/02] Thanks to @wubowen416 for the scripts for automatic video visualization (#83) during inference!
  • [2023/05] BEAT_GENEA is allowed for pretraining in GENEA2023! Thanks to the GENEA organizers!

2. Models and Tools List

| Model | Paper     | Inputs | Outputs**        | Language (Train)    | Full Body FGD | Weights |
|-------|-----------|--------|------------------|---------------------|---------------|---------|
| DisCo | ACMMM 2022 | Audio | Upper + Hands    | English (Speaker 2) | 2.233         | Link    |
| CaMN  | ECCV 2022  | Audio | Upper + Hands    | English (Speaker 2) | 2.120         | Link    |
| EMAGE | CVPR 2024  | Audio | Full Body + Face | English (Speaker 2) | 0.615         | Link    |

** Outputs are in SMPLX and FLAME parameters.
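To get a feel for the output format, you can inspect a saved motion file with numpy. This is a minimal sketch; the exact key names (e.g. "poses", "expressions", "trans", "betas") are assumptions based on the SMPLX/FLAME convention used by beat_format_save, so check your own file.

import numpy as np

motion = np.load("./examples/motion/result_motion.npz", allow_pickle=True)  # hypothetical path
print(list(motion.keys()))    # e.g. poses / expressions / trans / betas (check your file)
print(motion["poses"].shape)  # axis-angle body rotations, e.g. (num_frames, 55 * 3)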

| Resources     |                        |                    |                          |
|---------------|------------------------|--------------------|--------------------------|
| Datasets      | BEAT2 (SMPLX + FLAME)  | BEAT (BVH + ARKit) | Rendered Skeleton Videos |
| Blender Tools | Blender Addon          | Character on BEAT  | Blender Render Scripts   |
| SMPLX Tools   | SMPLX-FLAME Model      | ARKit2FLAME Scripts | ARKit2FLAME Weights     |
| Weights       | FGD on BEAT2           | FGD on BEAT        | Text Vocab               |

3. Quick Start (Inference)

Approach 1: Using Hugging Face Space

Upload your audio and directly download the results from our Hugging Face Space.

Animation Example

Approach 2: Local Setup

Clone the repository and set it up locally.
A demo of how to set it up is available on Colab.

git clone https://github.com/PantoMatrix/PantoMatrix.git
cd PantoMatrix/

bash setup.sh
source /content/py39/bin/activate

python test_camn_audio.py --visualization 
# if you have trouble installing pytorch3d,
# use --nopytorch3d; this skips rendering the 2D OpenPose-style video
python test_camn_audio.py --visualization --nopytorch3d

# try different models with your own data; put your audio in --audio_folder
# DisCo (ACMMM 2022), upper-body motion, with data resampling and rhythm-content disentanglement
python test_disco_audio.py --visualization --audio_folder ./examples/audio --save_folder ./examples/motion 
# CaMN (ECCV 2022), upper-body motion, with a body2hands decoder
python test_camn_audio.py --visualization --audio_folder ./examples/audio --save_folder ./examples/motion
# EMAGE (CVPR 2024), full-body + face animation
python test_emage_audio.py --visualization --audio_folder ./examples/audio --save_folder ./examples/motion

Approach 3: Call API Directly

# copy the ./models folder into your project folder
from models.camn_audio import CaMNAudioModel

model = CaMNAudioModel.from_pretrained("H-Liu1997/huggingface-model/camn_audio")
model.cuda().eval()

import librosa
import numpy as np
import torch
# copy the ./emage_utils folder in your project folder
from emage_utils import beat_format_save

audio_np, sr = librosa.load("/audio_path.wav", sr=model.cfg.audio_sr)
audio = torch.from_numpy(audio_np).float().cuda().unsqueeze(0)

motion_pred = model(audio)["motion_axis_angle"]
motion_pred_np = motion_pred.cpu().numpy()
beat_format_save(motion_pred_np, "/result_motion.npz")
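The same API can be looped over a folder of audio files. Below is a minimal sketch that reuses only the calls shown above; the folder paths are placeholders.

import os

audio_folder, save_folder = "./examples/audio", "./examples/motion"  # placeholder paths
os.makedirs(save_folder, exist_ok=True)
for fname in sorted(os.listdir(audio_folder)):
    if not fname.endswith(".wav"):
        continue
    audio_np, sr = librosa.load(os.path.join(audio_folder, fname), sr=model.cfg.audio_sr)
    audio = torch.from_numpy(audio_np).float().cuda().unsqueeze(0)
    with torch.no_grad():  # inference only, no gradients needed
        motion_pred = model(audio)["motion_axis_angle"]
    beat_format_save(motion_pred.cpu().numpy(), os.path.join(save_folder, fname.replace(".wav", ".npz")))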

4. Visualization

When you run the test scripts, the --visualization flag automatically enables visualization.
You can also run the visualization yourself using the approaches below.

Approach 1: Blender (Recommended)

Render the output in Blender by downloading the Blender addon.

Approach 2: 3D mesh

# render an npz file to a mesh video
from emage_utils import fast_render
fast_render.render_one_sequence_no_gt("/result_motion.npz", "/audio_path.wav", "/result_video.mp4", remove_global=True)
Example results: DisCo (Mesh) | CaMN (Mesh) | EMAGE (Mesh)

Approach 3: 2D OpenPose style video (Require Pytorch3D)

import numpy as np
from torchvision.io import write_video
from emage_utils.format_transfer import render2d
from emage_utils import fast_render

npz_path = "/result_motion.npz"  # path to the saved motion file
audio_path = "/audio_path.wav"   # path to the corresponding audio
motion_dict = np.load(npz_path, allow_pickle=True)
# face
v2d_face = render2d(motion_dict, (512, 512), face_only=True, remove_global=True)
write_video(npz_path.replace(".npz", "_2dface.mp4"), v2d_face.permute(0, 2, 3, 1), fps=30)
fast_render.add_audio_to_video(npz_path.replace(".npz", "_2dface.mp4"), audio_path, npz_path.replace(".npz", "_2dface_audio.mp4"))

# body
v2d_body = render2d(motion_dict, (720, 480), face_only=False, remove_global=True)
write_video(npz_path.replace(".npz", "_2dbody.mp4"), v2d_body.permute(0, 2, 3, 1), fps=30)
fast_render.add_audio_to_video(npz_path.replace(".npz", "_2dbody.mp4"), audio_path, npz_path.replace(".npz", "_2dbody_audio.mp4"))
Example results: DisCo (2D Pose) | CaMN (2D Pose) | EMAGE (2D Pose) | EMAGE-Face (2D Pose)

5. Evaluation

For academic users, the evaluation code is organized into an evaluation API.

# copy the ./emage_evaltools folder into your folder
from emage_evaltools.metric import FGD, BC, L1Div, LVDFace, MSEFace

# init
fgd_evaluator = FGD(download_path="./emage_evaltools/")
bc_evaluator = BC(download_path="./emage_evaltools/", sigma=0.3, order=7)
l1div_evaluator = L1Div()
lvd_evaluator = LVDFace()
mse_evaluator = MSEFace()

# Example usage
for motion_pred in all_motion_pred:
    # bc and l1 require position representation
    motion_position_pred = get_motion_rep_numpy(motion_pred, device=device, betas=betas)["position"] # t*55*3
    motion_position_pred = motion_position_pred.reshape(t, -1)
    # ignore the first and last 2 seconds; this may apply to the BEAT dataset only
    audio_beat = bc_evaluator.load_audio(test_file["audio_path"], t_start=2 * 16000, t_end=int((t-60)/30*16000))
    motion_beat = bc_evaluator.load_motion(motion_position_pred, t_start=60, t_end=t-60, pose_fps=30, without_file=True)
    bc_evaluator.compute(audio_beat, motion_beat, length=t-120, pose_fps=30)

    l1div_evaluator.compute(motion_position_pred)
    
    face_position_pred = get_motion_rep_numpy(motion_pred, device=device, expressions=expressions_pred, expression_only=True, betas=betas)["vertices"] # t -1
    face_position_gt = get_motion_rep_numpy(motion_gt, device=device, expressions=expressions_gt, expression_only=True, betas=betas)["vertices"]
    lvd_evaluator.compute(face_position_pred, face_position_gt)
    mse_evaluator.compute(face_position_pred, face_position_gt)
    
    # fgd requires the rotation 6d representation
    motion_gt = torch.from_numpy(motion_gt).to(device).unsqueeze(0)
    motion_pred = torch.from_numpy(motion_pred).to(device).unsqueeze(0)
    motion_gt = rc.axis_angle_to_rotation_6d(motion_gt.reshape(1, t, 55, 3)).reshape(1, t, 55*6)
    motion_pred = rc.axis_angle_to_rotation_6d(motion_pred.reshape(1, t, 55, 3)).reshape(1, t, 55*6)
    fgd_evaluator.update(motion_pred.float(), motion_gt.float())
    
metrics = {}
metrics["fgd"] = fgd_evaluator.compute()
metrics["bc"] = bc_evaluator.avg()
metrics["l1"] = l1_evaluator.avg()
metrics["lvd"] = lvd_evaluator.avg()
metrics["mse"] = mse_evaluator.avg()

Hyperparameters may vary depending on the dataset. For example, for the BC evaluator we use (sigma, order) = (0.3, 7) on the BEAT dataset and (0.5, 7) on the TalkShow dataset; adjust these based on your data, for example as shown below.
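A minimal sketch of switching the BC smoothing setting per dataset, reusing the BC constructor from the snippet above (the dictionary is only for illustration):

# (sigma, order) per dataset; adjust for your own data
bc_settings = {"beat": (0.3, 7), "talkshow": (0.5, 7)}
sigma, order = bc_settings["beat"]
bc_evaluator = BC(download_path="./emage_evaltools/", sigma=sigma, order=order)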


6. Training

This new codebase only includes the audio-only versions of the models, for better real-world applicability.
To reproduce the audio+text results in the papers, please refer to the previous codebases listed below.

| Model | Inputs (Paper)                  | Old Codebase | Input (Current Codebase) |
|-------|---------------------------------|--------------|--------------------------|
| DisCo | Audio + Text                    | link         | Audio                    |
| CaMN  | Audio + Text + Emotion + Facial | link         | Audio                    |
| EMAGE | Audio + Text                    | link         | Audio                    |

Before You Start

Environment setup; skip this if you have already set up the environment for inference.

# if you didn't run the test scripts, run the four commands below.
# git clone https://github.com/PantoMatrix/PantoMatrix.git
# cd PantoMatrix/
# bash setup.sh
# source /content/py39/bin/activate

# Download the BEAT2 dataset
sudo apt-get update
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/datasets/H-Liu1997/BEAT2

Your folder structure should look like the following so the paths resolve correctly:

/your_root/
`-- PantoMatrix/
    |-- BEAT2/
    `-- train_emage_audio.py

Method 1: Training EMAGE

# Preprocessing: extract the foot contact data
python ./datasets/foot_contact.py

# (todo) train the vqvae

# train the audio2motion model
torchrun --nproc_per_node 1 --nnodes 1 train_emage_audio.py --config ./configs/emage_audio.yaml --evaluation

Use these flags as needed (see the example commands after this list):

  • --evaluation: Calculate the test metric.
  • --wandb: Activate logging to WandB.
  • --visualization: Render test results (slow; disable for efficiency).
  • --test: Test mode; load last checkpoint and evaluate.
  • --debug: Debug mode; iterate one data point for fast testing.
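For example, a typical workflow combines several of these flags; the commands below are only illustrations of flag usage, not required configurations.

# quick debug run with WandB logging, no rendering
torchrun --nproc_per_node 1 --nnodes 1 train_emage_audio.py --config ./configs/emage_audio.yaml --evaluation --wandb --debug

# load the last checkpoint and evaluate only
torchrun --nproc_per_node 1 --nnodes 1 train_emage_audio.py --config ./configs/emage_audio.yaml --test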

Method 2: Training CaMN

torchrun --nproc_per_node 1 --nnodes 1 train_camn_audio.py --config ./configs/camn_audio.yaml --evaluation

Method 3: Training DisCo

# (optional) Extract the cluster information
# python ./datasets/clustering.py

# train audio2motion 
torchrun --nproc_per_node 1 --nnodes 1 train_disco_audio.py --config ./configs/disco_audio.yaml --evaluation

Reference

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling (CVPR 2024)

Haiyang Liu*, Zihao Zhu*, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Naoya Iwamoto, Bo Zheng, Michael J. Black

(*Equal Contribution)

- Project Page - Paper - Video - Code - Demo - Dataset - Blender Add-On -


BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis (ECCV 2022)

- Project Page - Paper - Video - Code - Colab Demo - Dataset - Benchmark -


DisCo: Disentangled Implicit Content and Rhythm Learning for Diverse Co-Speech Gesture Synthesis (ACMMM 2022)

- Project Page - Paper - Video - Code -
