
Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

arXiv Website

This repository collects and categorizes papers, datasets, and resources related to generative AI for character animation, following the structure of our survey. As generative AI continues to transform animation, from realistic facial synthesis to dynamic gesture and motion generation, we will keep both the paper and this repository up to date so that they serve as a comprehensive guide for researchers and practitioners working on future projects.


πŸ“’ News

  • April 27, 2025: We release the first version of our survey.

    Feel free to cite, contribute, or open a pull request to add recent related papers!

πŸ“ Abstract

Generative AI is transforming various fields, including art, gaming, and animation. One of its most significant applications lies in animation, where advances in artificial intelligence, such as foundation models and diffusion models, have driven remarkable progress, significantly reducing the time and cost of content creation. Characters are central to animation and involve elements such as motion, emotions, gestures, and facial expressions. Rapid and wide-ranging developments in AI-driven animation technologies have made it challenging to maintain an overarching view of progress in the field, highlighting the need for a comprehensive survey to integrate and contextualize these advancements.

This survey offers a comprehensive review of the state-of-the-art generative AI applications for animated character design and behavior, integrating a wide range of aspects often examined in isolation (e.g., avatars, gestures, and facial expressions). Unlike previous studies, it provides a unified perspective covering all major applications of generative AI in character animation. The survey begins with foundational concepts and introduces evaluation metrics tailored to this domain, then explores key areas such as facial animation, image synthesis, avatar generation, gesture modeling, motion synthesis, expression rendering, and texture generation. Finally, it addresses the main challenges and outlines future research directions, offering a roadmap to advance AI-driven character animation technologies. This survey aims to serve as a resource for researchers and developers in generative AI for animation and related fields.


πŸ—Ί Overview

overview.png

🌳 Taxonomy

Taxonomy_page-0001

πŸ“š Background

πŸ€– Models

🎨 Computer Graphics Models

  • SMPL πŸ”—
    A popular parametric model representing 3D human body geometry using a low-dimensional representation for shape (β) and pose (θ).
    • SMPL+H
      An extension of SMPL that incorporates detailed hand modeling by introducing hand joint parameters (θ_hands).
    • SMPL-X
      Further extends SMPL+H by including facial expressions along with detailed hand and body modeling for full-body human representation.
  • SMIL (Skinned Multi-Infant Linear Model) πŸ”—
    A model developed specifically for infants, addressing challenges in capturing non-cooperative subjects with low-quality RGB-D data.
  • SMAL (Skinned Multi-Animal Linear Model) πŸ”—
    Designed for 3D modeling of animals, enabling the creation of a shape space from a few scans of diverse species.
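
The following is a minimal sketch of driving a parametric body model from the SMPL family described above: shape coefficients (β) and axis-angle pose parameters (θ) go in, a posed mesh and joint locations come out. It assumes the third-party `smplx` Python package and SMPL model files downloaded separately from the official site; paths and parameter sizes are illustrative.

```python
# Minimal sketch: pose a SMPL body from shape (beta) and pose (theta) parameters.
# Assumes `pip install smplx torch` and SMPL model files placed under ./models.
import torch
import smplx

model = smplx.create("./models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)          # shape coefficients (beta)
body_pose = torch.zeros(1, 69)      # axis-angle pose for 23 body joints (theta)
global_orient = torch.zeros(1, 3)   # root orientation

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(output.vertices.shape)        # (1, 6890, 3): the posed SMPL mesh
print(output.joints.shape)          # 3D joint locations for the same pose
```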

πŸ‘€ Vision

  • Convolutional Neural Networks (CNNs) πŸ”—
    CNNs are specialized for image-related tasks by using convolution, pooling, and fully connected layers.
  • 3D CNNs πŸ”—
    Extend CNNs to process volumetric data (e.g., videos, MRI scans) by using 3D convolutional kernels.
  • U-Net πŸ”—
    A U-shaped network architecture designed for biomedical image segmentation, known for its efficient denoising and skip connections.
  • Inception πŸ”—
    Introduces multi-scale processing via parallel convolutions (1x1, 3x3, 5x5) for improved feature extraction.
  • VGG πŸ”—
    Evaluates the impact of increasing CNN depth using very small (3x3) filters to capture complex visual features.
  • ResNet πŸ”—
    Introduces residual learning with shortcut connections to enable training of very deep networks (up to 152 layers); a minimal residual-block sketch follows this list.
  • Vision Transformers (ViTs) πŸ”—
    Applies the self-attention mechanism to image patches, offering competitive performance on image recognition tasks.
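
To make the residual-learning idea behind ResNet above concrete, here is a minimal, generic residual block in PyTorch; it illustrates the shortcut connection only and is not the exact layer layout of the original architecture.

```python
# A residual block: the network learns a residual F(x) that is added to the
# identity shortcut x, which eases optimization of very deep stacks.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```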

πŸ“ Language Models

  • RNNs πŸ”—
    General recurrent neural networks for sequence modeling.
  • Bidirectional RNNs (BRNNs) πŸ”—
    Process sequences in both directions to leverage past and future context.
  • Encoder-Decoder Frameworks πŸ”—
    Used for tasks like machine translation by compressing sequences into a fixed-length vector.
  • LSTMs πŸ”—
    Introduces memory cells and gating mechanisms to capture long-term dependencies.
  • GRUs πŸ”—
    A streamlined variant of LSTMs merging input and forget gates into an update gate.
  • Attention Mechanisms πŸ”—
    Allows models to dynamically focus on different parts of the input sequence; a minimal self-attention sketch follows this list.
  • Transformers πŸ”—
    Utilize self-attention to process sequences without recurrence.
  • BERT πŸ”—
    Bidirectional Encoder Representations from Transformers for deep language understanding.
  • GPT Series:
    • PoseGPT πŸ”—
      Specialized for pose estimation in video generation.
    • GestureGPT πŸ”—
      Extends the GPT framework to generate realistic human gestures based on text or audio input.
    • MotionGPT πŸ”—
      Designed for generating motion sequences.
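
The Transformer-family entries above (attention mechanisms, Transformers, BERT, and the GPT variants) all build on scaled dot-product attention. Below is a minimal NumPy sketch of that operation for illustration; names and shapes are assumptions of this example.

```python
# Scaled dot-product attention: each query attends to every key and returns a
# softmax-weighted sum of the values.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values

# Toy usage: 4 tokens with 8-dimensional embeddings attending to themselves.
x = np.random.default_rng(0).normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)    # (4, 8)
```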

πŸ•’ Temporal Sequence Modeling

  • Temporal Convolutional Networks (TCNs) πŸ”—
    Use causal and dilated convolutions to model sequential data efficiently.
  • Transformer-XL πŸ”—
    Extends Transformers with a segment-level recurrence mechanism to capture long-range dependencies.
  • ConvLSTM πŸ”—
    Combines CNNs and LSTM units to capture both spatial and temporal dynamics in spatiotemporal data.

πŸ—£ Speech Models

  • WaveNet πŸ”—
    An autoregressive model for raw audio synthesis using dilated causal convolutions.
  • Tacotron πŸ”—
    A sequence-to-sequence TTS model that converts text to mel-spectrograms via attention.
  • Tacotron 2 πŸ”—
    Combines Tacotron with a WaveNet vocoder for end-to-end, high-fidelity speech synthesis.
  • FastSpeech πŸ”—
    A non-autoregressive TTS model using transformers for parallel synthesis to reduce latency.
  • FastSpeech 2 πŸ”—
    Improves FastSpeech by introducing variance predictors for pitch, energy, and duration for more natural speech.
  • wav2vec πŸ”—
    A self-supervised framework for learning robust speech representations directly from raw audio.
  • wav2vec 2.0 πŸ”—
    Enhances wav2vec with quantization and contextual embeddings to improve ASR performance.
  • HuBERT πŸ”—
    Uses clustering-based pseudo-labeling and masked prediction to learn effective speech representations.
  • Whisper πŸ”—
    A transformer-based model for multilingual ASR, translation, and transcription with zero-shot capabilities.
  • SeamlessM4T πŸ”—
    An end-to-end model for universal speech translation and generation that preserves speaker emotion via attention.

🎭 Additional Generative Models

  • GANs (Generative Adversarial Networks) πŸ”—
    An adversarial framework where a generator and discriminator engage in a minimax game to synthesize realistic data.
  • CycleGAN πŸ”—
    Enables unpaired image-to-image translation by enforcing cycle consistency between two domains.
  • Autoencoders
    A general framework that compresses input data into a latent representation and reconstructs it for unsupervised learning.
  • Variational Autoencoders (VAEs) πŸ”—
    Probabilistic autoencoders that regularize the latent space using KL divergence to generate new data samples.
  • Vector Quantized VAEs (VQ-VAEs) πŸ”—
    Enhances VAEs by discretizing the latent space with a codebook for more structured representations.
  • NeRF (Neural Radiance Fields) πŸ”—
    Learns an implicit 3D scene representation via volumetric rendering for novel view synthesis.
  • 3D Gaussian Splatting (3DGS) πŸ”—
    Represents 3D scenes with a collection of Gaussian functions for efficient real-time rendering.
  • Denoising Diffusion Probabilistic Models (DDPMs) πŸ”—
    Generates high-quality outputs by iteratively denoising samples, starting from pure noise (a training-objective sketch follows this list).
  • ControlNet πŸ”—
    Augments diffusion models with auxiliary conditioning inputs for precise image generation.
  • DALL-E πŸ”—
    An autoregressive transformer that generates images from text by jointly modeling text and image tokens.
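
To make the DDPM entry above concrete, the sketch below shows the standard noise-prediction training objective: sample a timestep, corrupt the clean data with the closed-form forward process, and regress the injected noise. The denoising network `model(x_t, t)` is a placeholder, and the linear beta schedule is just one common choice.

```python
# DDPM training loss (noise prediction), assuming a user-supplied denoiser
# `model(x_t, t)` that predicts the noise added at timestep t.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    """x0: a batch of clean samples, shape (B, ...)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                  # random timestep per sample
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward diffusion
    return torch.mean((model(x_t, t) - noise) ** 2)       # predict the noise
```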

πŸ“Š Metrics

βœ… Quality and Realism of Generated Output

These metrics assess how natural, realistic, and perceptually convincing the generated content appears.

Metric Description Formula
Fréchet Inception Distance (FID) Measures statistical distance between real and generated images. $\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$
CLIP Score Evaluates semantic similarity between generated images and textual descriptions. $\text{CLIPScore} = \frac{t \cdot i}{\lVert t \rVert \lVert i \rVert}$
Mean Squared Error (MSE) Measures pixel-wise difference between generated and real images. $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2$
Learned Perceptual Image Patch Similarity (LPIPS) Assesses perceptual similarity using deep feature embeddings. $\text{LPIPS}(x,y) = \sum_l \frac{1}{H_l W_l} \sum_{h=1}^{H_l}\sum_{w=1}^{W_l}\lVert \phi_l(x)^{h,w}-\phi_l(y)^{h,w} \rVert_2^2$
Identity Consistency Ensures identity preservation in generated faces by computing cosine similarity. $\text{IC} = \frac{1}{N}\sum_{i=1}^{N} \text{cosine-sim}\Bigl(f(x_i), f(y_i)\Bigr)$
Fréchet Gesture Distance (FGD) Measures statistical differences between real and generated gesture distributions. $\text{FGD} = \lVert \mu_{\text{real}} - \mu_{\text{gen}} \rVert^2 + \text{tr}(\Sigma_{\text{real}} + \Sigma_{\text{gen}} - 2(\Sigma_{\text{real}}\Sigma_{\text{gen}})^{1/2})$
CLIP Fréchet Inception Distance (CLIP FID) A CLIP-based extension of FID for assessing generated textures. $\text{CLIPFID} = \lVert \mu_{\text{CLIP,real}} - \mu_{\text{CLIP,gen}} \rVert^2 + \text{tr}(\Sigma_{\text{CLIP,real}} + \Sigma_{\text{CLIP,gen}} - 2(\Sigma_{\text{CLIP,real}} \Sigma_{\text{CLIP,gen}})^{1/2})$
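
As a worked example of the FID formula above, the sketch below computes it from pre-extracted feature vectors (e.g., Inception or CLIP activations); FGD and CLIP FID follow the same computation with different feature extractors. Array names are illustrative.

```python
# FID from (N, D) feature arrays of real and generated samples.
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = covmean.real                                    # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```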

πŸ”„ Diversity and Multimodality

These metrics assess whether the generative model produces diverse and varied outputs.

Metric Description Formula
Diversity Quantifies variation between independently sampled subsets of generated outputs. $\text{Diversity} = \frac{1}{N}\sum_{i=1}^{N}\lVert x_i - x'_i \rVert^2$
Multimodality Measures diversity of outputs within the same action class. $\text{Multimodality} = \frac{1}{C \cdot N}\sum_{c=1}^{C}\sum_{n=1}^{N}\lVert x_{c,n} - x'_{c,n} \rVert^2$
Average Pairwise Distance (APD) Evaluates diversity across generated samples. $\text{APD} = \frac{1}{N(N-1)}\sum_{i\neq j} \lVert x_i - x_j \rVert$
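
For illustration, the sketch below computes the Average Pairwise Distance (APD) defined above over generated samples flattened to feature vectors; Diversity and Multimodality are computed analogously over paired subsets of samples.

```python
# APD over N generated samples given as an (N, D) array.
import numpy as np

def average_pairwise_distance(samples):
    n = samples.shape[0]
    diffs = samples[:, None, :] - samples[None, :, :]  # (N, N, D)
    dists = np.linalg.norm(diffs, axis=-1)             # pairwise L2 distances
    return float(dists.sum() / (n * (n - 1)))          # diagonal terms are zero
```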

🎯 Relevance and Accuracy

These metrics assess how well the generated content aligns with ground truth data.

Metric Description Formula
Mean Absolute Joint Error (MAJE) Measures positional accuracy of generated motion. $\text{MAJE} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - y_i \rvert$
Probability of Correct Keypoints (PCK) Evaluates the percentage of correct keypoint predictions. $\text{PCK} = \frac{\text{number of correct keypoints}}{\text{number of total keypoints}}$
Beat Consistency (BC) Measures alignment between motion and speech rhythms. $\text{BC} = \frac{1}{T}\sum_{t=1}^{T}\cos\bigl(\text{motion-beats}(t), \text{speech-beats}(t)\bigr)$
CLIP-Var Quantifies texture consistency across different views. $\text{CLIP-Var} = 1 - \min_{i \neq j}\frac{f_i \cdot f_j}{\lVert f_i \rVert \lVert f_j \rVert}$
Multimodal Distance (MM-Distance) Measures alignment between generated motion and textual descriptions. $\text{MM-Distance} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\lVert f_{a,n} - f_{b,n} \rVert^2}$
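
A minimal sketch of PCK as defined above: a predicted keypoint counts as correct when it lies within a distance threshold of the ground truth. The fixed threshold `alpha` used here is an assumption; evaluation protocols typically normalize it by a reference length such as head or torso size.

```python
# PCK from predicted and ground-truth keypoints of shape (N, K, dims).
import numpy as np

def pck(pred, gt, alpha=0.1):
    dists = np.linalg.norm(pred - gt, axis=-1)  # (N, K) per-keypoint errors
    return float((dists <= alpha).mean())       # fraction of correct keypoints
```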

πŸƒ Physical Plausibility and Interaction

These metrics assess whether generated motion adheres to real‑world physical constraints.

Metric Description Formula
Foot Skating (FS) Detects unnatural foot movements in generated motion. $\text{FS} = \frac{1}{T}\sum_{t=1}^{T}\lVert \text{foot-velocity}(t) - \text{expected-velocity}(t) \rVert$
Mean Acceleration Difference (MAD) Evaluates smoothness of generated motion by comparing acceleration. $\text{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lVert a_i^{\text{gen}} - a_i^{\text{gt}} \rVert^2$
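
As an illustration of MAD, the sketch below approximates accelerations from joint positions with second-order finite differences and compares generated motion against ground truth; the frame rate is an assumed parameter of this example.

```python
# MAD between generated and ground-truth joint trajectories of shape (T, J, 3).
import numpy as np

def mad(gen_pos, gt_pos, fps=30.0):
    dt = 1.0 / fps
    acc_gen = np.diff(gen_pos, n=2, axis=0) / dt**2  # second finite difference
    acc_gt = np.diff(gt_pos, n=2, axis=0) / dt**2
    return float(np.mean(np.sum((acc_gen - acc_gt) ** 2, axis=-1)))
```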

⚑️ Efficiency and Computational Metrics

These metrics evaluate the computational cost of generative models.

Metric Description Formula
Execution Time Measures the time required to generate outputs. $\text{Execution Time} = \text{End Time} - \text{Start Time}$
Kernel Inception Distance (KID) Measures output similarity using kernel functions. $\text{KID} = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j)$
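
The KID formula above is the unbiased squared-MMD estimator. A minimal sketch using the commonly adopted cubic polynomial kernel is shown below; the kernel choice is an assumption of this example.

```python
# KID between (N, D) and (M, D) feature arrays with a polynomial kernel.
import numpy as np

def kid(x_feats, y_feats):
    d = x_feats.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3              # cubic polynomial kernel
    kxx, kyy, kxy = k(x_feats, x_feats), k(y_feats, y_feats), k(x_feats, y_feats)
    n, m = x_feats.shape[0], y_feats.shape[0]
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))  # drop i == j terms
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_xx + term_yy - 2.0 * kxy.mean())
```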



πŸ‘¨ Face

Focuses on realistic face generation, facial reenactment, and attribute editing using GANs, diffusion models, and specialized frameworks.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
RaFD More than 8,000 images. Images of 67 models displaying eight facial expressions, photographed from five different angles. πŸ–ΌοΈ Images RaFD
MPIE Over 750,000 images with a broad range of variations in facial expressions, head poses, and lighting conditions. πŸ–ΌοΈ Images MPIE
VoxCeleb1 More than 100,000 utterances from 1,251 celebrities. πŸ”Š Audio, πŸŽ₯ Video VoxCeleb1
VoxCeleb2 Over 1 million utterances from 6,112 celebrities. πŸ”Š Audio, πŸŽ₯ Video VoxCeleb2
CelebA-HQ 30,000 images at a resolution of 1024×1024, providing detailed facial images of celebrities. πŸ–ΌοΈ Images CelebA-HQ
FaceForensics Over 1,000 video sequences with various face manipulations. πŸŽ₯ Video FaceForensics
300-VW About 300 videos of faces in various scenarios and lighting conditions. πŸŽ₯ Video 300-VW
FFHQ 70,000 images with extensive diversity, capturing various facial features, accessories, and environments. πŸ–ΌοΈ Images FFHQ
AffectNet Over 1 million images collected from the internet, with annotations for 11 different facial expressions and emotions. πŸ–ΌοΈ Images AffectNet
M³ CelebA Over 150K facial images annotated with semantic segmentation, facial landmarks, and captions in multiple languages. πŸ–ΌοΈ Images, πŸ“ Text M³ CelebA
CUB Over 11,000 images of 200 bird species, each annotated with various attributes like species, part locations, and bounding boxes. πŸ–ΌοΈ Images CUB
CelebA-Dialog 202,599 face images from 10,177 identities, annotated with 5 fine-grained attributes: Bangs, Eyeglasses, Beard, Smiling, Age, along with captions and user editing requests. πŸ–ΌοΈ Images, πŸ“ Text CelebA-Dialog
LS3D-W A dataset of 230,000 3D facial landmarks. πŸ–ΌοΈ Images LS3D-W
MERL-RAV Over 19,000 face images with diverse head poses, all annotated with 68-point landmarks and visibility status. πŸ–ΌοΈ Images MERL-RAV
AFLW2000-3D Contains 2000 images with 68-point 3D facial landmarks, used to evaluate 3D facial landmark detection models with diverse head poses. πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data AFLW2000-3D
FaceScape Over 18K textured 3D faces, captured from 938 subjects, each with 20 specific expressions. πŸ”· 3D/Point Cloud Data FaceScape

πŸ€– Models

  • StyleGAN πŸ”—
    A generative adversarial network known for producing high-quality, photorealistic images. It serves as a backbone for many face generation and editing tasks.

  • ResNet πŸ”—
    A convolutional neural network architecture that provides robust feature extraction, often used as a backbone in face generation pipelines.

  • Dual-Generator (DG) πŸ”—
    A large-pose face reenactment model composed of two modules: the ID-Preserving Shape Generator (IDSG), which uses 3D landmark detection to capture local shape variations, and the Reenacted Face Generator (RFG), based on StarGAN2, to produce the final output.

  • Feature Disentanglement and Identity Transfer Model πŸ”—
    An approach that bypasses the need for pre-trained structural priors by using a Feature Disentanglement module with Feature Displacement Fields (FDF) and an Identity Transfer (IdT) module based on self-attention to align source identity with target attributes.

  • Unified Neural Face Reenactment Pipeline πŸ”—
    A pipeline that leverages a 3D shape model to obtain disentangled representations of pose, expression, and identity, mapping changes in these parameters to the latent space of a fine-tuned StyleGAN2 for accurate face reenactment.

  • Controllable 3D Generative Adversarial Face Model πŸ”—
    A model that employs a Supervised Auto-Encoder (SAE) to disentangle identity and expression into separate latent spaces, using a Conditional GAN (cGAN) for smooth and controllable expression intensity.

  • AlbedoGAN πŸ”—
    A self-supervised 3D generative face model that synthesizes high-resolution albedo and detailed 3D geometry. It refines facial textures (e.g., wrinkles) via a mesh refinement displacement map integrated with the FLAME model, and leverages CLIP for text-guided editing.

  • IricGAN (Information Retention and Intensity Control GAN) πŸ”—
    A face editing method designed to preserve identity and semantic details while enabling controlled modifications of facial attributes. It features a Hierarchical Feature Combination (HFC) module and an Attribute Regression Module (ARM) for smooth intensity control.

  • GSmoothFace πŸ”—
    A speech-driven talking face generation framework based on fine-grained 3D face modeling. It addresses lip synchronization and generalizability across speakers by introducing bias-based cross-attention and a Morphology Augmented Face Blending (MAFB) module.

  • Adaptive Latent Editing Model πŸ”—
    A face editing approach that uses adaptive and nonlinear latent space transformations to flexibly learn transformations for complex, conditional edits while maintaining image quality and realism.

  • StyleT2I πŸ”—
    A text-to-image synthesis model that improves compositionality and fidelity. It uses a CLIP-guided Contrastive Loss and a Text-to-Direction module to align StyleGAN's latent codes with text descriptions, enhancing attribute control.

  • Hybrid Neural-Graphics Face Generation Model πŸ”—
    A model that combines neural networks (using StyleGAN2 for texture and background synthesis) with fixed-function graphics components (such as a differentiable renderer and the FLAME 3D head model) to achieve interpretable control over facial attributes.

  • M3Face πŸ”—
    A framework leveraging multimodal and multilingual inputs for both face generation and editing. It uses the Muse model to generate segmentation masks or landmarks from text and applies ControlNet architectures to refine the results, streamlining the process into a single step.

  • GuidedStyle πŸ”—
    A framework for semantic face editing on StyleGAN that employs a pre-trained attribute classifier as a knowledge network and sparse attention to guide layer-specific modifications, ensuring that only targeted facial features are changed.

  • AnyFace πŸ”—
    The first free-style text-to-face synthesis model capable of handling open-world text descriptions. It features a two-stream architecture that decouples text-to-face generation from face reconstruction, using CLIP-based cross-modal distillation and a Diverse Triplet Loss to enhance alignment and diversity.

  • HiFace πŸ”—
    A 3D face reconstruction model that decouples static (e.g., skin texture) and dynamic (e.g., wrinkles) details using its SD-DeTail Module. It extracts shape and detail coefficients via ResNet-50 and uses MLPs with AdaIN to generate detailed displacement maps for realistic reconstructions and animations.


πŸ˜ƒ Expression

Covers emotion-driven synthesis, facial expression retargeting, and multimodal methods that capture nuanced nonverbal cues.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
BEAT 76 hours of speech data, paired with 52D facial blend shape weights; 30 speakers performing in 8 distinct emotional styles across 4 languages. πŸ”Š Audio, πŸ–ΌοΈ Images, πŸŽ₯ Video, πŸ“ Text BEAT
MEAD A talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three intensity levels; approximately 40 hours of audio-visual clips per person and view. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ–ΌοΈ Images MEAD
TEAD 50,000 quadruples, each including text, emotion tags, Action Units, blend shape weights, and situation sentences. πŸ“ Text, πŸ–ΌοΈ Images -
JAFFE 213 images of 10 Japanese female models posing 7 facial expressions, annotated with average semantic ratings from 60 annotators. πŸ–ΌοΈ Images JAFFE
MMI Facial Expression Over 2900 videos and high-resolution still images of 75 subjects. πŸŽ₯ Video, πŸ–ΌοΈ Images, πŸ“ Text MMI
Multiface High-quality recordings of the faces of 13 identities. An average of 23,000 frames per subject; each frame includes roughly 160 different camera views. πŸ–ΌοΈ Images, πŸ”Š Audio, πŸ“‹ Tabular Data Multiface
ICT FaceKit 4,000 high-resolution facial scans of 79 subjects (34 female, 45 male) aged 18–67, plus 99 full-head scans and 26 expressions per subject. πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images ICT FaceKit
TikTok Dataset Over 300 single-person dance videos (10–15 seconds each), extracted at 30fps, yielding 100K+ frames. Includes segmented images and computed UV coordinates. πŸŽ₯ Video, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data TikTok Dataset
Everybody Dance Now Long single-dancer videos for training and evaluation; includes both self-filmed videos and short YouTube videos. πŸŽ₯ Video, πŸ“‹ Tabular Data Everybody Dance Now
Obama Weekly Footage 17 hours of video footage, nearly two million frames, spanning eight years. πŸŽ₯ Video, πŸ”Š Audio Obama Weekly Footage
VoxCeleb2 Over 1 million utterances from over 6,000 speakers, collected from YouTube videos with 61% male speakers. πŸ”Š Audio, πŸŽ₯ Video VoxCeleb2
BIWI Over 15K images of 20 people recorded with a Kinect while turning their heads around freely. πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data, πŸ“ Text BIWI
VOCASET About 29 minutes of high-fidelity 4D scans captured at 60fps, synchronized with audio; features 12 speakers with 40 sequences per subject (each sequence consists of English sentences lasting 3–5 seconds). πŸ”· 3D/Point Cloud Data, πŸ”Š Audio VOCASET
SHOW Contains SMPLX parameters of 4 persons reconstructed from videos; includes 88-frame motion clips for training and validation. πŸŽ₯ Video, πŸ”Š Audio, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data SHOW

πŸ€– Models

πŸŽ™ Speech-Driven & Multimodal Expression Generation

  • Joint Audio-Text Model for 3D Facial Animation πŸ”—
    Integrates a GPT-2-based text encoder with a dilated convolution audio encoder to improve upper-face expressiveness and lip synchronization. Lacks head and gaze control.

  • VOCA πŸ”—
    A speech-driven facial animation model used as a baseline for lip synchronization and expressiveness.

  • MeshTalk πŸ”—
    A model for speech-driven 3D facial animation, serving as a comparison baseline for upper-face motion and expressiveness.

  • CSTalk πŸ”—
    Employs a transformer-based encoder to capture correlations across facial regions, enhancing emotional speech-driven animation; limited to five discrete emotions.

  • ExpCLIP πŸ”—
    Aligns text, image, and expression embeddings via CLIP encoders, enabling expressive speech-driven facial animation from text/image prompts by leveraging the TEAD dataset and Expression Prompt Augmentation.

  • Style-Content Disentangled Expression Model πŸ”—
    Enhances personalization in facial animation by disentangling style and content representations, thereby improving identity retention and transition smoothness. (Compared to FaceFormer.)

  • FaceFormer πŸ”—
    A speech-driven facial animation model noted for its audio-visual synchronization, used as a baseline for comparison.

  • AdaMesh πŸ”—
    Introduces an Expression Adapter (MoLoRA-enhanced) and a Pose Adapter (retrieval-based) for personalized speech-driven facial animation, achieving improved expressiveness, diversity, and synchronization compared to models such as GeneFace and Imitator.

  • FaceXHuBERT πŸ”—
    Explores disentangling emotional expressiveness through multimodal representations as part of advanced speech-driven facial animation.

  • FaceDiffuser πŸ”—
    Utilizes stochastic approaches to enhance motion variability and disentangle emotional expressiveness in facial animation.

πŸ” Expression Retargeting & Motion Transfer

  • Neural Face Rigging (NFR) πŸ”—
    Automates 3D mesh rigging by encoding interpretable deformation parameters, enabling fine-grained facial expression transfer.

  • MagicPose πŸ”—
    Leverages diffusion models for 2D facial expression retargeting, balancing identity preservation and motion control through Multi-Source Attention and Pose ControlNet.

  • DiffSHEG πŸ”—
    Pioneers joint 3D facial expression and gesture synthesis with speech-driven alignment, employing Fast Out-Painting-based Partial Autoregressive Sampling (FOPPAS) for seamless, real-time motion generation.

  • DreamPose πŸ”—
    A baseline model for 2D facial expression retargeting used for comparison with MagicPose.

  • Disco πŸ”—
    Serves as a comparison baseline in 2D facial expression retargeting, noted for its identity retention and generalization capabilities.

  • TalkSHOW πŸ”—
    A speech-driven facial animation model referenced as a baseline for comparison with DiffSHEG.

  • LS3DCG πŸ”—
    A model for 3D facial expression and gesture synthesis used as a baseline when comparing motion realism and synchronization.

  • DiffuseStyleGesture πŸ”—
    Referenced as a baseline model for facial expression and gesture synthesis in comparison to DiffSHEG.


πŸ–Ό Image

Explores diffusion-based methods, VAEs, and other generative techniques to produce high-fidelity images and textures for animation backgrounds and elements.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
LAION-5B 5.85 billion CLIP-filtered image-text pairs πŸ–ΌοΈ Images, πŸ“ Text LAION-5B
LAION-400M 400M English (image, text) pairs πŸ–ΌοΈ Images, πŸ“ Text LAION-400M
LAION-Aesthetics v2 Image-text pairs grouped by predicted aesthetics score: 1.2B with score ≥4.5; 939M ≥4.75; 600M ≥5; 12M ≥6; 3M ≥6.25; 625K ≥6.5. πŸ–ΌοΈ Images, πŸ“ Text LAION-Aesthetics v2
Open Images V7 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives πŸ–ΌοΈ Images Open Images V7
COYO 747M image-text pairs πŸ–ΌοΈ Images, πŸ“ Text COYO
Conceptual Captions 3.3M images annotated with captions πŸ–ΌοΈ Images, πŸ“ Text Conceptual Captions
COCO 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; 250,000 people with keypoints. πŸ–ΌοΈ Images, πŸ“ Text COCO
ShareGPT 100K highly descriptive image-caption pairs πŸ–ΌοΈ Images, πŸ“ Text ShareGPT
ADE20K 20,210 training images; 2,000 validation images; 3,000 test images. πŸ–ΌοΈ Images ADE20K

πŸ€– Models

πŸ”§ Fine-Tuning & Regularization

  • Spectral Shift Fine-Tuning πŸ”—
    Introduces a compact parameter space called "spectral shift" for diffusion model fine-tuning. It reduces overfitting and storage inefficiency while achieving comparable or superior results in both single- and multi-subject generation. The method also employs the Cut Mix-Unmix data augmentation technique for improved multi-subject quality and acts as a regularizer enabling applications like single-image editing.

  • Control via Zero Convolutions (ControlNet) πŸ”—
    Addresses the limited spatial control of text-to-image models by locking large pre-trained diffusion models and reusing their deep encoding layers as a robust backbone. Connected via "zero convolutions" (zero-initialized convolution layers), this approach progressively grows parameters from zero to prevent harmful noise during fine-tuning, thereby facilitating diverse conditional controls.

βœ‚ Image Editing & Disentanglement

  • Lightweight Disentanglement for Image Editing πŸ”—
    Explores the inherent disentanglement properties of stable diffusion models. By partially replacing text embeddings from a style-neutral description with one that reflects the desired style, a lightweight algorithm (optimizing only 50 parameters) is introduced for improved style matching and content preservation, outperforming more complex fine-tuning baselines.

  • SmartEdit πŸ”—
    Frames image editing as a supervised learning problem by generating a paired training dataset of text editing instructions with before/after images. Built on the Stable Diffusion framework, it successfully handles challenging edits such as object replacement, seasonal changes, background modifications, and alterations of material attributes or artistic mediums.

  • Classifier-Free Guidance πŸ”—
    Employs a modified classifier-free guidance strategy in two ways: by introducing model-based classifier-free guidance and by planting a content "seed" early during denoising. Coupled with a patch-based fine-tuning strategy on latent diffusion models (LDMs), this approach enables generation at arbitrary resolutions while leveraging large pre-trained models (a minimal guidance-combination sketch follows this list).

  • Null Embedding Optimization for High-Fidelity Reconstructions πŸ”—
    Observes that DDIM inversion provides a good starting point but struggles with classifier-free guidance. By optimizing the unconditional null embedding used in classifier-free guidance, this method achieves high-fidelity reconstructions without additional tuning of the model or conditional embeddings, thereby preserving editing capabilities.

  • Unified Diffusion Model Editing Algorithm πŸ”—
    Follows a three-stage approach: (i) optimizing text embeddings to match a given image, (ii) fine-tuning diffusion models for improved image alignment, and (iii) linearly interpolating between optimized and target text embeddings. This unified algorithm enables precise editing of diffusion models, aiming to make them more responsible and beneficial.

  • Debiasing Text-to-Image Diffusion Models πŸ”—
    Enables targeted debiasing, removal of potentially copyrighted content, and moderation of offensive concepts using only text descriptions. This editing methodology can be applied to any linear projection layer by replacing pre-trained weights while preserving key concepts.
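
Several of the methods above rely on classifier-free guidance at sampling time. The sketch below is a minimal, model-agnostic illustration (the `model` signature and the embeddings are placeholders, not any specific paper's API): the conditional and unconditional noise predictions are blended with a guidance scale.

```python
# Classifier-free guidance: push the denoising direction toward the condition.
import torch

def cfg_noise(model, x_t, t, cond, null_cond, guidance_scale=7.5):
    eps_cond = model(x_t, t, cond)         # prediction with the text condition
    eps_uncond = model(x_t, t, null_cond)  # prediction with the "null" condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```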

πŸ‘½ Multimodal Conversations & Visual Understanding

  • AlignGPT πŸ”—
    Comprises a multimodal large language model (MLLM) for enhanced multimodal perception. An accompanying AlignerNet bridges the MLLM to the diffusion U-Net image decoder, enabling coherent integration of textual and visual information.

  • KOSMOS-G πŸ”—
    Offers seamless concept-level guidance from interleaved input to the image decoder. Serving as an alternative to CLIP, it facilitates effective image generation by guiding the diffusion process with interleaved multimodal cues.

  • MM-REACT πŸ”—
    Presents a unified approach that synergizes multimodal reasoning and action to tackle complex visual understanding tasks. Extensive zero-shot experiments demonstrate its capabilities in multi-image reasoning, multi-hop document understanding, and open-world concept comprehension.


πŸ‘€ Avatar

Reviews approaches for both 2D and 3D avatar creation, emphasizing lifelike digital representations with detailed facial expressions and body dynamics.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
WildAvatar Over 10,000 human subjects; extracted from YouTube; significantly richer than previous datasets for 3D human avatar creation πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data, πŸ”Š Audio WildAvatar
ZJU-MoCap Multi-camera system with 20+ synchronized cameras; includes SMPL-X parameters for detailed motion capture of body, hand, and face; complex actions such as twirling, Taichi, and punching πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data ZJU-MoCap
TalkSHOW 26.9 hours of in-the-wild talking videos from 4 speakers; expressive 3D whole-body meshes reconstructed at 30 fps, synchronized with audio at 22 kHz πŸ”Š Audio, πŸ”· 3D/Point Cloud Data TalkSHOW
HuMMan 1,000 human subjects, 400k sequences, 60M frames; include point clouds, SMPL parameters, and textured meshes for multimodal sensing πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data HuMMan
BUFF 6 subjects performing motions in two clothing styles; 13,632 3D scans with high-resolution ground-truth minimally-clothed shapes πŸ”· 3D/Point Cloud Data BUFF
AMASS Combines 15 motion capture datasets into a unified framework with over 42 hours of motion data; 346 subjects and 11,451 motions with SMPL pose parameters, 3D shape parameters, and soft-tissue coefficients πŸ”· 3D/Point Cloud Data AMASS
3DPW 60 video sequences with accurate 3D poses using video and IMU data; 18 re-poseable 3D body models with different clothing variations πŸŽ₯ Video, ⏱️ Time-Series Data, πŸ”· 3D/Point Cloud Data 3DPW
AIST++ 10,108,015 frames of 3D key points with corresponding images; 1,408 dance motion sequences spanning 10 dance genres with synchronized music πŸŽ₯ Video, πŸ”Š Audio, πŸ”· 3D/Point Cloud Data AIST++
RenderMe-360 Over 243 million head frames from 500 identities; includes FLAME parameters, UV maps, action units, textured meshes, and diverse annotations πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data RenderMe-360
PuzzleIOI 41 subjects with nearly 1,000 Outfit-of-the-Day (OOTD) configurations; includes paired ground-truth 3D body scans for challenging partial photos πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data, πŸ“ Text PuzzleIOI

πŸ€– Models

πŸ” CLIP-Guided Models

  • AvatarCLIP πŸ”—
    A zero-shot framework for generating and animating 3D avatars from natural language descriptions. It uses a shape VAE for initial geometry generation guided by CLIP and integrates NeuS for high-quality geometry and photorealistic rendering. In the motion phase, candidate poses are selected via CLIP and a motion VAE synthesizes smooth motions.

  • DreamField πŸ”—
    Adapts NeRF for text-driven 3D object generation. While it facilitates text-to-3D synthesis, it struggles with capturing detailed geometry.

  • Text2Mesh πŸ”—
    Stylizes existing meshes using CLIP guidance. It aims for text-driven mesh modifications but faces challenges with stability and flexibility when handling diverse text descriptions.

🧩 Implicit Function-Based Models

  • PIFu (Pixel-Aligned Implicit Function) πŸ”—
    Reconstructs detailed 3D surfaces from single-view 2D images by projecting 3D points into 2D space to extract pixel-aligned features via CNNs, which are then processed by an MLP for high-resolution surface reconstructions.

  • PIFuHD πŸ”—
    Enhances PIFu by incorporating multi-scale feature extraction, leading to improved global shape understanding and finer surface details.

  • ARCH (Animatable Reconstruction of Clothed Humans) πŸ”—
    Reconstructs detailed 3D models of clothed individuals from single RGB images. It transforms poses into a canonical space using a parametric body model and employs an implicit surface representation to capture fine details such as clothing folds.

  • ARCH++ πŸ”—
    An enhanced version of ARCH that refines geometry encoding and boosts clothing details to produce photorealistic, animatable avatars.

  • PaMIR (Parametric Model-Conditioned Implicit Representation) πŸ”—
    Combines a parametric SMPL body model with an implicit surface representation to reconstruct 3D humans from single RGB images. It uses a depth-ambiguity-aware loss and refines SMPL parameters during inference for better alignment.

  • TADA (Text to Animatable Dynamic Avatar) πŸ”—
    Generates high-fidelity, animatable 3D avatars directly from text prompts. It leverages an upsampled SMPL-X model and learnable displacements, optimizing geometry and texture via Score Distillation Sampling losses, with additional detail enhancement through partial mesh subdivision.

  • GETAvatar (Generative Textured Meshes for Animatable Human Avatars) πŸ”—
    Directly produces high-fidelity, explicitly textured 3D meshes. It represents human bodies using an articulated 3D mesh and generates a signed distance field (SDF) in canonical space, which is deformed to match the target shape and pose via SMPL-based transformations. A normal field trained on 3D scans enhances fine geometric details.

  • RodinHD πŸ”—
    Creates 3D avatars from a single portrait image by constructing a detailed 3D blueprint (triplane) that captures the avatar's shape, textures, and fine details. A shared neural decoder then converts this blueprint into an image, with a cascaded diffusion model generating new triplanes based on the portrait.

πŸŽ₯ NeRF-Based Methods

  • HumanNeRF πŸ”—
    Pioneers the use of deformation fields for dynamic human models from monocular images, enabling the mapping of points from observation to canonical space.

  • Neural Body πŸ”—
    Introduces structured latent codes anchored to SMPL model vertices, processed via SparseConvNet, to regularize dynamic human modeling.

  • Neural Human Performer πŸ”—
    Captures dynamic human information directly in the observation space using a skeletal feature bank and transformer modules.

  • Vid2Avatar πŸ”—
    Jointly models human subjects and scene backgrounds using two separate neural radiance fields, enhancing realism in avatar generation.

  • DreamHuman πŸ”—
    Generates animatable 3D human avatars from textual descriptions by combining NeRF with the imGHUM body model. It uses human body shape statistics for anatomical correctness and incorporates semantic zooming for detailed regions such as faces and hands.

🌈 Diffusion-Based Methods

  • Personalized Avatar Scene (PAS) πŸ”—
    Generates customized 3D avatars in various poses and scenes based on text descriptions. It employs a diffusion-based transformer to generate 3D body poses conditioned on text.

  • 3D Head Avatar via 3DMM & Diffusion πŸ”—
    Combines a parametric 3D Morphable Model of the head (using FLAME) with diffusion models to jointly optimize geometry and texture for generating 3D head avatars from text prompts.

  • Make-Your-Anchor πŸ”—
    Introduces a novel approach for generating 2D anchor-style avatars capable of realistic full-body motion and expression. It utilizes a Structure-Guided Diffusion Model (SGDM) to ensure coherent and expressive avatar generation.

πŸ”€ Hybrid Methods

  • DreamAvatar πŸ”—
    Integrates shape priors, diffusion models, and NeRF architecture within a dual-observation-space (DOS) framework. Leveraging SMPL for anatomical guidance and employing joint optimization with a specialized head-focused VSD loss (using ControlNet), it ensures structurally consistent avatars with controllable shape modifications. While it outperforms methods like DreamWaltz in geometric accuracy, it currently lacks animation capabilities and may inherit biases from pretrained diffusion models.

  • DreamWaltz πŸ”—
    Referenced as a comparative baseline, this model illustrates limitations in animation capabilities and inherited biases when compared to hybrid approaches like DreamAvatar.


🀝 Gesture

Examines methods for generating human-like gestures and co-speech movements, critical for interactive and immersive animations.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
IEMOCAP 151 recorded dialogue videos, with 2 speakers per session, totaling 302 videos. Annotated for 9 emotions and valence, arousal, and dominance. Contains approximately 12 hours of audiovisual data. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data IEMOCAP
SaGA 25 dialogues between interlocutors (50 total). Language: German. Published speakers: 6, unpublished speakers: 19. Annotated gestures: 1,764 (total corpus). Total video duration: 1 hour. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data SaGA
Creative-IT Data from 16 actors (male and female). Affective dyadic interactions range from 2 to 10 minutes each. Approximately 8 sessions of audiovisual data were released. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data CreativeIT
CMU Panoptic 3D facial landmarks from 65 sequences (5.5 hours). Contains 1.5 million 3D skeletons. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text CMU Panoptic
Speech-Gesture A 144-hour dataset featuring 10 speakers. Includes frame-by-frame, automatically detected pose annotations. πŸŽ₯ Video, πŸ”Š Audio Speech-Gesture
Talking With Hands 16.2M 16.2 million frames (50 hours) of two-person, face-to-face spontaneous conversations. Strong covariance in arm and hand features. πŸŽ₯ Video, πŸ”Š Audio Talking With Hands
PATS 25 speakers, 251 hours of data, approximately 84,000 intervals. Mean interval length: 10.7 seconds. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text PATS
Trinity Speech-Gesture II 244 minutes of motion capture and audio (23 takes). Includes one male native English speaker. The skeleton consists of 69 joints. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data Trinity
SaGA++ 25 recordings, totaling 4 hours of data. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data SaGA++
ZEGGS 67 monologue sequences with 19 different motion styles. Performed by a female actor speaking in English. Total duration: 134.65 minutes. πŸŽ₯ Video, πŸ”Š Audio ZEGGS
BEAT 76 hours of 3D motion capture data from 30 speakers. Covers 8 emotions and 4 languages. Includes 32 million frame-level emotion and semantic relevance annotations. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data BEAT
BEAT2 60 hours of mesh-level, motion-captured co-speech gesture data. Integrates SMPL-X body and FLAME head parameters. Enhances modeling of head, neck, and finger movements. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data BEAT2
GAMT 176 video clips of volunteers using math terms and gestures. Covers 8 classes of mathematical terms and gestures. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text GAMT
SeG 208 types of global semantic gestures. 544 motion files recorded from a male performer. Each gesture is represented in 2.6 variations on average. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data SeG
DND Group Gesture 6 hours of gesture data from 5 individuals playing Dungeons & Dragons. Recorded over 4 sessions (total duration: 6 hours). Includes beat, iconic, deictic, and metaphoric gestures. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data DND

πŸ€– Models

πŸ›  Traditional & Parametric Approaches

  • Parameter-Based Procedural Animation πŸ”—
    Uses high-level control parameters (e.g., emotion, speech intensity, rhythm) to select and interpolate predefined keyframes, yielding smooth and coherent gesture sequences.

  • Blendshape Models πŸ”—
    Generates detailed hand and finger gestures by blending a set of predefined base shapes using weighted interpolation, enabling fine-grained control and smooth transitions.

🧠 Deep Learning-Based Models

  • GestureGAN πŸ”—
    Employs a GAN-based generator-discriminator framework to synthesize realistic gesture sequences conditioned on audio inputs, capturing dynamic hand gestures effectively.

  • Speech2Gesture πŸ”—
    Generates co-speech gestures directly from speech features using LSTM/RNN architectures, effectively modeling temporal dependencies between speech and gesture.

  • StyleGestures πŸ”—
    Utilizes an encoder-decoder architecture with style tokens and Transformers to capture individual speaker styles, enabling personalized gesture synthesis.

  • Audio-Driven Adversarial Gesture Generation πŸ”—
    Combines GANs with Conditional Variational Autoencoders (CVAE) to align audio and motion features in a shared latent space, resulting in nuanced, audio-driven gestures.

  • GestureDiffuCLIP πŸ”—
    Leverages a diffusion process guided by CLIP for semantic alignment to iteratively refine gesture sequences, producing highly expressive gestures.

  • ZeroEGGS πŸ”—
    Implements a zero-shot paradigm for generating gestures based solely on speech, using example-based learning to generalize across unseen gestural styles.

  • GestureMaster πŸ”—
    Utilizes a graph neural network (GNN) framework to model spatial and temporal dependencies in gesture sequences, enhancing naturalistic hand and body gesture synthesis.

  • ExpressGesture πŸ”—
    Integrates emotion recognition with gesture generation pipelines, creating gestures that reflect both the content of speech and underlying sentiment.

  • MocapNET πŸ”—
    Bridges traditional motion capture with neural synthesis by combining 2D pose estimation and 3D gesture reconstruction using multimodal motion capture datasets.

  • CSMP πŸ”—
    A diffusion-based co-speech gesture generation model that leverages joint text and audio representations to capture intricate inter-modal relationships.

  • ZS-MSTM πŸ”—
    Introduces a zero-shot style transfer method for gesture animation through adversarial disentanglement, separating style and content features for effective style transfer across speakers.

πŸš€ Transformer-Based Models

  • Gesticulator πŸ”—
    Employs a multimodal Transformer architecture to generate contextually relevant gestures conditioned on both text and audio inputs, aligning with co-speech dynamics.

  • Mix-StAGE πŸ”—
    Uses an attention-based encoder-decoder with a style encoder and mixed spatial-temporal attention mechanisms to capture dynamic, expressive gestures.

  • SAGA (Style and Grammar-Aware Gesture Generation) πŸ”—
    Combines an LSTM-based encoder-decoder with a Transformer-based grammar encoder to align gestures accurately with linguistic content, integrating both style and grammatical cues.

  • Cross-Modal Transformer πŸ”—
    Leverages cross-attention mechanisms to fuse diverse modalities (text, audio, video), enhancing the coherence and contextual alignment of generated gestures.

  • DiM-Gesture πŸ”—
    Introduces an adaptive layer normalization mechanism (Mamba-2) to adjust to different speakers, focusing on generating realistic co-speech gestures from audio.

  • AMUSE πŸ”—
    Utilizes a disentangled latent diffusion technique to separate emotional expressions from gestures, enabling control over emotional aspects via a multi-stage training pipeline.

  • FreeTalker πŸ”—
    Employs a diffusion-based framework with classifier-free guidance and a generative prior (DoubleTake) to produce natural transitions between gesture clips, extending beyond co-speech gestures.

  • CoCoGesture πŸ”—
    Addresses long-sequence gesture generation with a Transformer-based diffusion model that uses a large dataset (GES-X) and a mixture-of-experts framework to effectively align gestures with human speech.

  • DiffuseStyleGestures πŸ”—
    Integrates audio, text, speaker IDs, and seed gestures within a diffusion-based approach to produce stylistically diverse co-speech gesture outputs.

  • DiffuseStyleGesture+ πŸ”—
    Builds upon DiffuseStyleGestures by further refining gesture synthesis through advanced multimodal integration and specialized attention mechanisms for personalized outputs.

  • ViTPose πŸ”—
    Applies Vision Transformers to human pose estimation, providing a robust foundation for gesture synthesis by accurately capturing pose dynamics.

  • Gesture Motion Graphs πŸ”—
    Utilizes graph-based modeling for few-shot gesture reenactment, effectively representing motion sequences and their dependencies.

  • DiffSHEG πŸ”—
    Adopts a diffusion-based approach for real-time speech-driven 3D expression and gesture generation, leveraging joint text and audio representations for coherent outputs.

  • C2G2 πŸ”—
    Emphasizes controllability in co-speech gesture generation by using modular components to handle different aspects of gesture synthesis.

  • DiffuGesture πŸ”—
    Focuses on generating gestures for two-person dialogues with specialized diffusion techniques tailored for interactive and conversational settings.


πŸŽ₯ Motion

Highlights text-constrained motion generation techniques, including MotionGPT and diffusion frameworks, for creating smooth and realistic animation sequences.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
Motion-X++ 19.5 million 3D poses across 120,500 sequences, synchronized with 80,800 RGB videos and 45,300 audio tracks. Annotated with free-form text descriptions. πŸ”· 3D/Point Cloud Data, πŸ“ Text, πŸ”Š Audio, πŸŽ₯ Video Motion-X++
HumanMM (ms-Motion) 120 long-sequence 3D motions reconstructed from 500 in-the-wild multi-shot videos, totaling 60 hours of data. Includes rare interactions. πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video HumanMM
Multimodal Anatomical Motion 51,051 annotated poses with 53 anatomical landmarks, captured across 48 virtual camera views per pose. Includes 2,000+ pathological motion variations. πŸ”· 3D/Point Cloud Data, πŸ“ Text -
AMASS 11,265 motion clips aggregated from 15 mocap datasets (e.g., CMU, KIT), totaling 43 hours of motion data in SMPL format. Covers 100+ action categories. πŸ”· 3D/Point Cloud Data AMASS
HumanML3D 14,616 motion sequences (28.6 hours) paired with 44,970 free-form text descriptions spanning 200+ action categories. πŸ”· 3D/Point Cloud Data, πŸ“ Text HumanML3D
BABEL 43 hours of motion data from AMASS, annotated with 250+ verb-centric action classes across 13,220 sequences. Includes temporal action boundaries. πŸ”· 3D/Point Cloud Data, πŸ“ Text BABEL
AIST++ 1,408 dance sequences (10.1 million frames) captured from 9 camera views, totaling 15 hours of multi-view RGB video data. πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video AIST++
3DPW 60 sequences (51,000 frames) captured in diverse indoor/outdoor environments, featuring challenging poses and natural object interactions. πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video 3DPW
PROX 20 subjects performing 12 interactive scenarios in 3D scenes, including 180 annotated RGB frames for scene-aware motion analysis. πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images PROX
KIT-ML 3,911 motion clips (11.23 hours) with 6,278 natural language annotations containing 52,903 words, stored in BVH/FBX formats. πŸ”· 3D/Point Cloud Data, πŸ“ Text KIT-ML

πŸ€– Models

πŸ”€ Language-to-Pose Models

  • Language2Pose πŸ”—
    Generates 3D human poses directly from natural language. It employs two encoders (for text and 3D motion) that map inputs into a joint embedding space, and a decoder that produces a fixed-length motion sequence by minimizing the distance between corresponding text and motion embeddings.

  • MotionClip πŸ”—
    An end-to-end pipeline for motion generation based on an encoder-decoder transformer. The model extracts a high-level motion representation and uses multiple loss functions (comparing joint orientations, velocities, and image-text groundings via CLIP) to enhance motion quality.

πŸ“¦ Variational Auto-Encoder (VAE) Based Models

  • ACTOR πŸ”—
    Generates diverse and realistic 3D human motions conditioned on action labels. This approach uses a transformer-based VAE to encode actions and poses into a Gaussian latent space, allowing sampling of varied motions for the same action prompt.

  • TEMOS (Text-To-Motions) πŸ”—
    A text-conditioned generative model that uses a transformer-based VAE architecture with two symmetric encoders (for motion and text). It learns a diverse latent space by aligning text and pose embeddings to generate meaningful SMPL body motions.

  • Teach πŸ”—
    Transforms sequences of text descriptions into SMPL body motions. The model operates non-autoregressively within individual actions and autoregressively across action sequences by leveraging a past-conditioned text encoder that combines historical motion features with current text input.

  • Generating Diverse and Natural 3D Human Motions from Text πŸ”—
    Generates 3D human motions from textual descriptions by first pre-training an auto-encoder (using convolutional and deconvolutional layers) and then utilizing a temporal VAE with three recurrent networks (prior, posterior, and generator) to produce motion snippets.

  • TMR πŸ”—
    Enhances transformer-based text-to-motion generation by mapping motion and text embeddings into a joint space. Dual transformer encoders are used for each modality, and cosine similarity between embeddings is maximized for positive pairs while filtering out negatives via MPNet similarity.

πŸ— VQ-VAE Based Models

  • T2M GPT πŸ”—
    The first model to apply VQ-VAE to motion generation. It learns a discrete representation (codebook) of motion and formulates motion generation as an autoregressive token prediction task, conditioned on text encoded by CLIP (a minimal quantization sketch follows this list).

  • DiverseMotion πŸ”—
    Builds upon T2M GPT by discarding the autoregressive generation in favor of a diffusion process to diversify motion outputs. It employs CLIP for text encoding and Hierarchy Semantic Aggregation (HSA) to generate a richer holistic text embedding.

  • MoMask πŸ”—
    Uses a hierarchical VQ-VAE to quantize motion sequences into discrete tokens over multiple layers. A masked transformer predicts missing tokens (similar to BERT), and a residual transformer refines these predictions to incorporate fine motion details.

  • T2LM Long-Term 3D Human Motion πŸ”—
    Transforms sequences of text descriptions into 3D motion sequences using a 1D-convolutional VQ-VAE and a transformer-based text encoder. This method generates smooth transitions between actions, outperforming earlier techniques like TEACH.

  • MotionGPT πŸ”—
    Generates human motion from text by leveraging a pre-trained motion VQ-VAE alongside a large language model (LLM). The LLM is fine-tuned with LoRA to generate motion tokens that the VQ-VAE decoder transforms into motion sequences, significantly speeding up training.
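
The models above share the vector-quantization step that turns continuous motion latents into discrete tokens a transformer can predict. Below is a minimal, generic sketch of that step with a straight-through gradient estimator; the tensor shapes and codebook size are illustrative.

```python
# Snap each encoder latent to its nearest codebook entry (VQ-VAE quantization).
import torch

def quantize(latents, codebook):
    """latents: (T, D) encoder outputs; codebook: (K, D) learned code vectors."""
    dists = torch.cdist(latents, codebook)   # (T, K) pairwise distances
    tokens = dists.argmin(dim=1)             # discrete motion token ids
    quantized = codebook[tokens]             # nearest codebook vectors
    # Straight-through estimator: gradients flow to the encoder as if the
    # quantization step were the identity.
    quantized = latents + (quantized - latents).detach()
    return tokens, quantized

# Toy usage: 16 latent frames, 64-dim latents, a codebook of 512 entries.
tokens, q = quantize(torch.randn(16, 64), torch.randn(512, 64))
print(tokens.shape, q.shape)  # torch.Size([16]) torch.Size([16, 64])
```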

🌈 Diffusion-Based Models

  • Flame πŸ”—
    Employs a transformer-based motion decoder within a diffusion framework. It conditions on text using cross-attention (with text embeddings from RoBERTa) and incorporates special tokens for motion length and diffusion time steps. The model is optimized with a hybrid loss combining diffusion noise loss and a variational lower bound loss, with classifier-free guidance during inference.

  • MotionDiffuse πŸ”—
    Similar to Flame but with slight architectural variations: it selects a random diffusion time step and divides the motion sequence into sub-intervals for time-varying conditioning. It utilizes efficient attention modules and optimizes using mean squared error on the noise prediction.

  • HMDM πŸ”—
    A diffusion-based model with a fixed motion sequence length that leverages CLIP's text encoder. It introduces additional loss functions (e.g., position, foot, and velocity losses) defined on the reconstructed motion signal, rather than just the noise, to improve temporal consistency and motion fidelity.

  • Make-An-Animation πŸ”—
    Proposes a two-stage diffusion framework for text-to-3D motion generation. The model pre-trains on a large-scale static pose dataset using a UNet backbone and T5 text encoder, then fine-tunes on motion datasets, generating the entire motion sequence concurrently for improved smoothness.

  • GMD (Guided Motion Diffusion) πŸ”—
    Focuses on incorporating spatial (trajectory) constraints into the diffusion process. The method uses a two-stage pipeline that first emphasizes ground location guidance and then propagates sparse guidance gradients across neighboring frames to enhance overall motion consistency.

  • OmniControl πŸ”—
    Extends spatial guidance by cumulatively summing relative pelvis locations to infer global positions. It also introduces realism guidance, propagating control signals from keyframes and the pelvis to other joints for coherent, natural motion generation.


πŸ“¦ Object

Discusses approaches for text-to-3D object generation, such as Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting, to create realistic assets.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
ShapeNet 3D models in categories like furniture and vehicles. πŸ”· 3D/Point Cloud Data, πŸ“ Text ShapeNet
BuildingNet Architectural structures for shape completion tasks. πŸ”· 3D/Point Cloud Data, πŸ“ Text BuildingNet
Text2Shape Textual descriptions linked to ShapeNet categories. πŸ“ Text, πŸ”· 3D/Point Cloud Data Text2Shape
ShapeGlot Textual utterances describing differences between shapes. πŸ“ Text, πŸ”· 3D/Point Cloud Data ShapeGlot
Pix3D 3D models aligned with real-world images for evaluation. πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data Pix3D
LAION-5B Large-scale dataset with 5 billion image-text pairs. πŸ–ΌοΈ Images, πŸ“ Text LAION-5B
COCO-Stuff Annotated images for real-world 3D synthesis. πŸ–ΌοΈ Images, πŸ“ Text COCO-Stuff
Flickr30K Image dataset with diverse textual descriptions. πŸ–ΌοΈ Images, πŸ“ Text Flickr30K
ModelNet40 3D CAD models across 40 object categories. πŸ”· 3D/Point Cloud Data ModelNet40
ShapeNetCore Subset of ShapeNet with detailed object models. πŸ”· 3D/Point Cloud Data, πŸ“ Text ShapeNetCore
BlendSwap Realistic 3D models with physically based rendering (PBR). πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images BlendSwap
InstructPix2Pix Dataset for instruction-driven image modifications. πŸ–ΌοΈ Images, πŸ“ Text InstructPix2Pix
MagicBrush Dataset for refining texture and appearance in 3D. πŸ–ΌοΈ Images MagicBrush
NeRF-Synthetic 2D images rendered from synthetic 3D scenes. πŸ–ΌοΈ Images NeRF-Synthetic
ScanNet 2.5M RGB-D views with semantic segmentations and camera poses. πŸ–ΌοΈ Images, πŸ“ Text ScanNet
Matterport3D 10,800 panoramic views from 90 building-scale scenes. πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data, πŸ“ Text Matterport3D

πŸ€– Models


🧡 Texture

Focuses on methods for generating detailed surface textures that enhance the realism of 3D models, including text-guided synthesis and neural rendering techniques.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
3D-FUTURE 9,992 detailed 3D furniture models with high-res textures; 20,240 synthetic images across 5,000 scenes. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture, πŸ–ΌοΈ 2D Images 3D-FUTURE
Objaverse Over 800K textured 3D models with natural-language descriptions across diverse categories. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture, πŸ“ Language Objaverse
ShapeNet Large-scale structured 3D meshes (incl. 300 car models) used for texture benchmarking. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture ShapeNet
ShapeNetSem Semantic extension of ShapeNet with 445 annotated meshes for structure-aware evaluation. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture ShapeNetSem
ModelNet40 40-category CAD benchmark for generalization testing in geometry-aware texture generation. πŸ”· 3D Geometry ModelNet40
Sketchfab Repository of commercial and scanned 3D models for qualitative texture evaluation. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture Sketchfab
CGTrader High-res 3D assets for mesh diversity in text-driven synthesis. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture CGTrader
TurboSquid Commercial dataset of detailed assets and fine-surface textures for high-fidelity evaluations. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture TurboSquid
RenderPeople High-quality human scans with detailed anatomy and surface properties for text-to-texture testing. πŸ”· 3D Scans RenderPeople
Tripleganger Scanned high-fidelity human models for evaluating facial and clothing texture realism. πŸ”· 3D Scans Tripleganger
Stanford 3D Scans High-resolution object scans for generalization tests on real-world geometries. πŸ”· 3D Scans Stanford 3D Scans
ElBa 30K synthetic texture images with 3M texel-level annotations for element-based analysis. πŸ–ΌοΈ 2D Texture, πŸ“ Attributes & Layout ElBa

πŸ€– Models

  • CLIP-Pseudo Inpainting πŸ”—
    Pioneering masked-inpainting pipeline using CLIP pseudo-captioning to semantically align 2D renderings with 3D geometry without paired text data.
    arXiv:2303.13273

  • Text2Tex πŸ”—
    Two-stage diffusion: Stage I generates initial textures via depth-to-image denoising; Stage II back-projects and refines them in UV space by selecting extra views to correct artifacts.
    arXiv:2303.11396

  • TEXTure πŸ”—
    Inpainting-based diffusion with trimap-based surface partitioning into generate, refine, and keep regions, ensuring smooth transitions and efficient passes.
    Project page

  • Paint-it πŸ”—
    Integrates PBR rendering and U-Net reparameterization with CLIP-guided Score Distillation Sampling for high-fidelity mesh texturing, at the cost of per-model optimization time.
    arXiv:2312.11360

  • Point-UV Diffusion πŸ”—
    Coarse-to-fine pipeline: initial mesh-surface painting then 2D UV diffusion refinement, decoupling global structure generation from fine-detail synthesis.
    ICCV 2023 paper

  • TexPainter πŸ”—
    Latent diffusion in color-space embeddings using depth-conditioned DDIM sampling across fixed viewpoints, aggregated into a unified texture map.
    arXiv:2406.18539

  • TexFusion πŸ”—
    Sequential Interlaced Multiview Sampler fuses multi-view latent features during diffusion, reducing inference time while preserving cross-view coherence.
    arXiv:2310.13772

  • GenesisTex πŸ”—
    Cross-view attention during diffusion followed by Img2Img post-processing to eliminate seams and enhance surface detail in UV maps.
    arXiv:2403.17782

  • Consistency² πŸ”—
    Latent Consistency Models that achieve fast, multi-view coherent textures with just four denoising steps, disentangling noise and color paths.
    arXiv:2406.11202

  • Meta 3D TextureGen πŸ”—
    Two-stage: geometry-aware diffusion produces multi-view images; incidence-aware UV inpainting and patch upscaling yield seamless 4K textures.
    Meta Research

  • VCD-Texture πŸ”—
    Variance Alignment with joint noise prediction and multi-view aggregation modules to maintain statistical feature consistency across views.
    arXiv:2407.04461


πŸ”— Citations

If you find our paper or repository useful, please cite the paper:

@article{abootorabi2025generative,
  title={Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions},
  author={Abootorabi, Mohammad Mahdi and Ghahroodi, Omid and Zahraei, Pardis Sadat and Behzadasl, Hossein and Mirrokni, Alireza and Salimipanah, Mobina and Rasouli, Arash and Behzadipour, Bahar and Azarnoush, Sara and Maleki, Benyamin and others},
  journal={arXiv preprint arXiv:2504.19056},
  year={2025}
}

πŸ“§ Contact

If you have questions, please send an email to mahdi.abootorabi2@gmail.com.

⭐ Star History

Star History Chart
