
Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

arXiv Website

This repository collects and categorizes papers, datasets, and resources related to generative AI for character animation, following the structure of our survey. As generative AI continues to transform animation, from realistic facial synthesis to dynamic gesture and motion generation, we will keep both the paper and this repository up to date so that they serve as a comprehensive guide for researchers and practitioners working on future projects.


πŸ“’ News

  • April 27, 2025: We release the first version of our survey.

    Feel free to cite, contribute, or open a pull request to add recent related papers!

πŸ“ Abstract

Generative AI is transforming various fields, including art, gaming, and animation. One of its most significant applications lies in animation, where advances in artificial intelligence, such as foundation models and diffusion models, have driven remarkable progress, significantly reducing the time and cost of content creation. Characters are central to animation and involve elements such as motion, emotions, gestures, and facial expressions. Rapid and wide-ranging developments in AI-driven animation technologies have made it challenging to maintain an overarching view of progress in the field, highlighting the need for a comprehensive survey to integrate and contextualize these advancements.

This survey offers a comprehensive review of the state-of-the-art generative AI applications for animated character design and behavior, integrating a wide range of aspects often examined in isolation (e.g., avatars, gestures, and facial expressions). Unlike previous studies, it provides a unified perspective covering all major applications of generative AI in character animation. The survey begins with foundational concepts and introduces evaluation metrics tailored to this domain, then explores key areas such as facial animation, image synthesis, avatar generation, gesture modeling, motion synthesis, expression rendering, and texture generation. Finally, it addresses the main challenges and outlines future research directions, offering a roadmap to advance AI-driven character animation technologies. This survey aims to serve as a resource for researchers and developers in generative AI for animation and related fields.


πŸ—Ί Overview

overview.png

🌳 Taxonomy

Taxonomy_page-0001

πŸ“š Background

πŸ€– Models

🎨 Computer Graphics Models

  • SMPL πŸ”—
    A popular parametric model representing 3D human body geometry using a low-dimensional representation for shape (β) and pose (θ).
    • SMPL+H
      An extension of SMPL that incorporates detailed hand modeling by introducing hand joint parameters (θ_hands).
    • SMPL-X
      Further extends SMPL+H by including facial expressions along with detailed hand and body modeling for full-body human representation.
  • SMIL (Skinned Multi-Infant Linear Model) πŸ”—
    A model developed specifically for infants, addressing challenges in capturing non-cooperative subjects with low-quality RGB-D data.
  • SMAL (Skinned Multi-Animal Linear Model) πŸ”—
    Designed for 3D modeling of animals, enabling the creation of a shape space from a few scans of diverse species.
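
The following is a minimal sketch of driving a parametric body model from the SMPL family described above: shape coefficients (β) and axis-angle pose parameters (θ) go in, a posed mesh and joint locations come out. It assumes the third-party `smplx` Python package and SMPL model files downloaded separately from the official site; paths and parameter sizes are illustrative.

```python
# Minimal sketch: pose a SMPL body from shape (beta) and pose (theta) parameters.
# Assumes `pip install smplx torch` and SMPL model files placed under ./models.
import torch
import smplx

model = smplx.create("./models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)          # shape coefficients (beta)
body_pose = torch.zeros(1, 69)      # axis-angle pose for 23 body joints (theta)
global_orient = torch.zeros(1, 3)   # root orientation

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(output.vertices.shape)        # (1, 6890, 3): the posed SMPL mesh
print(output.joints.shape)          # 3D joint locations for the same pose
```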

πŸ‘€ Vision

  • Convolutional Neural Networks (CNNs) πŸ”—
    CNNs are specialized for image-related tasks by using convolution, pooling, and fully connected layers.
  • 3D CNNs πŸ”—
    Extend CNNs to process volumetric data (e.g., videos, MRI scans) by using 3D convolutional kernels.
  • U-Net πŸ”—
    A U-shaped network architecture designed for biomedical image segmentation, known for its efficient denoising and skip connections.
  • Inception πŸ”—
    Introduces multi-scale processing via parallel convolutions (1x1, 3x3, 5x5) for improved feature extraction.
  • VGG πŸ”—
    Evaluates the impact of increasing CNN depth using very small (3x3) filters to capture complex visual features.
  • ResNet πŸ”—
    Introduces residual learning with shortcut connections to enable training of very deep networks (up to 152 layers); a minimal residual-block sketch follows this list.
  • Vision Transformers (ViTs) πŸ”—
    Applies the self-attention mechanism to image patches, offering competitive performance on image recognition tasks.
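
To make the residual-learning idea behind ResNet above concrete, here is a minimal, generic residual block in PyTorch; it illustrates the shortcut connection only and is not the exact layer layout of the original architecture.

```python
# A residual block: the network learns a residual F(x) that is added to the
# identity shortcut x, which eases optimization of very deep stacks.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```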

πŸ“ Language Models

  • RNNs πŸ”—
    General recurrent neural networks for sequence modeling.
  • Bidirectional RNNs (BRNNs) πŸ”—
    Process sequences in both directions to leverage past and future context.
  • Encoder-Decoder Frameworks πŸ”—
    Used for tasks like machine translation by compressing sequences into a fixed-length vector.
  • LSTMs πŸ”—
    Introduces memory cells and gating mechanisms to capture long-term dependencies.
  • GRUs πŸ”—
    A streamlined variant of LSTMs merging input and forget gates into an update gate.
  • Attention Mechanisms πŸ”—
    Allows models to dynamically focus on different parts of the input sequence; a minimal self-attention sketch follows this list.
  • Transformers πŸ”—
    Utilize self-attention to process sequences without recurrence.
  • BERT πŸ”—
    Bidirectional Encoder Representations from Transformers for deep language understanding.
  • GPT Series:
    • PoseGPT πŸ”—
      Specialized for pose estimation in video generation.
    • GestureGPT πŸ”—
      Extends the GPT framework to generate realistic human gestures based on text or audio input.
    • MotionGPT πŸ”—
      Designed for generating motion sequences.
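
The Transformer-family entries above (attention mechanisms, Transformers, BERT, and the GPT variants) all build on scaled dot-product attention. Below is a minimal NumPy sketch of that operation for illustration; names and shapes are assumptions of this example.

```python
# Scaled dot-product attention: each query attends to every key and returns a
# softmax-weighted sum of the values.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values

# Toy usage: 4 tokens with 8-dimensional embeddings attending to themselves.
x = np.random.default_rng(0).normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)    # (4, 8)
```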

πŸ•’ Temporal Sequence Modeling

  • Temporal Convolutional Networks (TCNs) πŸ”—
    Use causal and dilated convolutions to model sequential data efficiently.
  • Transformer-XL πŸ”—
    Extends Transformers with a segment-level recurrence mechanism to capture long-range dependencies.
  • ConvLSTM πŸ”—
    Combines CNNs and LSTM units to capture both spatial and temporal dynamics in spatiotemporal data.

πŸ—£ Speech Models

  • WaveNet πŸ”—
    An autoregressive model for raw audio synthesis using dilated causal convolutions.
  • Tacotron πŸ”—
    A sequence-to-sequence TTS model that converts text to mel-spectrograms via attention.
  • Tacotron 2 πŸ”—
    Combines Tacotron with a WaveNet vocoder for end-to-end, high-fidelity speech synthesis.
  • FastSpeech πŸ”—
    A non-autoregressive TTS model using transformers for parallel synthesis to reduce latency.
  • FastSpeech 2 πŸ”—
    Improves FastSpeech by introducing variance predictors for pitch, energy, and duration for more natural speech.
  • wav2vec πŸ”—
    A self-supervised framework for learning robust speech representations directly from raw audio.
  • wav2vec 2.0 πŸ”—
    Enhances wav2vec with quantization and contextual embeddings to improve ASR performance.
  • HuBERT πŸ”—
    Uses clustering-based pseudo-labeling and masked prediction to learn effective speech representations.
  • Whisper πŸ”—
    A transformer-based model for multilingual ASR, translation, and transcription with zero-shot capabilities.
  • SeamlessM4T πŸ”—
    An end-to-end model for universal speech translation and generation that preserves speaker emotion via attention.

🎭 Additional Generative Models

  • GANs (Generative Adversarial Networks) πŸ”—
    An adversarial framework where a generator and discriminator engage in a minimax game to synthesize realistic data.
  • CycleGAN πŸ”—
    Enables unpaired image-to-image translation by enforcing cycle consistency between two domains.
  • Autoencoders
    A general framework that compresses input data into a latent representation and reconstructs it for unsupervised learning.
  • Variational Autoencoders (VAEs) πŸ”—
    Probabilistic autoencoders that regularize the latent space using KL divergence to generate new data samples.
  • Vector Quantized VAEs (VQ-VAEs) πŸ”—
    Enhances VAEs by discretizing the latent space with a codebook for more structured representations.
  • NeRF (Neural Radiance Fields) πŸ”—
    Learns an implicit 3D scene representation via volumetric rendering for novel view synthesis.
  • 3D Gaussian Splatting (3DGS) πŸ”—
    Represents 3D scenes with a collection of Gaussian functions for efficient real-time rendering.
  • Denoising Diffusion Probabilistic Models (DDPMs) πŸ”—
    Generates high-quality outputs by iteratively denoising samples, starting from pure noise (a training-objective sketch follows this list).
  • ControlNet πŸ”—
    Augments diffusion models with auxiliary conditioning inputs for precise image generation.
  • DALL-E πŸ”—
    An autoregressive transformer that generates images from text by jointly modeling text and image tokens.
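
To make the DDPM entry above concrete, the sketch below shows the standard noise-prediction training objective: sample a timestep, corrupt the clean data with the closed-form forward process, and regress the injected noise. The denoising network `model(x_t, t)` is a placeholder, and the linear beta schedule is just one common choice.

```python
# DDPM training loss (noise prediction), assuming a user-supplied denoiser
# `model(x_t, t)` that predicts the noise added at timestep t.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    """x0: a batch of clean samples, shape (B, ...)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                  # random timestep per sample
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward diffusion
    return torch.mean((model(x_t, t) - noise) ** 2)       # predict the noise
```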

πŸ“Š Metrics

βœ… Quality and Realism of Generated Output

These metrics assess how natural, realistic, and perceptually convincing the generated content appears.

Metric Description Formula
Fréchet Inception Distance (FID) Measures statistical distance between real and generated images. $\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$
CLIP Score Evaluates semantic similarity between generated images and textual descriptions. $\text{CLIPScore} = \frac{t \cdot i}{\lVert t \rVert \lVert i \rVert}$
Mean Squared Error (MSE) Measures pixel-wise difference between generated and real images. $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2$
Learned Perceptual Image Patch Similarity (LPIPS) Assesses perceptual similarity using deep feature embeddings. $\text{LPIPS}(x,y) = \sum_l \frac{1}{H_l W_l} \sum_{h=1}^{H_l}\sum_{w=1}^{W_l}\lVert \phi_l(x)^{h,w}-\phi_l(y)^{h,w} \rVert_2^2$
Identity Consistency Ensures identity preservation in generated faces by computing cosine similarity. $\text{IC} = \frac{1}{N}\sum_{i=1}^{N} \text{cosine-sim}\Bigl(f(x_i), f(y_i)\Bigr)$
Fréchet Gesture Distance (FGD) Measures statistical differences between real and generated gesture distributions. $\text{FGD} = \lVert \mu_{\text{real}} - \mu_{\text{gen}} \rVert^2 + \text{tr}(\Sigma_{\text{real}} + \Sigma_{\text{gen}} - 2(\Sigma_{\text{real}}\Sigma_{\text{gen}})^{1/2})$
CLIP Fréchet Inception Distance (CLIP FID) A CLIP-based extension of FID for assessing generated textures. $\text{CLIPFID} = \lVert \mu_{\text{CLIP,real}} - \mu_{\text{CLIP,gen}} \rVert^2 + \text{tr}(\Sigma_{\text{CLIP,real}} + \Sigma_{\text{CLIP,gen}} - 2(\Sigma_{\text{CLIP,real}} \Sigma_{\text{CLIP,gen}})^{1/2})$
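
As a worked example of the FID formula above, the sketch below computes it from pre-extracted feature vectors (e.g., Inception or CLIP activations); FGD and CLIP FID follow the same computation with different feature extractors. Array names are illustrative.

```python
# FID from (N, D) feature arrays of real and generated samples.
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = covmean.real                                    # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```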

πŸ”„ Diversity and Multimodality

These metrics assess whether the generative model produces diverse and varied outputs.

Metric Description Formula
Diversity Quantifies variation between independently sampled subsets of generated outputs. $\text{Diversity} = \frac{1}{N}\sum_{i=1}^{N}\lVert x_i - x'_i \rVert^2$
Multimodality Measures diversity of outputs within the same action class. $\text{Multimodality} = \frac{1}{C \cdot N}\sum_{c=1}^{C}\sum_{n=1}^{N}\lVert x_{c,n} - x'_{c,n} \rVert^2$
Average Pairwise Distance (APD) Evaluates diversity across generated samples. $\text{APD} = \frac{1}{N(N-1)}\sum_{i\neq j} \lVert x_i - x_j \rVert$
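
For illustration, the sketch below computes the Average Pairwise Distance (APD) defined above over generated samples flattened to feature vectors; Diversity and Multimodality are computed analogously over paired subsets of samples.

```python
# APD over N generated samples given as an (N, D) array.
import numpy as np

def average_pairwise_distance(samples):
    n = samples.shape[0]
    diffs = samples[:, None, :] - samples[None, :, :]  # (N, N, D)
    dists = np.linalg.norm(diffs, axis=-1)             # pairwise L2 distances
    return float(dists.sum() / (n * (n - 1)))          # diagonal terms are zero
```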

🎯 Relevance and Accuracy

These metrics assess how well the generated content aligns with ground truth data.

Metric Description Formula
Mean Absolute Joint Error (MAJE) Measures positional accuracy of generated motion. $\text{MAJE} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - y_i \rvert$
Probability of Correct Keypoints (PCK) Evaluates the percentage of correct keypoint predictions. $\text{PCK} = \frac{\text{number of correct keypoints}}{\text{number of total keypoints}}$
Beat Consistency (BC) Measures alignment between motion and speech rhythms. $\text{BC} = \frac{1}{T}\sum_{t=1}^{T}\cos\bigl(\text{motion-beats}(t), \text{speech-beats}(t)\bigr)$
CLIP-Var Quantifies texture consistency across different views. $\text{CLIP-Var} = 1 - \min_{i \neq j}\frac{f_i \cdot f_j}{\lVert f_i \rVert \lVert f_j \rVert}$
Multimodal Distance (MM-Distance) Measures alignment between generated motion and textual descriptions. $\text{MM-Distance} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\lVert f_{a,n} - f_{b,n} \rVert^2}$
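
A minimal sketch of PCK as defined above: a predicted keypoint counts as correct when it lies within a distance threshold of the ground truth. The fixed threshold `alpha` used here is an assumption; evaluation protocols typically normalize it by a reference length such as head or torso size.

```python
# PCK from predicted and ground-truth keypoints of shape (N, K, dims).
import numpy as np

def pck(pred, gt, alpha=0.1):
    dists = np.linalg.norm(pred - gt, axis=-1)  # (N, K) per-keypoint errors
    return float((dists <= alpha).mean())       # fraction of correct keypoints
```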

πŸƒ Physical Plausibility and Interaction

These metrics assess whether generated motion adheres to real‑world physical constraints.

Metric Description Formula
Foot Skating (FS) Detects unnatural foot movements in generated motion. $\text{FS} = \frac{1}{T}\sum_{t=1}^{T}\lVert \text{foot-velocity}(t) - \text{expected-velocity}(t) \rVert$
Mean Acceleration Difference (MAD) Evaluates smoothness of generated motion by comparing acceleration. $\text{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lVert a_i^{\text{gen}} - a_i^{\text{gt}} \rVert^2$
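
As an illustration of MAD, the sketch below approximates accelerations from joint positions with second-order finite differences and compares generated motion against ground truth; the frame rate is an assumed parameter of this example.

```python
# MAD between generated and ground-truth joint trajectories of shape (T, J, 3).
import numpy as np

def mad(gen_pos, gt_pos, fps=30.0):
    dt = 1.0 / fps
    acc_gen = np.diff(gen_pos, n=2, axis=0) / dt**2  # second finite difference
    acc_gt = np.diff(gt_pos, n=2, axis=0) / dt**2
    return float(np.mean(np.sum((acc_gen - acc_gt) ** 2, axis=-1)))
```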

⚑️ Efficiency and Computational Metrics

These metrics evaluate the computational cost of generative models.

Metric Description Formula
Execution Time Measures the time required to generate outputs. $\text{Execution Time} = \text{End Time} - \text{Start Time}$
Kernel Inception Distance (KID) Measures output similarity using kernel functions. $\text{KID} = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j)$
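
The KID formula above is the unbiased squared-MMD estimator. A minimal sketch using the commonly adopted cubic polynomial kernel is shown below; the kernel choice is an assumption of this example.

```python
# KID between (N, D) and (M, D) feature arrays with a polynomial kernel.
import numpy as np

def kid(x_feats, y_feats):
    d = x_feats.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3              # cubic polynomial kernel
    kxx, kyy, kxy = k(x_feats, x_feats), k(y_feats, y_feats), k(x_feats, y_feats)
    n, m = x_feats.shape[0], y_feats.shape[0]
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))  # drop i == j terms
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_xx + term_yy - 2.0 * kxy.mean())
```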



πŸ‘¨ Face

Focuses on realistic face generation, facial reenactment, and attribute editing using GANs, diffusion models, and specialized frameworks.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
RaFD More than 8,000 images. Images of 67 models displaying eight facial expressions, photographed from five different angles. πŸ–ΌοΈ Images RaFD
MPIE Over 750,000 images with a broad range of variations in facial expressions, head poses, and lighting conditions. πŸ–ΌοΈ Images MPIE
VoxCeleb1 More than 100,000 utterances from 1,251 celebrities. πŸ”Š Audio, πŸŽ₯ Video VoxCeleb1
VoxCeleb2 Over 1 million utterances from 6,112 celebrities. πŸ”Š Audio, πŸŽ₯ Video VoxCeleb2
CelebA-HQ 30,000 images at a resolution of 1024×1024, providing detailed facial images of celebrities. πŸ–ΌοΈ Images CelebA-HQ
FaceForensics Over 1,000 video sequences with various face manipulations. πŸŽ₯ Video FaceForensics
300-VW About 300 videos of faces in various scenarios and lighting conditions. πŸŽ₯ Video 300-VW
FFHQ 70,000 images with extensive diversity, capturing various facial features, accessories, and environments. πŸ–ΌοΈ Images FFHQ
AffectNet Over 1 million images collected from the internet, with annotations for 11 different facial expressions and emotions. πŸ–ΌοΈ Images AffectNet
M³ CelebA Over 150K facial images annotated with semantic segmentation, facial landmarks, and captions in multiple languages. πŸ–ΌοΈ Images, πŸ“ Text M³ CelebA
CUB Over 11,000 images of 200 bird species, each annotated with various attributes like species, part locations, and bounding boxes. πŸ–ΌοΈ Images CUB
CelebA-Dialog 202,599 face images from 10,177 identities, annotated with 5 fine-grained attributes: Bangs, Eyeglasses, Beard, Smiling, Age, along with captions and user editing requests. πŸ–ΌοΈ Images, πŸ“ Text CelebA-Dialog
LS3D-W A dataset of 230,000 3D facial landmarks. πŸ–ΌοΈ Images LS3D-W
MERL-RAV Over 19,000 face images with diverse head poses, all annotated with 68-point landmarks and visibility status. πŸ–ΌοΈ Images MERL-RAV
AFLW2000-3D Contains 2000 images with 68-point 3D facial landmarks, used to evaluate 3D facial landmark detection models with diverse head poses. πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data AFLW2000-3D
FaceScape Over 18K textured 3D faces, captured from 938 subjects, each with 20 specific expressions. πŸ”· 3D/Point Cloud Data FaceScape

πŸ€– Models

  • StyleGAN πŸ”—
    A generative adversarial network known for producing high-quality, photorealistic images. It serves as a backbone for many face generation and editing tasks.

  • ResNet πŸ”—
    A convolutional neural network architecture that provides robust feature extraction, often used as a backbone in face generation pipelines.

  • Dual-Generator (DG) πŸ”—
    A large-pose face reenactment model composed of two modules: the ID-Preserving Shape Generator (IDSG), which uses 3D landmark detection to capture local shape variations, and the Reenacted Face Generator (RFG), based on StarGAN2, to produce the final output.

  • Feature Disentanglement and Identity Transfer Model πŸ”—
    An approach that bypasses the need for pre-trained structural priors by using a Feature Disentanglement module with Feature Displacement Fields (FDF) and an Identity Transfer (IdT) module based on self-attention to align source identity with target attributes.

  • Unified Neural Face Reenactment Pipeline πŸ”—
    A pipeline that leverages a 3D shape model to obtain disentangled representations of pose, expression, and identity, mapping changes in these parameters to the latent space of a fine-tuned StyleGAN2 for accurate face reenactment.

  • Controllable 3D Generative Adversarial Face Model πŸ”—
    A model that employs a Supervised Auto-Encoder (SAE) to disentangle identity and expression into separate latent spaces, using a Conditional GAN (cGAN) for smooth and controllable expression intensity.

  • AlbedoGAN πŸ”—
    A self-supervised 3D generative face model that synthesizes high-resolution albedo and detailed 3D geometry. It refines facial textures (e.g., wrinkles) via a mesh refinement displacement map integrated with the FLAME model, and leverages CLIP for text-guided editing.

  • IricGAN (Information Retention and Intensity Control GAN) πŸ”—
    A face editing method designed to preserve identity and semantic details while enabling controlled modifications of facial attributes. It features a Hierarchical Feature Combination (HFC) module and an Attribute Regression Module (ARM) for smooth intensity control.

  • GSmoothFace πŸ”—
    A speech-driven talking face generation framework based on fine-grained 3D face modeling. It addresses lip synchronization and generalizability across speakers by introducing bias-based cross-attention and a Morphology Augmented Face Blending (MAFB) module.

  • Adaptive Latent Editing Model πŸ”—
    A face editing approach that uses adaptive and nonlinear latent space transformations to flexibly learn transformations for complex, conditional edits while maintaining image quality and realism.

  • StyleT2I πŸ”—
    A text-to-image synthesis model that improves compositionality and fidelity. It uses a CLIP-guided Contrastive Loss and a Text-to-Direction module to align StyleGAN's latent codes with text descriptions, enhancing attribute control.

  • Hybrid Neural-Graphics Face Generation Model πŸ”—
    A model that combines neural networks (using StyleGAN2 for texture and background synthesis) with fixed-function graphics components (such as a differentiable renderer and the FLAME 3D head model) to achieve interpretable control over facial attributes.

  • M3Face πŸ”—
    A framework leveraging multimodal and multilingual inputs for both face generation and editing. It uses the Muse model to generate segmentation masks or landmarks from text and applies ControlNet architectures to refine the results, streamlining the process into a single step.

  • GuidedStyle πŸ”—
    A framework for semantic face editing on StyleGAN that employs a pre-trained attribute classifier as a knowledge network and sparse attention to guide layer-specific modifications, ensuring that only targeted facial features are changed.

  • AnyFace πŸ”—
    The first free-style text-to-face synthesis model capable of handling open-world text descriptions. It features a two-stream architecture that decouples text-to-face generation from face reconstruction, using CLIP-based cross-modal distillation and a Diverse Triplet Loss to enhance alignment and diversity.

  • HiFace πŸ”—
    A 3D face reconstruction model that decouples static (e.g., skin texture) and dynamic (e.g., wrinkles) details using its SD-DeTail Module. It extracts shape and detail coefficients via ResNet-50 and uses MLPs with AdaIN to generate detailed displacement maps for realistic reconstructions and animations.


πŸ˜ƒ Expression

Covers emotion-driven synthesis, facial expression retargeting, and multimodal methods that capture nuanced nonverbal cues.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
BEAT 76 hours of speech data, paired with 52D facial blend shape weights; 30 speakers performing in 8 distinct emotional styles across 4 languages. πŸ”Š Audio, πŸ–ΌοΈ Images, πŸŽ₯ Video, πŸ“ Text BEAT
MEAD A talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three intensity levels; approximately 40 hours of audio-visual clips per person and view. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ–ΌοΈ Images MEAD
TEAD 50,000 quadruples, each including text, emotion tags, Action Units, blend shape weights, and situation sentences. πŸ“ Text, πŸ–ΌοΈ Images -
JAFFE 213 images of 10 Japanese female models posing 7 facial expressions, annotated with average semantic ratings from 60 annotators. πŸ–ΌοΈ Images JAFFE
MMI Facial Expression Over 2900 videos and high-resolution still images of 75 subjects. πŸŽ₯ Video, πŸ–ΌοΈ Images, πŸ“ Text MMI
Multiface High-quality recordings of the faces of 13 identities. An average of 23,000 frames per subject; each frame includes roughly 160 different camera views. πŸ–ΌοΈ Images, πŸ”Š Audio, πŸ“‹ Tabular Data Multiface
ICT FaceKit 4,000 high-resolution facial scans of 79 subjects (34 female, 45 male) aged 18–67, plus 99 full-head scans and 26 expressions per subject. πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images ICT FaceKit
TikTok Dataset Over 300 single-person dance videos (10–15 seconds each), extracted at 30fps, yielding 100K+ frames. Includes segmented images and computed UV coordinates. πŸŽ₯ Video, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data TikTok Dataset
Everybody Dance Now Long single-dancer videos for training and evaluation; includes both self-filmed videos and short YouTube videos. πŸŽ₯ Video, πŸ“‹ Tabular Data Everybody Dance Now
Obama Weekly Footage 17 hours of video footage, nearly two million frames, spanning eight years. πŸŽ₯ Video, πŸ”Š Audio Obama Weekly Footage
VoxCeleb2 Over 1 million utterances from over 6,000 speakers, collected from YouTube videos with 61% male speakers. πŸ”Š Audio, πŸŽ₯ Video VoxCeleb2
BIWI Over 15K images of 20 people recorded with a Kinect while turning their heads around freely. πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data, πŸ“ Text BIWI
VOCASET About 29 minutes of high-fidelity 4D scans captured at 60fps, synchronized with audio; features 12 speakers with 40 sequences per subject (each sequence consists of English sentences lasting 3–5 seconds). πŸ”· 3D/Point Cloud Data, πŸ”Š Audio VOCASET
SHOW Contains SMPLX parameters of 4 persons reconstructed from videos; includes 88-frame motion clips for training and validation. πŸŽ₯ Video, πŸ”Š Audio, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data SHOW

πŸ€– Models

πŸŽ™ Speech-Driven & Multimodal Expression Generation

  • Joint Audio-Text Model for 3D Facial Animation πŸ”—
    Integrates a GPT-2-based text encoder with a dilated convolution audio encoder to improve upper-face expressiveness and lip synchronization. Lacks head and gaze control.

  • VOCA πŸ”—
    A speech-driven facial animation model used as a baseline for lip synchronization and expressiveness.

  • MeshTalk πŸ”—
    A model for speech-driven 3D facial animation, serving as a comparison baseline for upper-face motion and expressiveness.

  • CSTalk πŸ”—
    Employs a transformer-based encoder to capture correlations across facial regions, enhancing emotional speech-driven animation; limited to five discrete emotions.

  • ExpCLIP πŸ”—
    Aligns text, image, and expression embeddings via CLIP encoders, enabling expressive speech-driven facial animation from text/image prompts by leveraging the TEAD dataset and Expression Prompt Augmentation.

  • Style-Content Disentangled Expression Model πŸ”—
    Enhances personalization in facial animation by disentangling style and content representations, thereby improving identity retention and transition smoothness. (Compared to FaceFormer.)

  • FaceFormer πŸ”—
    A speech-driven facial animation model noted for its audio-visual synchronization, used as a baseline for comparison.

  • AdaMesh πŸ”—
    Introduces an Expression Adapter (MoLoRA-enhanced) and a Pose Adapter (retrieval-based) for personalized speech-driven facial animation, achieving improved expressiveness, diversity, and synchronization compared to models such as GeneFace and Imitator.

  • FaceXHuBERT πŸ”—
    Explores disentangling emotional expressiveness through multimodal representations as part of advanced speech-driven facial animation.

  • FaceDiffuser πŸ”—
    Utilizes stochastic approaches to enhance motion variability and disentangle emotional expressiveness in facial animation.

πŸ” Expression Retargeting & Motion Transfer

  • Neural Face Rigging (NFR) πŸ”—
    Automates 3D mesh rigging by encoding interpretable deformation parameters, enabling fine-grained facial expression transfer.

  • MagicPose πŸ”—
    Leverages diffusion models for 2D facial expression retargeting, balancing identity preservation and motion control through Multi-Source Attention and Pose ControlNet.

  • DiffSHEG πŸ”—
    Pioneers joint 3D facial expression and gesture synthesis with speech-driven alignment, employing Fast Out-Painting-based Partial Autoregressive Sampling (FOPPAS) for seamless, real-time motion generation.

  • DreamPose πŸ”—
    A baseline model for 2D facial expression retargeting used for comparison with MagicPose.

  • Disco πŸ”—
    Serves as a comparison baseline in 2D facial expression retargeting, noted for its identity retention and generalization capabilities.

  • TalkSHOW πŸ”—
    A speech-driven facial animation model referenced as a baseline for comparison with DiffSHEG.

  • LS3DCG πŸ”—
    A model for 3D facial expression and gesture synthesis used as a baseline when comparing motion realism and synchronization.

  • DiffuseStyleGesture πŸ”—
    Referenced as a baseline model for facial expression and gesture synthesis in comparison to DiffSHEG.


πŸ–Ό Image

Explores diffusion-based methods, VAEs, and other generative techniques to produce high-fidelity images and textures for animation backgrounds and elements.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
LAION-5B 5.85 billion CLIP-filtered image-text pairs πŸ–ΌοΈ Images, πŸ“ Text LAION-5B
LAION-400M 400M English (image, text) pairs πŸ–ΌοΈ Images, πŸ“ Text LAION-400M
LAION-Aesthetics v2 Image-text pairs grouped by predicted aesthetics score: 1.2B with score ≥4.5; 939M ≥4.75; 600M ≥5; 12M ≥6; 3M ≥6.25; 625K ≥6.5. πŸ–ΌοΈ Images, πŸ“ Text LAION-Aesthetics v2
Open Images V7 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives πŸ–ΌοΈ Images Open Images V7
COYO 747M image-text pairs πŸ–ΌοΈ Images, πŸ“ Text COYO
Conceptual Captions 3.3M images annotated with captions πŸ–ΌοΈ Images, πŸ“ Text Conceptual Captions
COCO 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; 250,000 people with keypoints. πŸ–ΌοΈ Images, πŸ“ Text COCO
ShareGPT 100K highly descriptive image-caption pairs πŸ–ΌοΈ Images, πŸ“ Text ShareGPT
ADE20K 20,210 training images; 2,000 validation images; 3,000 test images. πŸ–ΌοΈ Images ADE20K

πŸ€– Models

πŸ”§ Fine-Tuning & Regularization

  • Spectral Shift Fine-Tuning πŸ”—
    Introduces a compact parameter space called "spectral shift" for diffusion model fine-tuning. It reduces overfitting and storage inefficiency while achieving comparable or superior results in both single- and multi-subject generation. The method also employs the Cut Mix-Unmix data augmentation technique for improved multi-subject quality and acts as a regularizer enabling applications like single-image editing.

  • Control via Zero Convolutions (ControlNet) πŸ”—
    Addresses the limited spatial control of text-to-image models by locking large pre-trained diffusion models and reusing their deep encoding layers as a robust backbone. Connected via "zero convolutions" (zero-initialized convolution layers), this approach progressively grows parameters from zero to prevent harmful noise during fine-tuning, thereby facilitating diverse conditional controls.

βœ‚ Image Editing & Disentanglement

  • Lightweight Disentanglement for Image Editing πŸ”—
    Explores the inherent disentanglement properties of stable diffusion models. By partially replacing text embeddings from a style-neutral description with one that reflects the desired style, a lightweight algorithm (optimizing only 50 parameters) is introduced for improved style matching and content preservation, outperforming more complex fine-tuning baselines.

  • SmartEdit πŸ”—
    Frames image editing as a supervised learning problem by generating a paired training dataset of text editing instructions with before/after images. Built on the Stable Diffusion framework, it successfully handles challenging edits such as object replacement, seasonal changes, background modifications, and alterations of material attributes or artistic mediums.

  • Classifier-Free Guidance πŸ”—
    Employs a modified classifier-free guidance strategy in two ways: by introducing model-based classifier-free guidance and by planting a content "seed" early during denoising. Coupled with a patch-based fine-tuning strategy on latent diffusion models (LDMs), this approach enables generation at arbitrary resolutions while leveraging large pre-trained models (a minimal guidance-combination sketch follows this list).

  • Null Embedding Optimization for High-Fidelity Reconstructions πŸ”—
    Observes that DDIM inversion provides a good starting point but struggles with classifier-free guidance. By optimizing the unconditional null embedding used in classifier-free guidance, this method achieves high-fidelity reconstructions without additional tuning of the model or conditional embeddings, thereby preserving editing capabilities.

  • Unified Diffusion Model Editing Algorithm πŸ”—
    Follows a three-stage approach: (i) optimizing text embeddings to match a given image, (ii) fine-tuning diffusion models for improved image alignment, and (iii) linearly interpolating between optimized and target text embeddings. This unified algorithm enables precise editing of diffusion models, aiming to make them more responsible and beneficial.

  • Debiasing Text-to-Image Diffusion Models πŸ”—
    Enables targeted debiasing, removal of potentially copyrighted content, and moderation of offensive concepts using only text descriptions. This editing methodology can be applied to any linear projection layer by replacing pre-trained weights while preserving key concepts.
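
Several of the methods above rely on classifier-free guidance at sampling time. The sketch below is a minimal, model-agnostic illustration (the `model` signature and the embeddings are placeholders, not any specific paper's API): the conditional and unconditional noise predictions are blended with a guidance scale.

```python
# Classifier-free guidance: push the denoising direction toward the condition.
import torch

def cfg_noise(model, x_t, t, cond, null_cond, guidance_scale=7.5):
    eps_cond = model(x_t, t, cond)         # prediction with the text condition
    eps_uncond = model(x_t, t, null_cond)  # prediction with the "null" condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```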

πŸ‘½ Multimodal Conversations & Visual Understanding

  • AlignGPT πŸ”—
    Comprises a multimodal large language model (MLLM) for enhanced multimodal perception. An accompanying AlignerNet bridges the MLLM to the diffusion U-Net image decoder, enabling coherent integration of textual and visual information.

  • KOSMOS-G πŸ”—
    Offers seamless concept-level guidance from interleaved input to the image decoder. Serving as an alternative to CLIP, it facilitates effective image generation by guiding the diffusion process with interleaved multimodal cues.

  • MM-REACT πŸ”—
    Presents a unified approach that synergizes multimodal reasoning and action to tackle complex visual understanding tasks. Extensive zero-shot experiments demonstrate its capabilities in multi-image reasoning, multi-hop document understanding, and open-world concept comprehension.


πŸ‘€ Avatar

Reviews approaches for both 2D and 3D avatar creation, emphasizing lifelike digital representations with detailed facial expressions and body dynamics.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
WildAvatar Over 10,000 human subjects; extracted from YouTube; significantly richer than previous datasets for 3D human avatar creation πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data, πŸ”Š Audio WildAvatar
ZJU-MoCap Multi-camera system with 20+ synchronized cameras; includes SMPL-X parameters for detailed motion capture of body, hand, and face; complex actions such as twirling, Taichi, and punching πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data ZJU-MoCap
TalkSHOW 26.9 hours of in-the-wild talking videos from 4 speakers; expressive 3D whole-body meshes reconstructed at 30 fps, synchronized with audio at 22 kHz πŸ”Š Audio, πŸ”· 3D/Point Cloud Data TalkSHOW
HuMMan 1,000 human subjects, 400k sequences, 60M frames; include point clouds, SMPL parameters, and textured meshes for multimodal sensing πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data HuMMan
BUFF 6 subjects performing motions in two clothing styles; 13,632 3D scans with high-resolution ground-truth minimally-clothed shapes πŸ”· 3D/Point Cloud Data BUFF
AMASS Combines 15 motion capture datasets into a unified framework with over 42 hours of motion data; 346 subjects and 11,451 motions with SMPL pose parameters, 3D shape parameters, and soft-tissue coefficients πŸ”· 3D/Point Cloud Data AMASS
3DPW 60 video sequences with accurate 3D poses using video and IMU data; 18 re-poseable 3D body models with different clothing variations πŸŽ₯ Video, ⏱️ Time-Series Data, πŸ”· 3D/Point Cloud Data 3DPW
AIST++ 10,108,015 frames of 3D key points with corresponding images; 1,408 dance motion sequences spanning 10 dance genres with synchronized music πŸŽ₯ Video, πŸ”Š Audio, πŸ”· 3D/Point Cloud Data AIST++
RenderMe-360 Over 243 million head frames from 500 identities; includes FLAME parameters, UV maps, action units, textured meshes, and diverse annotations πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data RenderMe-360
PuzzleIOI 41 subjects with nearly 1,000 Outfit-of-the-Day (OOTD) configurations; includes paired ground-truth 3D body scans for challenging partial photos πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data, πŸ“ Text PuzzleIOI

πŸ€– Models

πŸ” CLIP-Guided Models

  • AvatarCLIP πŸ”—
    A zero-shot framework for generating and animating 3D avatars from natural language descriptions. It uses a shape VAE for initial geometry generation guided by CLIP and integrates NeuS for high-quality geometry and photorealistic rendering. In the motion phase, candidate poses are selected via CLIP and a motion VAE synthesizes smooth motions.

  • DreamField πŸ”—
    Adapts NeRF for text-driven 3D object generation. While it facilitates text-to-3D synthesis, it struggles with capturing detailed geometry.

  • Text2Mesh πŸ”—
    Stylizes existing meshes using CLIP guidance. It aims for text-driven mesh modifications but faces challenges with stability and flexibility when handling diverse text descriptions.

🧩 Implicit Function-Based Models

  • PIFu (Pixel-Aligned Implicit Function) πŸ”—
    Reconstructs detailed 3D surfaces from single-view 2D images by projecting 3D points into 2D space to extract pixel-aligned features via CNNs, which are then processed by an MLP for high-resolution surface reconstructions.

  • PIFuHD πŸ”—
    Enhances PIFu by incorporating multi-scale feature extraction, leading to improved global shape understanding and finer surface details.

  • ARCH (Animatable Reconstruction of Clothed Humans) πŸ”—
    Reconstructs detailed 3D models of clothed individuals from single RGB images. It transforms poses into a canonical space using a parametric body model and employs an implicit surface representation to capture fine details such as clothing folds.

  • ARCH++ πŸ”—
    An enhanced version of ARCH that refines geometry encoding and boosts clothing details to produce photorealistic, animatable avatars.

  • PaMIR (Parametric Model-Conditioned Implicit Representation) πŸ”—
    Combines a parametric SMPL body model with an implicit surface representation to reconstruct 3D humans from single RGB images. It uses a depth-ambiguity-aware loss and refines SMPL parameters during inference for better alignment.

  • TADA (Text to Animatable Dynamic Avatar) πŸ”—
    Generates high-fidelity, animatable 3D avatars directly from text prompts. It leverages an upsampled SMPL-X model and learnable displacements, optimizing geometry and texture via Score Distillation Sampling losses, with additional detail enhancement through partial mesh subdivision.

  • GETAvatar (Generative Textured Meshes for Animatable Human Avatars) πŸ”—
    Directly produces high-fidelity, explicitly textured 3D meshes. It represents human bodies using an articulated 3D mesh and generates a signed distance field (SDF) in canonical space, which is deformed to match the target shape and pose via SMPL-based transformations. A normal field trained on 3D scans enhances fine geometric details.

  • RodinHD πŸ”—
    Creates 3D avatars from a single portrait image by constructing a detailed 3D blueprint (triplane) that captures the avatar's shape, textures, and fine details. A shared neural decoder then converts this blueprint into an image, with a cascaded diffusion model generating new triplanes based on the portrait.

πŸŽ₯ NeRF-Based Methods

  • HumanNeRF πŸ”—
    Pioneers the use of deformation fields for dynamic human models from monocular images, enabling the mapping of points from observation to canonical space.

  • Neural Body πŸ”—
    Introduces structured latent codes anchored to SMPL model vertices, processed via SparseConvNet, to regularize dynamic human modeling.

  • Neural Human Performer πŸ”—
    Captures dynamic human information directly in the observation space using a skeletal feature bank and transformer modules.

  • Vid2Avatar πŸ”—
    Jointly models human subjects and scene backgrounds using two separate neural radiance fields, enhancing realism in avatar generation.

  • DreamHuman πŸ”—
    Generates animatable 3D human avatars from textual descriptions by combining NeRF with the imGHUM body model. It uses human body shape statistics for anatomical correctness and incorporates semantic zooming for detailed regions such as faces and hands.

🌈 Diffusion-Based Methods

  • Personalized Avatar Scene (PAS) πŸ”—
    Generates customized 3D avatars in various poses and scenes based on text descriptions. It employs a diffusion-based transformer to generate 3D body poses conditioned on text.

  • 3D Head Avatar via 3DMM & Diffusion πŸ”—
    Combines a parametric 3D Morphable Model of the head (using FLAME) with diffusion models to jointly optimize geometry and texture for generating 3D head avatars from text prompts.

  • Make-Your-Anchor πŸ”—
    Introduces a novel approach for generating 2D anchor-style avatars capable of realistic full-body motion and expression. It utilizes a Structure-Guided Diffusion Model (SGDM) to ensure coherent and expressive avatar generation.

πŸ”€ Hybrid Methods

  • DreamAvatar πŸ”—
    Integrates shape priors, diffusion models, and NeRF architecture within a dual-observation-space (DOS) framework. Leveraging SMPL for anatomical guidance and employing joint optimization with a specialized head-focused VSD loss (using ControlNet), it ensures structurally consistent avatars with controllable shape modifications. While it outperforms methods like DreamWaltz in geometric accuracy, it currently lacks animation capabilities and may inherit biases from pretrained diffusion models.

  • DreamWaltz πŸ”—
    Referenced as a comparative baseline, this model illustrates limitations in animation capabilities and inherited biases when compared to hybrid approaches like DreamAvatar.


🀝 Gesture

Examines methods for generating human-like gestures and co-speech movements, critical for interactive and immersive animations.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
IEMOCAP 151 recorded dialogue videos, with 2 speakers per session, totaling 302 videos. Annotated for 9 emotions and valence, arousal, and dominance. Contains approximately 12 hours of audiovisual data. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data IEMOCAP
SaGA 25 dialogues between interlocutors (50 total). Language: German. Published speakers: 6, unpublished speakers: 19. Annotated gestures: 1,764 (total corpus). Total video duration: 1 hour. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data SaGA
Creative-IT Data from 16 actors (male and female). Affective dyadic interactions range from 2 to 10 minutes each. Approximately 8 sessions of audiovisual data were released. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data CreativeIT
CMU Panoptic 3D facial landmarks from 65 sequences (5.5 hours). Contains 1.5 million 3D skeletons. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text CMU Panoptic
Speech-Gesture A 144-hour dataset featuring 10 speakers. Includes frame-by-frame, automatically detected pose annotations. πŸŽ₯ Video, πŸ”Š Audio Speech-Gesture
Talking With Hands 16.2M 16.2 million frames (50 hours) of two-person, face-to-face spontaneous conversations. Strong covariance in arm and hand features. πŸŽ₯ Video, πŸ”Š Audio Talking With Hands
PATS 25 speakers, 251 hours of data, approximately 84,000 intervals. Mean interval length: 10.7 seconds. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text PATS
Trinity Speech-Gesture II 244 minutes of motion capture and audio (23 takes). Includes one male native English speaker. The skeleton consists of 69 joints. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data Trinity
SaGA++ 25 recordings, totaling 4 hours of data. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data SaGA++
ZEGGS 67 monologue sequences with 19 different motion styles. Performed by a female actor speaking in English. Total duration: 134.65 minutes. πŸŽ₯ Video, πŸ”Š Audio ZEGGS
BEAT 76 hours of 3D motion capture data from 30 speakers. Covers 8 emotions and 4 languages. Includes 32 million frame-level emotion and semantic relevance annotations. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data BEAT
BEAT2 60 hours of mesh-level, motion-captured co-speech gesture data. Integrates SMPL-X body and FLAME head parameters. Enhances modeling of head, neck, and finger movements. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data BEAT2
GAMT 176 video clips of volunteers using math terms and gestures. Covers 8 classes of mathematical terms and gestures. πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text GAMT
SeG 208 types of global semantic gestures. 544 motion files recorded from a male performer. Each gesture is represented in 2.6 variations on average. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data SeG
DND Group Gesture 6 hours of gesture data from 5 individuals playing Dungeons & Dragons. Recorded over 4 sessions (total duration: 6 hours). Includes beat, iconic, deictic, and metaphoric gestures. πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data DND

πŸ€– Models

πŸ›  Traditional & Parametric Approaches

  • Parameter-Based Procedural Animation πŸ”—
    Uses high-level control parameters (e.g., emotion, speech intensity, rhythm) to select and interpolate predefined keyframes, yielding smooth and coherent gesture sequences.

  • Blendshape Models πŸ”—
    Generates detailed hand and finger gestures by blending a set of predefined base shapes using weighted interpolation, enabling fine-grained control and smooth transitions.

🧠 Deep Learning-Based Models

  • GestureGAN πŸ”—
    Employs a GAN-based generator-discriminator framework to synthesize realistic gesture sequences conditioned on audio inputs, capturing dynamic hand gestures effectively.

  • Speech2Gesture πŸ”—
    Generates co-speech gestures directly from speech features using LSTM/RNN architectures, effectively modeling temporal dependencies between speech and gesture.

  • StyleGestures πŸ”—
    Utilizes an encoder-decoder architecture with style tokens and Transformers to capture individual speaker styles, enabling personalized gesture synthesis.

  • Audio-Driven Adversarial Gesture Generation πŸ”—
    Combines GANs with Conditional Variational Autoencoders (CVAE) to align audio and motion features in a shared latent space, resulting in nuanced, audio-driven gestures.

  • GestureDiffuCLIP πŸ”—
    Leverages a diffusion process guided by CLIP for semantic alignment to iteratively refine gesture sequences, producing highly expressive gestures.

  • ZeroEGGS πŸ”—
    Implements a zero-shot paradigm for generating gestures based solely on speech, using example-based learning to generalize across unseen gestural styles.

  • GestureMaster πŸ”—
    Utilizes a graph neural network (GNN) framework to model spatial and temporal dependencies in gesture sequences, enhancing naturalistic hand and body gesture synthesis.

  • ExpressGesture πŸ”—
    Integrates emotion recognition with gesture generation pipelines, creating gestures that reflect both the content of speech and underlying sentiment.

  • MocapNET πŸ”—
    Bridges traditional motion capture with neural synthesis by combining 2D pose estimation and 3D gesture reconstruction using multimodal motion capture datasets.

  • CSMP πŸ”—
    A diffusion-based co-speech gesture generation model that leverages joint text and audio representations to capture intricate inter-modal relationships.

  • ZS-MSTM πŸ”—
    Introduces a zero-shot style transfer method for gesture animation through adversarial disentanglement, separating style and content features for effective style transfer across speakers.

πŸš€ Transformer-Based Models

  • Gesticulator πŸ”—
    Employs a multimodal Transformer architecture to generate contextually relevant gestures conditioned on both text and audio inputs, aligning with co-speech dynamics.

  • Mix-StAGE πŸ”—
    Uses an attention-based encoder-decoder with a style encoder and mixed spatial-temporal attention mechanisms to capture dynamic, expressive gestures.

  • SAGA (Style and Grammar-Aware Gesture Generation) πŸ”—
    Combines an LSTM-based encoder-decoder with a Transformer-based grammar encoder to align gestures accurately with linguistic content, integrating both style and grammatical cues.

  • Cross-Modal Transformer πŸ”—
    Leverages cross-attention mechanisms to fuse diverse modalities (text, audio, video), enhancing the coherence and contextual alignment of generated gestures.

  • DiM-Gesture πŸ”—
    Introduces an adaptive layer normalization mechanism (Mamba-2) to adjust to different speakers, focusing on generating realistic co-speech gestures from audio.

  • AMUSE πŸ”—
    Utilizes a disentangled latent diffusion technique to separate emotional expressions from gestures, enabling control over emotional aspects via a multi-stage training pipeline.

  • FreeTalker πŸ”—
    Employs a diffusion-based framework with classifier-free guidance and a generative prior (DoubleTake) to produce natural transitions between gesture clips, extending beyond co-speech gestures.

  • CoCoGesture πŸ”—
    Addresses long-sequence gesture generation with a Transformer-based diffusion model that uses a large dataset (GES-X) and a mixture-of-experts framework to effectively align gestures with human speech.

  • DiffuseStyleGestures πŸ”—
    Integrates audio, text, speaker IDs, and seed gestures within a diffusion-based approach to produce stylistically diverse co-speech gesture outputs.

  • DiffuseStyleGesture+ πŸ”—
    Builds upon DiffuseStyleGestures by further refining gesture synthesis through advanced multimodal integration and specialized attention mechanisms for personalized outputs.

  • ViTPose πŸ”—
    Applies Vision Transformers to human pose estimation, providing a robust foundation for gesture synthesis by accurately capturing pose dynamics.

  • Gesture Motion Graphs πŸ”—
    Utilizes graph-based modeling for few-shot gesture reenactment, effectively representing motion sequences and their dependencies.

  • DiffSHEG πŸ”—
    Adopts a diffusion-based approach for real-time speech-driven 3D expression and gesture generation, leveraging joint text and audio representations for coherent outputs.

  • C2G2 πŸ”—
    Emphasizes controllability in co-speech gesture generation by using modular components to handle different aspects of gesture synthesis.

  • DiffuGesture πŸ”—
    Focuses on generating gestures for two-person dialogues with specialized diffusion techniques tailored for interactive and conversational settings.


πŸŽ₯ Motion

Highlights text-constrained motion generation techniques, including MotionGPT and diffusion frameworks, for creating smooth and realistic animation sequences.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
Motion-X++ 19.5 million 3D poses across 120,500 sequences, synchronized with 80,800 RGB videos and 45,300 audio tracks. Annotated with free-form text descriptions. πŸ”· 3D/Point Cloud Data, πŸ“ Text, πŸ”Š Audio, πŸŽ₯ Video Motion-X++
HumanMM (ms-Motion) 120 long-sequence 3D motions reconstructed from 500 in-the-wild multi-shot videos, totaling 60 hours of data. Includes rare interactions. πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video HumanMM
Multimodal Anatomical Motion 51,051 annotated poses with 53 anatomical landmarks, captured across 48 virtual camera views per pose. Includes 2,000+ pathological motion variations. πŸ”· 3D/Point Cloud Data, πŸ“ Text -
AMASS 11,265 motion clips aggregated from 15 mocap datasets (e.g., CMU, KIT), totaling 43 hours of motion data in SMPL format. Covers 100+ action categories. πŸ”· 3D/Point Cloud Data AMASS
HumanML3D 14,616 motion sequences (28.6 hours) paired with 44,970 free-form text descriptions spanning 200+ action categories. πŸ”· 3D/Point Cloud Data, πŸ“ Text HumanML3D
BABEL 43 hours of motion data from AMASS, annotated with 250+ verb-centric action classes across 13,220 sequences. Includes temporal action boundaries. πŸ”· 3D/Point Cloud Data, πŸ“ Text BABEL
AIST++ 1,408 dance sequences (10.1 million frames) captured from 9 camera views, totaling 15 hours of multi-view RGB video data. πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video AIST++
3DPW 60 sequences (51,000 frames) captured in diverse indoor/outdoor environments, featuring challenging poses and natural object interactions. πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video 3DPW
PROX 20 subjects performing 12 interactive scenarios in 3D scenes, including 180 annotated RGB frames for scene-aware motion analysis. πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images PROX
KIT-ML 3,911 motion clips (11.23 hours) with 6,278 natural language annotations containing 52,903 words, stored in BVH/FBX formats. πŸ”· 3D/Point Cloud Data, πŸ“ Text KIT-ML

πŸ€– Models

πŸ”€ Language-to-Pose Models

  • Language2Pose πŸ”—
    Generates 3D human poses directly from natural language. It employs two encoders (for text and 3D motion) that map inputs into a joint embedding space, and a decoder that produces a fixed-length motion sequence by minimizing the distance between corresponding text and motion embeddings.

  • MotionClip πŸ”—
    An end-to-end pipeline for motion generation based on an encoder-decoder transformer. The model extracts a high-level motion representation and uses multiple loss functions (comparing joint orientations, velocities, and image-text groundings via CLIP) to enhance motion quality.

πŸ“¦ Variational Auto-Encoder (VAE) Based Models

  • ACTOR πŸ”—
    Generates diverse and realistic 3D human motions conditioned on action labels. This approach uses a transformer-based VAE to encode actions and poses into a Gaussian latent space, allowing sampling of varied motions for the same action prompt.

  • TEMOS (Text-To-Motions) πŸ”—
    A text-conditioned generative model that uses a transformer-based VAE architecture with two symmetric encoders (for motion and text). It learns a diverse latent space by aligning text and pose embeddings to generate meaningful SMPL body motions.

  • Teach πŸ”—
    Transforms sequences of text descriptions into SMPL body motions. The model operates non-autoregressively within individual actions and autoregressively across action sequences by leveraging a past-conditioned text encoder that combines historical motion features with current text input.

  • Generating Diverse and Natural 3D Human Motions from Text πŸ”—
    Generates 3D human motions from textual descriptions by first pre-training an auto-encoder (using convolutional and deconvolutional layers) and then utilizing a temporal VAE with three recurrent networks (prior, posterior, and generator) to produce motion snippets.

  • TMR πŸ”—
    Enhances transformer-based text-to-motion generation by mapping motion and text embeddings into a joint space. Dual transformer encoders are used for each modality, and cosine similarity between embeddings is maximized for positive pairs while filtering out negatives via MPNet similarity.

πŸ— VQ-VAE Based Models

  • T2M GPT πŸ”—
    The first model to apply VQ-VAE to motion generation. It learns a discrete representation (codebook) of motion and formulates motion generation as an autoregressive token prediction task, conditioned on text encoded by CLIP (a minimal quantization sketch follows this list).

  • DiverseMotion πŸ”—
    Builds upon T2M GPT by discarding the autoregressive generation in favor of a diffusion process to diversify motion outputs. It employs CLIP for text encoding and Hierarchy Semantic Aggregation (HSA) to generate a richer holistic text embedding.

  • MoMask πŸ”—
    Uses a hierarchical VQ-VAE to quantize motion sequences into discrete tokens over multiple layers. A masked transformer predicts missing tokens (similar to BERT), and a residual transformer refines these predictions to incorporate fine motion details.

  • T2LM Long-Term 3D Human Motion πŸ”—
    Transforms sequences of text descriptions into 3D motion sequences using a 1D-convolutional VQ-VAE and a transformer-based text encoder. This method generates smooth transitions between actions, outperforming earlier techniques like TEACH.

  • MotionGPT πŸ”—
    Generates human motion from text by leveraging a pre-trained motion VQ-VAE alongside a large language model (LLM). The LLM is fine-tuned with LoRA to generate motion tokens that the VQ-VAE decoder transforms into motion sequences, significantly speeding up training.
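
The models above share the vector-quantization step that turns continuous motion latents into discrete tokens a transformer can predict. Below is a minimal, generic sketch of that step with a straight-through gradient estimator; the tensor shapes and codebook size are illustrative.

```python
# Snap each encoder latent to its nearest codebook entry (VQ-VAE quantization).
import torch

def quantize(latents, codebook):
    """latents: (T, D) encoder outputs; codebook: (K, D) learned code vectors."""
    dists = torch.cdist(latents, codebook)   # (T, K) pairwise distances
    tokens = dists.argmin(dim=1)             # discrete motion token ids
    quantized = codebook[tokens]             # nearest codebook vectors
    # Straight-through estimator: gradients flow to the encoder as if the
    # quantization step were the identity.
    quantized = latents + (quantized - latents).detach()
    return tokens, quantized

# Toy usage: 16 latent frames, 64-dim latents, a codebook of 512 entries.
tokens, q = quantize(torch.randn(16, 64), torch.randn(512, 64))
print(tokens.shape, q.shape)  # torch.Size([16]) torch.Size([16, 64])
```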

🌈 Diffusion-Based Models

  • Flame πŸ”—
    Employs a transformer-based motion decoder within a diffusion framework. It conditions on text using cross-attention (with text embeddings from RoBERTa) and incorporates special tokens for motion length and diffusion time steps. The model is optimized with a hybrid loss combining diffusion noise loss and a variational lower bound loss, with classifier-free guidance during inference.

  • MotionDiffuse πŸ”—
    Similar to Flame but with slight architectural variations: it selects a random diffusion time step and divides the motion sequence into sub-intervals for time-varying conditioning. It utilizes efficient attention modules and optimizes using mean squared error on the noise prediction.

  • HMDM πŸ”—
    A diffusion-based model with a fixed motion sequence length that leverages CLIP's text encoder. It introduces additional loss functions (e.g., position, foot, and velocity losses) defined on the reconstructed motion signal, rather than just the noise, to improve temporal consistency and motion fidelity.

  • Make-An-Animation πŸ”—
    Proposes a two-stage diffusion framework for text-to-3D motion generation. The model pre-trains on a large-scale static pose dataset using a UNet backbone and T5 text encoder, then fine-tunes on motion datasets, generating the entire motion sequence concurrently for improved smoothness.

  • GMD (Guided Motion Diffusion) πŸ”—
    Focuses on incorporating spatial (trajectory) constraints into the diffusion process. The method uses a two-stage pipeline that first emphasizes ground location guidance and then propagates sparse guidance gradients across neighboring frames to enhance overall motion consistency.

  • OmniControl πŸ”—
    Extends spatial guidance by cumulatively summing relative pelvis locations to infer global positions. It also introduces realism guidance, propagating control signals from keyframes and the pelvis to other joints for coherent, natural motion generation.


πŸ“¦ Object

Discusses approaches for text-to-3D object generation, such as Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting, to create realistic assets.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
ShapeNet 3D models in categories like furniture and vehicles. πŸ”· 3D/Point Cloud Data, πŸ“ Text ShapeNet
BuildingNet Architectural structures for shape completion tasks. πŸ”· 3D/Point Cloud Data, πŸ“ Text BuildingNet
Text2Shape Textual descriptions linked to ShapeNet categories. πŸ“ Text, πŸ”· 3D/Point Cloud Data Text2Shape
ShapeGlot Textual utterances describing differences between shapes. πŸ“ Text, πŸ”· 3D/Point Cloud Data ShapeGlot
Pix3D 3D models aligned with real-world images for evaluation. πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data Pix3D
LAION-5B Large-scale dataset with 5 billion image-text pairs. πŸ–ΌοΈ Images, πŸ“ Text LAION-5B
COCO-Stuff Annotated images for real-world 3D synthesis. πŸ–ΌοΈ Images, πŸ“ Text COCO-Stuff
Flickr30K Image dataset with diverse textual descriptions. πŸ–ΌοΈ Images, πŸ“ Text Flickr30K
ModelNet40 3D CAD models across 40 object categories. πŸ”· 3D/Point Cloud Data ModelNet40
ShapeNetCore Subset of ShapeNet with detailed object models. πŸ”· 3D/Point Cloud Data, πŸ“ Text ShapeNetCore
BlendSwap Realistic 3D models with physically based rendering (PBR). πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images BlendSwap
InstructPix2Pix Dataset for instruction-driven image modifications. πŸ–ΌοΈ Images, πŸ“ Text InstructPix2Pix
MagicBrush Dataset for refining texture and appearance in 3D. πŸ–ΌοΈ Images MagicBrush
NeRF-Synthetic 2D images rendered from synthetic 3D scenes. πŸ–ΌοΈ Images NeRF-Synthetic
ScanNet 2.5M RGB-D views with semantic segmentations and camera poses. πŸ–ΌοΈ Images, πŸ“ Text ScanNet
Matterport3D 10,800 panoramic views from 90 building-scale scenes. πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data, πŸ“ Text Matterport3D

πŸ€– Models


🧡 Texture

Focuses on methods for generating detailed surface textures that enhance the realism of 3D models, including text-guided synthesis and neural rendering techniques.

πŸ—‚ Datasets

🏷️ Name πŸ“Š Statistics πŸ” Modalities πŸ”— Link
3D-FUTURE 9,992 detailed 3D furniture models with high-res textures; 20,240 synthetic images across 5,000 scenes. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture, πŸ–ΌοΈ 2D Images 3D-FUTURE
Objaverse Over 800K textured 3D models with natural-language descriptions across diverse categories. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture, πŸ“ Language Objaverse
ShapeNet Large-scale structured 3D meshes (incl. 300 car models) used for texture benchmarking. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture ShapeNet
ShapeNetSem Semantic extension of ShapeNet with 445 annotated meshes for structure-aware evaluation. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture ShapeNetSem
ModelNet40 40-category CAD benchmark for generalization testing in geometry-aware texture generation. πŸ”· 3D Geometry ModelNet40
Sketchfab Repository of commercial and scanned 3D models for qualitative texture evaluation. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture Sketchfab
CGTrader High-res 3D assets for mesh diversity in text-driven synthesis. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture CGTrader
TurboSquid Commercial dataset of detailed assets and fine-surface textures for high-fidelity evaluations. πŸ”· 3D Geometry, πŸ–ΌοΈ Texture TurboSquid
RenderPeople High-quality human scans with detailed anatomy and surface properties for text-to-texture testing. πŸ”· 3D Scans RenderPeople
Tripleganger Scanned high-fidelity human models for evaluating facial and clothing texture realism. πŸ”· 3D Scans Tripleganger
Stanford 3D Scans High-resolution object scans for generalization tests on real-world geometries. πŸ”· 3D Scans Stanford 3D Scans
ElBa 30K synthetic texture images with 3M texel-level annotations for element-based analysis. πŸ–ΌοΈ 2D Texture, πŸ“ Attributes & Layout ElBa

πŸ€– Models

  • CLIP-Pseudo Inpainting πŸ”—
    Pioneering masked-inpainting pipeline using CLIP pseudo-captioning to semantically align 2D renderings with 3D geometry without paired text data.
    arXiv:2303.13273

  • Text2Tex πŸ”—
    Two-stage diffusion: Stage I generates initial textures via depth-to-image denoising; Stage II back-projects and refines them in UV space by selecting extra views to correct artifacts.
    arXiv:2303.11396

  • TEXTure πŸ”—
    Inpainting-based diffusion with trimap-based surface partitioning into generate, refine, and keep regions, ensuring smooth transitions and efficient passes.
    Project page

  • Paint-it πŸ”—
    Integrates PBR rendering and U-Net reparameterization with CLIP-guided Score Distillation Sampling for high-fidelity mesh texturing, at the cost of per-model optimization time.
    arXiv:2312.11360

  • Point-UV Diffusion πŸ”—
    Coarse-to-fine pipeline: initial mesh-surface painting then 2D UV diffusion refinement, decoupling global structure generation from fine-detail synthesis.
    ICCV 2023 paper

  • TexPainter πŸ”—
    Latent diffusion in color-space embeddings using depth-conditioned DDIM sampling across fixed viewpoints, aggregated into a unified texture map.
    arXiv:2406.18539

  • TexFusion πŸ”—
    Sequential Interlaced Multiview Sampler fuses multi-view latent features during diffusion, reducing inference time while preserving cross-view coherence.
    arXiv:2310.13772

  • GenesisTex πŸ”—
    Cross-view attention during diffusion followed by Img2Img post-processing to eliminate seams and enhance surface detail in UV maps.
    arXiv:2403.17782

  • Consistency² πŸ”—
    Latent Consistency Models that achieve fast, multi-view coherent textures with just four denoising steps, disentangling noise and color paths.
    arXiv:2406.11202

  • Meta 3D TextureGen πŸ”—
    Two-stage: geometry-aware diffusion produces multi-view images; incidence-aware UV inpainting and patch upscaling yield seamless 4K textures.
    Meta Research

  • VCD-Texture πŸ”—
    Variance Alignment with joint noise prediction and multi-view aggregation modules to maintain statistical feature consistency across views.
    arXiv:2407.04461


πŸ”— Citations

If you find our paper or repository useful, please cite the paper:

@article{abootorabi2025generative,
  title={Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions},
  author={Abootorabi, Mohammad Mahdi and Ghahroodi, Omid and Zahraei, Pardis Sadat and Behzadasl, Hossein and Mirrokni, Alireza and Salimipanah, Mobina and Rasouli, Arash and Behzadipour, Bahar and Azarnoush, Sara and Maleki, Benyamin and others},
  journal={arXiv preprint arXiv:2504.19056},
  year={2025}
}

πŸ“§ Contact

If you have questions, please send an email to mahdi.abootorabi2@gmail.com.

⭐ Star History

Star History Chart
