For the trajectory guidance component, please see my trajectory_guidance repository.
This project tackles the challenge of generating detailed, context-rich 3D human motions from textual descriptions that go beyond standard training data. The framework integrates a multi-agent system—powered by large language models and a vision-language module—to segment, synthesize, and refine motion outputs in an iterative loop. By employing a mask-transformer architecture with body-part-specific encoders and codebooks, we achieve granular control over both short and extended motion sequences. After initial generation, an automated review process uses video-based captioning to identify discrepancies and generate corrective instructions, allowing each body region to be adjusted accurately. Experimental results on the HumanML3D benchmark demonstrate that this approach not only attains competitive performance against recent methods but also excels at handling long-form prompts and multi-step motion compositions. Comprehensive user studies further indicate significant improvements in realism and fidelity for complex scenarios.
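To make the generate–review–refine loop concrete, here is a minimal Python sketch of the pipeline described above. Every name in it (`Motion`, `segment_prompt`, `synthesize`, `caption_video`, `plan_corrections`, `refine`) is a hypothetical placeholder rather than this repository's actual API; the stubs only illustrate how the agents pass information to each other under those assumptions.

```python
# Hypothetical sketch of the iterative generate -> caption -> correct loop.
# All names are placeholders, not the repository's real modules.

from dataclasses import dataclass, field


@dataclass
class Motion:
    """Dummy stand-in for a generated motion sequence."""
    description: str
    refined_parts: list = field(default_factory=list)


def segment_prompt(prompt: str) -> list[str]:
    # LLM agent: split a long prompt into per-step sub-descriptions (stubbed).
    return [s.strip() for s in prompt.split(",") if s.strip()]


def synthesize(sub_prompts: list[str]) -> Motion:
    # Mask-transformer generator with body-part codebooks (stubbed).
    return Motion(description="; ".join(sub_prompts))


def caption_video(motion: Motion) -> str:
    # Vision-language module: caption a rendering of the motion (stubbed).
    return motion.description


def plan_corrections(prompt: str, caption: str) -> list[str]:
    # LLM reviewer: list body regions whose motion contradicts the prompt (stubbed).
    return [] if caption else ["right_arm"]


def refine(motion: Motion, corrections: list[str]) -> Motion:
    # Re-sample only the flagged body parts via masked generation (stubbed).
    motion.refined_parts.extend(corrections)
    return motion


def generate_motion(prompt: str, max_rounds: int = 3) -> Motion:
    motion = synthesize(segment_prompt(prompt))
    for _ in range(max_rounds):
        corrections = plan_corrections(prompt, caption_video(motion))
        if not corrections:
            break  # caption agrees with the prompt: stop refining
        motion = refine(motion, corrections)
    return motion


if __name__ == "__main__":
    print(generate_motion("A person kicks, runs forward, and avoids obstacles"))
```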
A person performs Bruce Lee's classic kicks, then runs forward with the right arm extended, trying to avoid sphere obstacles in his way.
Bruce.Lee.s.classic.kicks.mp4
A woman picks up speed from a walk to a run while holding a T-pose.
T-pose.mp4
A person sits on the floor with hands resting on their knees, then reaches forward with their right arm, trying to grab something.
grab.mp4
An angry midfielder performs a slide tackle on another player.
slide.tackle.mp4
An illustrative example of the workflow:
We sincerely thank the authors of the following open-source works, on which our code is based: momask-codes, deep-motion-editing, Muse, vector-quantize-pytorch, T2M-GPT, MDM, and MLD.