We are the Alibaba AMAP CV Lab, focused on cutting-edge research and innovative applications in computer vision. We build core technological capabilities for the spatiotemporal internet and are a key technology practitioner in Alibaba's spatial intelligence initiatives. Working at the intersection of the physical and digital worlds, we empower smart mobility and daily life with AI. As a core technical driver within Amap, our team pioneers:
- Next-Generation 3D Map Engines
- Multimodal Understanding & Generation
- Spatial Intelligence
- World Modeling
We welcome contributions, issues, and feedback!
Feel free to ⭐ the repos below to stay updated.
- 🎉 Jul 05, 2025 – Our paper FantasyTalking is accepted by ACM MM 2025.
- 🎉 Jun 26, 2025 – Our paper SeqGrowGraph is accepted by ICCV 2025.
- 📢 May 23, 2025 – We released the full project of FSDrive.
- 🎉 Apr 29, 2025 – Our paper G3PT is accepted by IJCAI 2025.
- 📢 Apr 28, 2025 – We released the inference code and model weights of FantasyTalking.
- 📢 Apr 24, 2025 – We released the inference code and model weights of FantasyID.
Next-generation engine for real-time rendering and updating of large-scale 3D maps with high accuracy.
A generative framework that reframes lane network learning as a process of incrementally building an adjacency matrix.
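To make the adjacency-matrix view concrete, here is a minimal, hypothetical sketch of the idea (not SeqGrowGraph's actual code; `grow_adjacency` and the stand-in decoder `predict_edges` are ours): the lane graph is emitted vertex by vertex, and each step appends one row and one column to the adjacency matrix.

```python
import numpy as np

def grow_adjacency(vertices, predict_edges):
    """vertices: sequence of 2D lane points in generation order;
    predict_edges(v, placed): stand-in for the learned decoder, returning
    a 0/1 vector saying which already-placed vertices v connects to."""
    adj, placed = np.zeros((0, 0), dtype=np.int8), []
    for v in vertices:
        n = len(placed)
        grown = np.zeros((n + 1, n + 1), dtype=np.int8)
        grown[:n, :n] = adj                          # previous topology is kept as-is
        if n:
            grown[n, :n] = predict_edges(v, placed)  # new row = edges from the new vertex
        adj = grown
        placed.append(v)
    return adj, placed

# Toy decoder: connect each new vertex to its nearest already-placed vertex.
def nearest_edge(v, placed):
    d = [np.hypot(v[0] - p[0], v[1] - p[1]) for p in placed]
    edges = [0] * len(placed)
    edges[int(np.argmin(d))] = 1
    return edges

adj, _ = grow_adjacency([(0, 0), (1, 0.1), (2, 0)], nearest_edge)
print(adj)  # 3x3 directed adjacency matrix, built one row/column per step
```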
📄 Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Benchmark and multi-modal approach for integrating lane-level traffic sign regulations into vectorized HD maps.
Framework for spatial reasoning and path planning in autonomous navigation and robotics.
The first VLA for visual reasoning in autonomous driving; it proposes a spatio-temporal CoT that lets the model think visually about trajectory planning, and unifies visual generation and understanding with minimal data.
Toolkit for unified understanding and generation across text, image, video, audio, and spatial data.
The first native 3D generation foundation model based on next-scale autoregression.
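For intuition, here is a hedged sketch of next-scale autoregression (our reading of the general technique; `next_scale_generate`, the `scales` tuple, and the toy stand-in `model` are illustrative assumptions, not G3PT's API): rather than predicting tokens one by one, the model predicts an entire coarse-to-fine pyramid, each scale conditioned on all previously generated scales.

```python
import torch

def next_scale_generate(model, scales=(1, 2, 4, 8), dim=16):
    """model(context, s): stand-in for the transformer; given the tokens of
    all coarser scales, it predicts every token of the s x s x s scale at once."""
    generated = []  # list of (s**3, dim) token grids, coarse to fine
    for s in scales:
        context = torch.cat(generated, dim=0) if generated else torch.zeros(0, dim)
        generated.append(model(context, s))  # one forward pass per scale, not per token
    return generated

# Toy stand-in model: ignores the context and emits random tokens of the right size.
toy = lambda ctx, s: torch.randn(s ** 3, 16)
pyramid = next_scale_generate(toy, scales=(1, 2, 4), dim=16)
print([tuple(t.shape) for t in pyramid])  # [(1, 16), (8, 16), (64, 16)]
```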
Performance analysis and novel solution exploration for speech recognition under synthetic speech interference.
The human-centric AIGC model family, with more models coming soon. Please check out our Fantasy AIGC Family for more details.
🤡 FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers
A novel expression-driven video generation method that pairs emotion-enhanced learning with masked cross-attention, enabling high-quality, richly expressive animation for both single- and multi-portrait scenarios (see the masked cross-attention sketch after the Fantasy family entries).
The first Wan-based high-fidelity audio-driven avatar system that synchronizes facial expressions, lip motion, and body gestures in dynamic scenes.
A tuning-free text-to-video model that leverages 3D facial priors, multi-view augmentation, and layer-aware guidance injection to deliver dynamic, identity-preserving video generation.
The first dataset for automatic rigging of 3D-generated digital humans, together with a transformer-based end-to-end automatic rigging algorithm.
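As referenced in the FantasyPortrait entry above, here is a minimal, assumed sketch of masked cross-attention for multi-portrait animation (the function and the `region_ids` / `char_ids` naming are ours, not the released code): each spatial region of the video latents may only attend to the expression tokens of the character it belongs to, which keeps one character's expressions from leaking onto another's face.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, region_ids, char_ids):
    """q: (Nq, d) video-latent queries; k, v: (Nk, d) expression features;
    region_ids: (Nq,) character index of each latent position;
    char_ids: (Nk,) character index each expression token drives."""
    scores = q @ k.t() / q.shape[-1] ** 0.5               # (Nq, Nk) scaled dot-product logits
    allowed = region_ids.unsqueeze(1) == char_ids.unsqueeze(0)
    scores = scores.masked_fill(~allowed, float("-inf"))  # block cross-character attention
    # Assumes every region has at least one matching expression token;
    # otherwise its softmax row would be all -inf.
    return F.softmax(scores, dim=-1) @ v

# Toy usage: two characters, three latent positions and two expression tokens each.
q, k, v = torch.randn(6, 8), torch.randn(4, 8), torch.randn(4, 8)
out = masked_cross_attention(q, k, v,
                             region_ids=torch.tensor([0, 0, 0, 1, 1, 1]),
                             char_ids=torch.tensor([0, 0, 1, 1]))
print(out.shape)  # torch.Size([6, 8])
```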
Platform for constructing and querying dynamic digital twins of real-world environments.
🚀 Coming soon
An optimization scheme for a proprietary HPE task in DMS scenarios that combines a pose-wise hard-mining strategy for distribution balance with an online keypoint-aligned Grad-CAM loss constraining activations to semantic regions.
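Below is a rough, assumed sketch of what such a keypoint-aligned Grad-CAM loss can look like (our interpretation of the description; `keypoint_aligned_gradcam_loss` and the binary mask are illustrative, not the shipped loss): a Grad-CAM map is computed for the pose output, and any activation mass falling outside the keypoint regions is penalized.

```python
import torch

def keypoint_aligned_gradcam_loss(features, output, keypoint_mask):
    """features: (C, H, W) activations of a chosen layer (must require grad);
    output: scalar pose prediction to explain; keypoint_mask: (H, W) in [0, 1],
    1 near a ground-truth keypoint and 0 elsewhere."""
    grads = torch.autograd.grad(output, features, create_graph=True)[0]
    weights = grads.mean(dim=(1, 2))                           # channel importance, as in Grad-CAM
    cam = torch.relu((weights[:, None, None] * features).sum(0))
    cam = cam / (cam.max() + 1e-8)                             # normalize map to [0, 1]
    return (cam * (1.0 - keypoint_mask)).mean()                # penalize activation off the keypoints

# Toy usage: a box mask standing in for a keypoint-derived mask.
feats = torch.randn(8, 16, 16, requires_grad=True)
out = feats.sum()                       # stand-in for a differentiable pose-head output
mask = torch.zeros(16, 16)
mask[6:10, 6:10] = 1.0
print(keypoint_aligned_gradcam_loss(feats, out, mask))
```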
A self-correction mean-teacher architecture that mitigates the impact of noisy pseudo-labels, offering a new approach to semi-supervised object detection.
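For reference, a hedged sketch of the general mean-teacher recipe with a simple confidence-based correction step (illustrative only; the paper's actual self-correction mechanism may differ): the teacher is an exponential moving average of the student, and teacher pseudo-labels below a confidence threshold are discarded instead of being used to train the student.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student's.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def pseudo_label_step(student, teacher, unlabeled, conf_thresh=0.9):
    with torch.no_grad():
        probs = torch.softmax(teacher(unlabeled), dim=-1)  # teacher predictions
        conf, labels = probs.max(dim=-1)
        keep = conf > conf_thresh                          # drop low-confidence (noisy) pseudo-labels
    if keep.sum() == 0:
        return torch.zeros((), requires_grad=True)         # nothing confident enough this batch
    return torch.nn.functional.cross_entropy(student(unlabeled[keep]), labels[keep])

# Toy usage with a linear classifier standing in for the detector.
student = torch.nn.Linear(16, 3)
teacher = copy.deepcopy(student).requires_grad_(False)
loss = pseudo_label_step(student, teacher, torch.randn(32, 16), conf_thresh=0.5)
loss.backward()
ema_update(teacher, student)
```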