@amap-cvlab

Alibaba AMAP CV Lab

Leading computer vision innovation, powering smart mobility and AI-driven spatiotemporal technologies.

Read in Chinese

👋 About Us

We are the Alibaba AMAP CV Lab. We focus on cutting-edge research and innovative applications in computer vision, and we build core technological capabilities for the spatiotemporal internet. The lab is a key technology practitioner for Alibaba's spatially intelligent internet. Working at the intersection of the physical and digital worlds, we use AI to empower smart mobility and daily life. As a core technical driver within Amap, our team pioneers:

  • Next-Generation 3D Map Engines
  • Multimodal Understanding & Generation
  • Spatial Intelligence
  • World Modeling

We welcome contributions, issues, and feedback!
Feel free to ⭐ the repos below to stay updated.

🔈 Latest News

  • 🏛 Jul 05, 2025 – Our paper FantasyTalking was accepted by ACM MM 2025.
  • 🏛 Jun 26, 2025 – Our paper SeqGrowGraph was accepted by ICCV 2025.
  • 📢 May 23, 2025 – We released the full project of FSDrive.
  • 🏛 Apr 29, 2025 – Our paper G3PT was accepted by IJCAI 2025.
  • 📢 Apr 28, 2025 – We released the inference code and model weights of FantasyTalking.
  • 📢 Apr 24, 2025 – We released the inference code and model weights of FantasyID.

🔧 Public Technologies

🗺️ 3D Map Engine

Next-generation engine for real-time rendering and updating of large-scale 3D maps with high accuracy.

📑 SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

arXiv Publish

A generative framework that reframes lane network learning as a process of incrementally building an adjacency matrix.
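
To make the idea of building an adjacency matrix incrementally concrete, here is a minimal sketch. It is not the SeqGrowGraph model and makes no claim about how the paper represents or predicts expansions. The function name `grow_graph` and the `(incoming, outgoing)` step format are assumptions made purely for illustration.

```python
def grow_graph(expansions):
    """Build a directed adjacency matrix one node at a time.

    Hypothetical step format: each expansion is a pair (incoming, outgoing)
    of index lists that refer to nodes which already exist. `incoming` lists
    the existing nodes with an edge into the new node, and `outgoing` lists
    the existing nodes that the new node points to.
    """
    adj = []  # adj[i][j] == 1 means there is a directed edge i -> j
    for incoming, outgoing in expansions:
        n = len(adj)  # index of the node being added in this step
        if any(not 0 <= k < n for k in list(incoming) + list(outgoing)):
            raise ValueError("expansion refers to a node that does not exist yet")
        # Extend every existing row by one column: edges into the new node.
        for i, row in enumerate(adj):
            row.append(1 if i in incoming else 0)
        # Append the new row: edges out of the new node, with no self-loop.
        adj.append([1 if j in outgoing else 0 for j in range(n)] + [0])
    return adj


# Three lane segments: 0 -> 1, 1 -> 2, and 2 -> 0.
steps = [([], []), ([0], []), ([1], [0])]
matrix = grow_graph(steps)
for row in matrix:
    print(row)
```

Each step grows the matrix from n×n to (n+1)×(n+1), so the whole graph can be described as a sequence of such expansions.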

📑 Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map

Home Page arXiv Publish

Benchmark and multi-modal approach for integrating lane-level traffic sign regulations into vectorized HD maps.

📑 Global-Guided Focal Neural Radiance Field for Large-Scale Scene Representation

Home Page arXiv Publish

Spatial Intelligence

Framework for spatial reasoning and path planning in autonomous navigation and robotics.

📑 FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

The first vision-language-action (VLA) model for visual reasoning in autonomous driving. It proposes a spatio-temporal chain of thought (CoT) to think visually about trajectory planning, and it unifies visual generation and understanding with minimal data.

Home Page arXiv Code

🌈 Multimodal AI

Toolkit for unified understanding and generation across text, image, video, audio and spatial data.

📑 G3PT: Unleash the Power of Autoregressive Modeling in 3D Generative Tasks

arXiv Publish

The first native 3D generation foundation model based on next-scale autoregression.

📑 A Study on the Adverse Impact of Synthetic Speech on Speech Recognition

Publish

Performance analysis and novel solution exploration for speech recognition under synthetic speech interference.

🤖 Human AIGC

Our family of human-related AIGC models; more are coming soon. Please check out our Fantasy AIGC Family for more details.

🤡 FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

Home Page arXiv Code

A novel expression-driven video-generation method that pairs emotion-enhanced learning with masked cross-attention, enabling the creation of high-quality, richly expressive animations for both single and multi-portrait scenarios.

🗣️ FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

Home Page arXiv Publish HuggingFace HuggingFace ModelScope Code

The first Wan-based high-fidelity audio-driven avatar system that synchronizes facial expressions, lip motion, and body gestures in dynamic scenes.

🆔 FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation

Home Page arXiv HuggingFace ModelScope Code

A tuning-free text-to-video model that leverages 3D facial priors, multi-view augmentation, and layer-aware guidance injection to deliver dynamic, identity-preserving video generation.

📑 HumanRig: Learning Automatic Rigging for Humanoid Characters in Animation

Home Page arXiv Publish Code HuggingFace

The first dataset for automatic rigging of 3D generated digital humans and a transformer-based end-to-end automatic rigging algorithm.

World Modeling

Platform for constructing and querying dynamic digital twins of real-world environments.
🔜 Coming soon

💡 Others

📑 DPOSE: Online Keypoint-CAM Guided Inference for Driver Pose Estimation

Publish

An optimization scheme for a proprietary human pose estimation (HPE) task in driver monitoring system (DMS) scenarios. It combines a pose-wise hard mining strategy for distribution balance with an online keypoint-aligned Grad-CAM loss that constrains activations to semantic regions.

🤖 Doubly-Fused ViT: Fuse Information from Dual Vision Transformer Streams

Publish Code

📑 SCMT: Self-Correction Mean Teacher for Semi-supervised Object Detection

Publish

A self-correction mean teacher architecture that mitigates the impact of noisy pseudo-labels in semi-supervised object detection.
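
For readers unfamiliar with the mean teacher setup, the sketch below shows the generic exponential moving average (EMA) update that mean teacher methods commonly use to derive teacher weights from student weights. It does not implement SCMT's self-correction mechanism. The function name `ema_update`, the dictionary-of-floats weight format, and the decay values are assumptions for illustration only.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    Hypothetical weight format: plain dicts mapping parameter names to
    floats. A real detector would apply the same rule per tensor.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}

# One update step: the teacher moves 10% of the way toward the student.
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

Because the teacher changes slowly, its pseudo-labels tend to be more stable than the student's raw predictions, though they can still be noisy.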

Popular repositories

  1. MV-Painter (Public): Python, 261 stars, 28 forks

  2. .github (Public)
