worldbench/awesome-vla-for-ad

# 😎 Awesome VLA for Autonomous Driving

This survey reviews vision-action (VA) and vision-language-action (VLA) models for autonomous driving.

## Table of Contents

1. Vision-Action Models
2. Vision-Language-Action Models
3. Datasets & Benchmarks
4. Applications
5. Other Resources

## 1. Vision-Action Models

### 1️⃣ Action-Only Models

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| LBC | Learning by Cheating (arXiv) | CoRL 2020 | - | GitHub |
| Latent-DRL | End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances (arXiv) | CVPR 2020 | - | - |
| NEAT | NEAT: Neural Attention Fields for End-to-End Autonomous Driving (arXiv) | ICCV 2021 | - | GitHub |
| Roach | End-to-End Urban Driving by Imitating a Reinforcement Learning Coach (arXiv) | ICCV 2021 | Website | GitHub |
| WoR | Learning to Drive from A World on Rails (arXiv) | ICCV 2021 | Website | GitHub |
| TCP | Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline (arXiv) | NeurIPS 2022 | - | GitHub |
| Urban-Driver | Urban Driver: Learning to Drive from Real-world Demonstrations Using Policy Gradients (arXiv) | CoRL 2022 | Website | GitHub |
| LAV | Learning from All Vehicles (arXiv) | CVPR 2022 | Website | GitHub |
| TransFuser | TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving (arXiv) | TPAMI 2023 | - | GitHub |
| GRI | GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving (arXiv) | Robotics 2023 | - | - |
| BEVPlanner | Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? (arXiv) | CVPR 2024 | - | GitHub |
| Raw2Drive | Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2) (arXiv) | arXiv 2025 | - | - |
| GoalFlow | GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving (arXiv) | CVPR 2025 | Website | GitHub |
| RAD | RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning (arXiv) | arXiv 2025 | Website | - |

### 2️⃣ Perception-Action Models

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| ST-P3 | ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning (arXiv) | ECCV 2022 | - | GitHub |
| UniAD | Planning-oriented Autonomous Driving (arXiv) | CVPR 2023 | - | GitHub |
| VAD | VAD: Vectorized Scene Representation for Efficient Autonomous Driving (arXiv) | ICCV 2023 | - | GitHub |
| OccNet | Scene as Occupancy (arXiv) | ICCV 2023 | - | GitHub |
| GenAD | GenAD: Generative End-to-End Autonomous Driving (arXiv) | ECCV 2024 | - | GitHub |
| PARA-Drive | PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving (CVPR) | CVPR 2024 | Website | - |
| SparseAD | SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving (arXiv) | arXiv 2024 | - | - |
| GaussianAD | GaussianAD: Gaussian-Centric End-to-End Autonomous Driving (arXiv) | arXiv 2024 | - | - |
| DiFSD | DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving (arXiv) | arXiv 2024 | - | GitHub |
| DriveTransformer | DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving (arXiv) | ICLR 2025 | - | GitHub |
| SparseDrive | SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation (arXiv) | ICRA 2025 | - | GitHub |
| DiffusionDrive | DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving (arXiv) | CVPR 2025 | - | GitHub |

### 3️⃣ Image-based World Models

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DriveDreamer | DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving (arXiv) | ECCV 2024 | Website | GitHub |
| GenAD | GenAD: Generalized Predictive Model for Autonomous Driving (arXiv) | CVPR 2024 | - | GitHub |
| Drive-WM | Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving (arXiv) | CVPR 2024 | Website | GitHub |
| DrivingWorld | DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT (arXiv) | arXiv 2024 | Website | GitHub |
| Imagine-2-Drive | Imagine-2-Drive: Leveraging High-Fidelity World Models via Multi-Modal Diffusion Policies (arXiv) | IROS 2025 | Website | - |
| DrivingGPT | DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers (arXiv) | arXiv 2024 | Website | - |
| Epona | Epona: Autoregressive Diffusion World Model for Autonomous Driving (arXiv) | ICCV 2025 | Website | GitHub |

### 4️⃣ Occupancy-based World Models

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| OccWorld | OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving (arXiv) | ECCV 2024 | Website | GitHub |
| NeMo | Neural Volumetric World Models for Autonomous Driving (ECCV) | ECCV 2024 | - | - |
| OccVAR | OCCVAR: Scalable 4D Occupancy Prediction via Next-Scale Prediction (OpenReview) | OpenReview 2024 | - | - |
| RenderWorld | RenderWorld: World Model with Self-Supervised 3D Label (arXiv) | arXiv 2024 | - | - |
| DFIT-OccWorld | An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training (arXiv) | arXiv 2024 | - | - |
| Drive-OccWorld | Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving (arXiv) | AAAI 2025 | Website | GitHub |
| T³Former | Temporal Triplane Transformers as Occupancy World Models (arXiv) | arXiv 2025 | - | - |

### 5️⃣ Latent-based World Models

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| Covariate-Shift | Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models (arXiv) | arXiv 2024 | - | - |
| World4Drive | World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model (arXiv) | ICCV 2025 | - | - |
| WoTE | End-to-End Driving with Online Trajectory Evaluation via BEV World Model (arXiv) | ICCV 2025 | - | GitHub |
| LAW | Enhancing End-to-End Autonomous Driving with Latent World Model (arXiv) | ICLR 2025 | - | GitHub |
| SSR | Navigation-Guided Sparse Scene Representation for End-to-End Autonomous Driving (arXiv) | ICLR 2025 | - | GitHub |
| Echo-Planning | Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back (arXiv) | arXiv 2025 | - | - |
| SeerDrive | Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution (arXiv) | NeurIPS 2025 | - | GitHub |

## 2. Vision-Language-Action Models

### 1️⃣ Textual Action Generator

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DriveMLM | DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving (arXiv) | arXiv 2023 | - | GitHub |
| DriveLikeAHuman | Drive Like a Human: Rethinking Autonomous Driving with Large Language Models (arXiv) | arXiv 2023 | Website | GitHub |
| GPT-Driver | GPT-Driver: Learning to Drive with GPT (arXiv) | NeurIPSW 2023 | Website | GitHub |
| RAG-Driver | RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model (arXiv) | RSS 2024 | Website | GitHub |
| RDA-Driver | Making Large Language Models Better Planners with Reasoning-Decision Alignment (arXiv) | ECCV 2024 | - | - |
| DriveLM | DriveLM: Driving with Graph Visual Question Answering (arXiv) | ECCV 2024 | Website | GitHub |
| DriveGPT4 | DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model (arXiv) | RA-L 2024 | Website | - |
| DriVLMe | DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experience (arXiv) | IROS 2024 | Website | GitHub |
| LLaDA | Driving Everywhere with Large Language Model Policy Adaptation (arXiv) | CVPR 2024 | Website | GitHub |
| VLAAD | VLAAD: Vision and Language Assistant for Autonomous Driving (WACVW) | WACVW 2024 | - | GitHub |
| OccLLaMA | OccLLaMA: A Unified Occupancy-Language-Action World Model for Understanding and Generation Tasks in Autonomous Driving (arXiv) | arXiv 2024 | Website | - |
| Doe-1 | Doe-1: Closed-Loop Autonomous Driving with Large World Model (arXiv) | arXiv 2024 | Website | GitHub |
| SafeAuto | SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models (arXiv) | ICML 2025 | - | GitHub |
| OpenEMMA | OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving (arXiv) | arXiv 2024 | - | GitHub |
| ReasonPlan | ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving (arXiv) | CoRL 2025 | - | GitHub |
| FutureSightDrive | FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving (arXiv) | NeurIPS 2025 | Website | GitHub |
| WKER | World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving (arXiv) | AAAI 2025 | - | - |
| OmniDrive | OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning (arXiv) | CVPR 2025 | - | GitHub |
| EMMA | EMMA: End-to-End Multimodal Model for Autonomous Driving (arXiv) | arXiv 2024 | Website | - |
| AlphaDrive | AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning (arXiv) | arXiv 2025 | - | GitHub |
| Occ-LLM | Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models (arXiv) | arXiv 2025 | - | - |
| DriveAgent-R1 | DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception (arXiv) | arXiv 2025 | - | - |
| Sce2DriveX | Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning (arXiv) | arXiv 2025 | - | - |
| Drive-R1 | Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning (arXiv) | arXiv 2025 | - | - |
| FastDriveVLA | FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning (arXiv) | arXiv 2025 | - | - |
| WiseAD | WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model (arXiv) | arXiv 2024 | Website | GitHub |
| ImpromptuVLA | Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models (arXiv) | arXiv 2025 | Website | GitHub |
| LightEMMA | LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving (arXiv) | arXiv 2025 | - | GitHub |
| AutoDrive-R² | AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving (arXiv) | arXiv 2025 | - | - |
| OmniReason | OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving (arXiv) | arXiv 2025 | - | - |

### 2️⃣ Numerical Action Generator

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| LMDrive | LMDrive: Closed-Loop End-to-End Driving with Large Language Models (arXiv) | CVPR 2024 | Website | GitHub |
| LINGO-2 | LINGO-2: Driving with Natural Language | 2024 | Website | - |
| CoVLA-Agent | CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving (arXiv) | WACV 2025 | Website | - |
| ORION | ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation (arXiv) | ICCV 2025 | Website | GitHub |
| AutoVLA | AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning (arXiv) | NeurIPS 2025 | Website | GitHub |
| SimLingo | SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment (arXiv) | CVPR 2025 | Website | GitHub |
| DriveGPT4-V2 | DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving (CVPR) | CVPR 2025 | - | - |
| S4-Driver | S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation (arXiv) | CVPR 2025 | Website | - |
| VaViM | VaViM and VaVAM: Autonomous Driving through Video Generative Modeling (arXiv) | arXiv 2025 | Website | GitHub |
| BEVDriver | BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving (arXiv) | arXiv 2025 | - | - |
| DriveMoE | DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving (arXiv) | arXiv 2025 | Website | GitHub |
| DSDrive | DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning (arXiv) | arXiv 2025 | - | - |
| OpenDriveVLA | OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model (arXiv) | arXiv 2025 | Website | GitHub |
| PLA | A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving (arXiv) | arXiv 2025 | - | - |
| OccVLA | OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision (arXiv) | arXiv 2025 | - | - |

### 3️⃣ Explicit Action Guidance

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DriveVLM | DriveVLM: The convergence of autonomous driving and large vision-language models (arXiv) | CoRL 2024 | Website | - |
| DualAD | DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving (arXiv) | IROS 2025 | Website | GitHub |
| FasionAD | FASIONAD: FAst and Slow FusION Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback (arXiv) | arXiv 2024 | - | - |
| Senna | Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving (arXiv) | arXiv 2024 | - | GitHub |
| LeapAD | Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving (arXiv) | NeurIPS 2024 | Website | GitHub |
| DME-Driver | DME-Driver: Integrating human decision logic and 3D scene perception in autonomous driving (arXiv) | AAAI 2025 | - | - |
| SOLVE | SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving (arXiv) | CVPR 2025 | - | - |
| FasionAD++ | FASIONAD++: Integrating High-Level Instruction and Information Bottleneck in FAt-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback (arXiv) | arXiv 2025 | - | - |
| LeapVAD | LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking (arXiv) | arXiv 2025 | - | - |
| DiffVLA | DiffVLA: Vision-language guided diffusion planning for autonomous driving (arXiv) | arXiv 2025 | - | - |
| ReAL-AD | ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving (arXiv) | arXiv 2025 | Website | - |

### 4️⃣ Implicit Representations Transfer

⏲️ In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| VLP | VLP: Vision Language Planning for Autonomous Driving (arXiv) | CVPR 2024 | - | - |
| VLM-AD | VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision (arXiv) | CoRL 2025 | - | - |
| DiMA | Distilling Multi-modal Large Language Models for Autonomous Driving (arXiv) | CVPR 2025 | - | - |
| ALN-P3 | ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving (arXiv) | arXiv 2025 | - | - |
| VERDI | VERDI: VLM-Embedded Reasoning for Autonomous Driving (arXiv) | arXiv 2025 | - | - |
| VLM-E2E | VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion (arXiv) | arXiv 2025 | - | - |
| ReCogDrive | ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving (arXiv) | arXiv 2025 | Website | GitHub |
| InsightDrive | InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving (arXiv) | arXiv 2025 | - | GitHub |
| NetRoller | NetRoller: Interfacing General and Specialized Models for End-to-End Autonomous Driving (arXiv) | arXiv 2025 | - | GitHub |
| ETA | ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models (arXiv) | arXiv 2025 | - | GitHub |
| ViLaD | ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving (arXiv) | arXiv 2025 | - | - |

## 3. Datasets & Benchmarks

### 1️⃣ Vision-Action Datasets

⏲️ In chronological order, from the earliest to the latest.

| Dataset | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| BDD100K | BDD100K: A diverse driving dataset for heterogeneous multitask learning (arXiv) | CVPR 2020 | Website | GitHub |
| nuScenes | nuScenes: A multimodal dataset for autonomous driving (arXiv) | CVPR 2020 | Website | - |
| Waymo | Scalability in perception for autonomous driving: Waymo open dataset (arXiv) | CVPR 2020 | Website | GitHub |
| nuPlan | nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles (arXiv) | arXiv 2021 | Website | GitHub |
| Argoverse 2 | Argoverse 2: Next generation datasets for self-driving perception and forecasting (arXiv) | NeurIPS 2021 | Website | GitHub |
| Bench2Drive | Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving (arXiv) | NeurIPS 2024 | - | GitHub |
| RoboBEV | Benchmarking and improving bird's eye view perception robustness in autonomous driving (arXiv) | TPAMI 2025 | - | GitHub |

### 2️⃣ Vision-Language-Action Datasets

| Dataset | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| BDD-X | Textual explanations for self-driving vehicles (arXiv) | ECCV 2018 | - | GitHub |
| Talk2Car | Talk2Car: Predicting physical trajectories for natural language commands (IEEE) | IEEE Access 2022 | - | GitHub |
| SDN | DOROTHIE: Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents (arXiv) | EMNLP 2022 | - | GitHub |
| DriveMLM | DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving (arXiv) | arXiv 2023 | - | GitHub |
| LMDrive | LMDrive: Closed-Loop End-to-End Driving with Large Language Models (arXiv) | CVPR 2024 | Website | GitHub |
| DriveLM-nuScenes | DriveLM: Driving with graph visual question answering (arXiv) | ECCV 2024 | Website | GitHub |
| DriveLM-CARLA | DriveLM: Driving with graph visual question answering (arXiv) | ECCV 2024 | Website | GitHub |
| HBD | DME-Driver: Integrating human decision logic and 3D scene perception in autonomous driving (arXiv) | AAAI 2025 | - | - |
| VLAAD | VLAAD: Vision and Language Assistant for Autonomous Driving (WACVW) | WACVW 2024 | - | GitHub |
| SUP-AD | DriveVLM: The convergence of autonomous driving and large vision-language models (arXiv) | CoRL 2024 | Website | - |
| NuInstruct | Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models (arXiv) | CVPR 2024 | - | GitHub |
| WOMD-Reasoning | WOMD-Reasoning: A large-scale dataset for interaction reasoning in driving (arXiv) | ICML 2025 | Website | GitHub |
| DriveCoT | DriveCoT: Integrating chain-of-thought reasoning with end-to-end driving (arXiv) | arXiv 2024 | Website | - |
| Reason2Drive | Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving (arXiv) | ECCV 2024 | - | GitHub |
| DriveBench | Are VLMs ready for autonomous driving? An empirical study from the reliability, data, and metric perspectives (arXiv) | ICCV 2025 | Website | GitHub |
| MetaAD | AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning (arXiv) | arXiv 2025 | Website | GitHub |
| OmniDrive | OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning (arXiv) | CVPR 2025 | - | GitHub |
| NuInteract | Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving (arXiv) | arXiv 2025 | - | - |
| DriveAction | DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models (arXiv) | arXiv 2025 | - | - |
| ImpromptuVLA | Impromptu VLA: Open weights and open data for driving vision-language-action models (arXiv) | arXiv 2025 | Website | GitHub |
| CoVLA | CoVLA: Comprehensive vision-language-action dataset for autonomous driving (arXiv) | WACV 2025 | Website | - |
| OmniReason-nuScenes | OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving (arXiv) | arXiv 2025 | - | - |
| OmniReason-B2D | OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving (arXiv) | arXiv 2025 | - | - |

## 4. Applications

## 5. Other Resources
