
Awesome LM for Autonomous Driving Decision-Making

A comprehensive list of awesome research, resources, and tools for leveraging Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action (VLA) models in autonomous driving decision-making and motion planning. Contributions are welcome!

Table of Contents

📚Survey Papers

Title Year Categories Project
End-to-end Autonomous Driving: Challenges and Frontiers TPAMI 2025 End to End Project
On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling ICAPS 2024 Large Language Models Project
LLM4Drive: A Survey of Large Language Models for Autonomous Driving Arxiv 2023 Large Language Models Project
Prospective role of foundation models in advancing autonomous vehicles Research 2024 Foundation Models
Large models for intelligent transportation systems and autonomous vehicles: A survey AEI 2024 Foundation Models
Vision language models in autonomous driving: A survey and outlook IV 2024 Foundation Models Project
A Survey on Multimodal Large Language Models for Autonomous Driving WACV 2024 Foundation Models
A survey for foundation models in autonomous driving Arxiv 2024 Foundation Models
Foundation Models for Decision Making: Problems, Methods, and Opportunities Arxiv 2023 Foundation Models
Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities Arxiv 2024 Foundation Models
Generative AI for Autonomous Driving: Frontiers and Opportunities Arxiv 2025 Generative AI Project

📄Research Papers

Categorization by Model Type

This section groups papers by the primary type of foundation model employed for decision-making.

LLM-based Approaches

LLMs are primarily leveraged for their reasoning, planning, and natural language understanding/generation capabilities to guide autonomous driving decisions. A dominant trend in this area is the hybridization or agentic use of LLMs. Pure LLM-driven control is rare due to challenges in real-time performance, safety assurance, and precise numerical output. Instead, LLMs often function at a strategic or tactical level, acting as a "supervisor," "planner," or "reasoner" that guides more traditional or specialized modules like DRL agents or MPC controllers. This hierarchical system leverages the LLM's strengths in high-level reasoning and contextual understanding while offloading operational, real-time aspects to other components. This approach aims to augment specific parts of the AD stack, particularly those requiring human-like commonsense, reasoning about novel situations, or providing interpretable justifications for actions.
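To make the hierarchy concrete, here is a minimal, illustrative Python sketch of the pattern: the LLM is queried only for a discrete maneuver, which selects hand-tuned cost weights for a conventional MPC. The query_llm helper, the preset names, and the weights are all assumptions for illustration, not taken from any of the papers below.

```python
# Minimal sketch of the "LLM as high-level planner" pattern described above.
# query_llm() is a hypothetical placeholder for any chat-completion API;
# the maneuver-to-weights mapping is illustrative, not from any specific paper.
from dataclasses import dataclass

@dataclass
class MPCWeights:
    speed_tracking: float
    lane_keeping: float
    comfort: float

# Hand-tuned presets: the LLM only chooses the maneuver, the MPC stays in charge
# of real-time control with the corresponding cost weights.
PRESETS = {
    "KEEP_LANE":   MPCWeights(speed_tracking=1.0, lane_keeping=2.0, comfort=1.0),
    "CHANGE_LEFT": MPCWeights(speed_tracking=0.5, lane_keeping=0.5, comfort=1.5),
    "YIELD":       MPCWeights(speed_tracking=0.2, lane_keeping=2.0, comfort=2.0),
}

def query_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call; returns one of the preset keys."""
    return "KEEP_LANE"

def plan_step(scene_description: str) -> MPCWeights:
    prompt = (
        "You are a driving supervisor. Given the scene, answer with exactly one of "
        f"{list(PRESETS)}.\nScene: {scene_description}"
    )
    maneuver = query_llm(prompt).strip()
    # Fall back to a conservative default if the LLM output is not a known maneuver.
    return PRESETS.get(maneuver, PRESETS["YIELD"])

if __name__ == "__main__":
    print(plan_step("Two-lane highway, slow truck ahead, left lane clear."))
```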

Method Introduction Year Project
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning
DetailsThis work proposes a hybrid framework where a "teacher" LLM guides an attention-based "student" Deep Reinforcement Learning (DRL) policy. The LLM utilizes Chain-of-Thought (CoT) reasoning, incorporates risk metrics, and retrieves historical scenarios to produce high-level driving strategies. This approach aims to improve the DRL agent's sample complexity and robustness while ensuring real-time feasibility, a common challenge for LLMs when used in isolation for decision-making.
Arxiv 2025 Project
DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning
DetailsDSDrive is a lightweight end-to-end (E2E) autonomous driving framework that uses knowledge distillation to transfer the reasoning capabilities of large vision-language models (VLMs) into a compact LLM-based multimodal model, and unifies reasoning and planning through a waypoint-driven dual-head coordination module. DSDrive significantly reduces computational requirements while maintaining strong performance, offering an efficient and explainable solution for resource-constrained autonomous driving systems.
Arxiv 2025
Dilu: A knowledge-driven approach to autonomous driving with large language models
DetailsDiLu proposes a knowledge-driven framework for autonomous driving that combines Reasoning and Reflection modules within an LLM. This enables the system to make decisions based on common-sense knowledge and to continuously evolve its understanding and strategies through experience.
ICLR 2024 Code / Project
Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles
Details
MITS 2024
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
DetailsThis system employs an LLM as a high-level decision-making component, particularly for complex AD scenarios that demand human commonsense understanding. The LLM processes environmental data through scenario encoding, provides action guidance, and adjusts confidence levels. These high-level decisions are then translated into precise parameters for a low-level Model Predictive Controller (MPC), thereby enhancing interpretability and enabling the system to handle complex maneuvers, including multi-vehicle coordination.
Arxiv 2023 Project
A Language Agent for Autonomous Driving
DetailsThis framework positions an LLM as a cognitive agent. The agent has access to a versatile tool library (for perception and prediction tasks), a cognitive memory storing commonsense knowledge and past driving experiences, and a reasoning engine. The reasoning engine is capable of chain-of-thought reasoning, task planning, motion planning, and self-reflection, showcasing a more integrated and sophisticated agentic approach to autonomous driving.
Arxiv 2023 Project
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
Details
ICRA 2024 Code
Drive Like a Human: Rethinking Autonomous Driving with Large Language Models
Details
Arxiv 2023 Code
Empowering autonomous driving with large language models: A safety perspective
DetailsThis research explores the integration of LLMs as intelligent decision-makers within the behavioral planning module of AD systems. A key feature is the augmentation of LLMs with a safety verifier shield, which facilitates contextual safety learning. The paper presents studies on an adaptive LLM-conditioned MPC and an LLM-enabled interactive behavior planning scheme using a state machine, demonstrating improved safety metrics.
ICLR 2024
Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM
DetailsThis framework introduces an interpretable decision-maker that leverages a Traffic Regulation Retrieval (TRR) Agent, built upon Retrieval-Augmented Generation (RAG). This agent automatically retrieves relevant traffic rules and guidelines from extensive documents. An LLM-powered reasoning module then interprets these rules, differentiates between mandatory regulations and safety guidelines, and assesses actions for legal compliance and safety, enhancing transparency.
Arxiv 2024
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning
Details
Arxiv 2025 Project
PADriver: Towards Personalized Autonomous Driving
Details
Arxiv 2025
CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting
Details
Arxiv 2025
LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models
Details
Arxiv 2025
Distilling Multi-modal Large Language Models for Autonomous Driving
Details
Arxiv 2025

VLM-based Approaches

VLMs bring visual understanding to the decision-making process, allowing for richer interpretations of the driving scene and enabling actions based on both visual percepts and linguistic instructions or reasoning. A key challenge for VLMs in AD decision-making is bridging the gap between their often 2D-centric visual-linguistic understanding and the precise, 3D spatio-temporal reasoning essential for safe driving. Many current VLMs are adapted from models pre-trained on large, static 2D image-text datasets, and they can struggle with the dynamic, three-dimensional nature of real-world driving scenarios. This means that while they might excel at describing a scene, their practical performance in actual driving tasks can be concerning. Effective VLMs for AD decision-making will likely need to incorporate stronger 3D visual backbones, improved mechanisms for temporal modeling beyond simple frame concatenation, and potentially integrate more structured environmental representations like Bird's-Eye-View (BEV) maps or scene graphs directly into their reasoning processes.
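As one illustration of injecting structured environmental representations, below is a minimal PyTorch sketch of fusing a sentence embedding into BEV features with a learnable weight, loosely in the spirit of the BEV-Text fusion idea described for VLM-E2E below; the shapes, layers, and gating scheme are assumptions for illustration only.

```python
# Minimal PyTorch sketch of fusing a text embedding into BEV features with a
# learnable weight. Shapes and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class BEVTextFusion(nn.Module):
    def __init__(self, bev_channels: int, text_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, bev_channels)  # map text to BEV channel space
        self.alpha = nn.Parameter(torch.tensor(0.5))        # learnable fusion weight

    def forward(self, bev: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # bev: (B, C, H, W) spatial features; text: (B, D) sentence embedding
        t = self.text_proj(text)[:, :, None, None]          # (B, C, 1, 1), broadcast over space
        return self.alpha * bev + (1.0 - self.alpha) * (bev * torch.sigmoid(t))

fusion = BEVTextFusion(bev_channels=64, text_dim=768)
bev = torch.randn(2, 64, 200, 200)
text = torch.randn(2, 768)
print(fusion(bev, text).shape)  # torch.Size([2, 64, 200, 200])
```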

Method Introduction Year Project
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
DetailsResearch on combining autonomous driving with vision-language models (VLMs) is growing rapidly, and VLMs have already shown their value in driving tasks. This paper introduces LightEMMA, a lightweight end-to-end multimodal framework for autonomous driving that integrates and evaluates current commercial and open-source VLMs to study their roles and limitations in driving tasks, promoting the further development of VLMs for autonomous driving.
Arxiv 2025 Code
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
Details
Arxiv 2025 Code
X-Driver: Explainable Autonomous Driving with Vision-Language Models
Details
Arxiv 2025
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
DetailsAlphaDrive is a VLM tailored for high-level planning in autonomous driving. It integrates a Group Relative Policy Optimization (GRPO)-based reinforcement learning strategy with a two-stage reasoning training approach (Supervised Fine-Tuning followed by RL) to boost planning performance and training efficiency.
Arxiv 2025 Code
Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
Details
Arxiv 2025
VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving
Details
ICML 2025
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
DetailsThis framework aims to enhance end-to-end autonomous driving by using VLMs to provide driver attentional cues. It integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, enabling the model to learn richer feature representations that capture driver attentional semantics. It also introduces a BEV-Text learnable weighted fusion strategy.
Arxiv 2025
VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
Details
Arxiv 2025
WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model
Details
Arxiv 2024 Code / Project
CALMM-Drive: Confidence-Aware Autonomous Driving with Large Multimodal Model
Details
Arxiv 2024
OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving
Details
WACV 2025 Code
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
DetailsThis method positions VLMs as "teachers" to generate reasoning-based text annotations and structured action labels. These annotations serve as supplementary supervisory signals for training end-to-end AD models, aiming to improve their understanding beyond simple trajectory labels without requiring the VLM at inference time.
Arxiv 2024
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
DetailsThis work introduces a lightweight VLM designed for efficient multi-view reasoning in autonomous driving. It features a novel Text-Guided SoftSort Pooling (TGSSP) module that dynamically ranks and fuses visual features from multiple camera views based on the semantics of input queries. This query-aware aggregation aims to improve contextual accuracy and reduce computational overhead, making it more practical for real-time deployment.
Arxiv 2025
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
DetailsThis Multimodal Large Language Model (MLLM) is designed for interpretable end-to-end autonomous driving. It processes multi-frame video inputs and textual queries to interpret vehicle actions, provide relevant reasoning, and predict low-level control signals. A bespoke visual instruction tuning dataset aids its capabilities.
RAL 2024 Project
ADAPT: Action-aware Driving Caption Transformer
DetailsWhile primarily a transformer architecture, ADAPT functions similarly to a VLM by generating natural language narrations and reasoning for driving actions based on video input. It jointly trains the driving captioning task and the vehicular control prediction task, enhancing interpretability in decision-making.
ICRA 2023 Code
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
DetailsThis system uses a VLM for scene understanding, specifically to provide descriptions of critical objects that may influence driving decisions. This visual-linguistic understanding then feeds into a dual-process decision-making module composed of an Analytic LLM and a Heuristic lightweight language model.
NeurIPS 2024 Code
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DetailsThis system leverages VLMs for enhanced scene understanding and planning capabilities. It also proposes DriveVLM-Dual, a hybrid system that combines the strengths of DriveVLM with traditional AD pipelines to address VLM limitations in spatial reasoning and computational requirements, particularly for long-tail critical objects.
Arxiv 2024 Project
LingoQA: Visual Question Answering for Autonomous Driving
DetailsThis project introduces a benchmark and a large dataset (419.9k QA pairs from 28K unique video scenarios) for video question answering specifically in the autonomous driving domain. It focuses on evaluating a VLM's ability to perform reasoning, justify actions, and describe scenes, and proposes the Lingo-Judge metric for evaluation.
ECCV 2024 Code
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
DetailsThis training-free framework combines a VLM (for generalized object recognition, e.g., recognizing rare objects in AD scenarios) with the Segment-Anything Model (SAM, for generalized object localization). It uses attention maps from the VLM as prompts for SAM to address open-ended object detection and segmentation, which is crucial for robust perception feeding into decision-making systems.
NeurIPS 2024
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Details
Arxiv 2025 Code
Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving
Details
Arxiv 2025
FutureSightDrive: Visualizing Trajectory Planning with Spatio-Temporal CoT for Autonomous Driving
DetailsThe paper presents FSDrive, a framework that enables autonomous vehicles to perform visual reasoning for trajectory planning using a spatio-temporal Chain-of-Thought (CoT). Instead of relying on abstract text-based logic, FSDrive uses a visual language model to generate future scene representations as images, capturing spatial and temporal dynamics. It introduces a lightweight pretraining method to activate image generation in existing models and employs these visual predictions as intermediate reasoning steps.
Arxiv 2025 Code
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
DetailsThe article proposes Drive-R1, a domain-specific vision-language model for autonomous driving that integrates structured reasoning and trajectory planning using reinforcement learning. It addresses two key limitations in current VLM-based planning systems: overdependence on textual history over visual input, and poor alignment between reasoning and planning. By combining supervised fine-tuning on a curated chain-of-thought dataset with Group Relative Policy Optimization, Drive-R1 improves trajectory accuracy and safety. Experimental results on nuScenes and DriveLM-nuScenes show marginal gains over strong baselines. While methodologically sound, the contribution is incremental—the core advancement lies more in system integration than in fundamental algorithmic innovation.
Arxiv 2025

VLA-based Approaches

VLAs aim to create more generalist agents that can perceive, reason, and act, often in an end-to-end fashion. For autonomous driving, this means models that can take raw sensor data and high-level goals to produce driving actions. While VLAs offer the promise of true end-to-end decision-making by unifying perception, reasoning, and action generation, their application in safety-critical autonomous driving faces a significant hurdle: ensuring the reliability and verifiability of actions generated by these complex, often black-box, generative models. The potential for "hallucinated" or unexpected outputs from generative models is a recurring concern. A major research direction for VLAs in AD will involve developing methods for safety validation, uncertainty quantification, and robust fallback mechanisms. This might include hybrid approaches where VLA outputs are monitored or constrained by traditional safety layers, or novel training paradigms that explicitly optimize for safety and predictability.
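A minimal sketch of the "safety layer around VLA outputs" idea, assuming NumPy, a 2D waypoint trajectory, and illustrative acceleration and clearance thresholds; it is not the mechanism of any specific paper below.

```python
# Minimal sketch of monitoring a VLA-generated trajectory with a simple safety
# layer before execution; limits and the fallback strategy are illustrative.
import numpy as np

MAX_ACCEL = 3.0       # m/s^2, assumed feasibility/comfort bound
MIN_CLEARANCE = 1.5   # m, assumed clearance to obstacles
DT = 0.1              # s, planning step

def is_safe(traj: np.ndarray, obstacles: np.ndarray) -> bool:
    """traj: (T, 2) waypoints; obstacles: (N, 2) points in the same frame."""
    vel = np.diff(traj, axis=0) / DT
    acc = np.diff(vel, axis=0) / DT
    if np.linalg.norm(acc, axis=1).max(initial=0.0) > MAX_ACCEL:
        return False
    if obstacles.size:
        dists = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)
        if dists.min() < MIN_CLEARANCE:
            return False
    return True

def safe_execute(vla_traj: np.ndarray, obstacles: np.ndarray, fallback: np.ndarray) -> np.ndarray:
    """Return the VLA trajectory if it passes the checks, else a conservative fallback."""
    return vla_traj if is_safe(vla_traj, obstacles) else fallback

fallback = np.tile([[0.0, 0.0]], (10, 1))                 # e.g. a gentle stop-in-lane profile
traj = np.cumsum(np.full((10, 2), [1.0, 0.0]), axis=0)    # straight-ahead candidate
print(safe_execute(traj, obstacles=np.array([[5.0, 0.2]]), fallback=fallback))
```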

Method Introduction Year Project
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsThis is an end-to-end VLA model specifically designed for autonomous driving. It generates reliable driving actions conditioned on 3D environmental perception, ego vehicle states, and driver commands. Key methodological contributions include a hierarchical vision-language alignment process to bridge the modality gap between driving visual representations and language embeddings, and an autoregressive agent-env-ego interaction process to ensure spatially and behaviorally informed trajectory planning.
Arxiv 2025 Code
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Details
Arxiv 2025 Project
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
Details
Arxiv 2025 Code / Project
DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving
Details
Arxiv 2025
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
DetailsThe article presents AutoVLA, an automated framework for training vision-language-action (VLA) models in autonomous driving. It generates training data by combining a rule-based planner, a visual perception model, and GPT-4 for reasoning, eliminating the need for manual annotation. AutoVLA fine-tunes a VLA model using this data and further improves it with a reward model for trajectory quality. The resulting model achieves competitive performance on the DriveLM benchmark. The main contribution is a scalable, automated pipeline for training driving VLAs, emphasizing practicality over architectural novelty.
Arxiv 2025 Code / Project

Categorization by Research Direction

This section groups papers based on overarching research themes and objectives within AD decision-making.

End-to-End Driving Models

These models aim to learn a direct mapping from sensor inputs to driving actions or high-level plans, often minimizing handcrafted intermediate representations. The pursuit of end-to-end (E2E) models in autonomous driving is driven by the ambition to reduce error propagation inherent in modular pipelines and to potentially uncover novel, more effective driving strategies that might not emerge from separately optimized components. However, the "black box" nature and significant data requirements of traditional E2E deep learning models have been persistent challenges. The integration of LLMs, VLMs, and VLAs into E2E frameworks represents an effort to mitigate these issues by infusing these models with enhanced reasoning capabilities, better generalization from pre-training, and avenues for interpretability. This suggests a future where E2E AD systems are not purely opaque mappings but incorporate a semantic layer or reasoning backbone provided by foundation models, thus addressing key criticisms of earlier E2E approaches.

Method Introduction Year Project
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
DetailsThis framework is specifically designed for evaluating various VLMs in an end-to-end fashion for autonomous driving planning tasks. It provides an open-source baseline workflow for integrating VLMs into E2E planning, enabling rapid prototyping.
Arxiv 2025 Code
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
DetailsDriveGPT4 is presented as an interpretable end-to-end autonomous driving system based on LLMs. It processes multi-frame video inputs and textual queries, predicts low-level vehicle control signals, and offers reasoning for its actions.
RAL 2024 Project
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsThis VLA model is explicitly designed for end-to-end autonomous driving. It generates reliable driving trajectories conditioned on multimodal inputs including 3D environmental perception, ego vehicle state, and driver commands.
Arxiv 2025 Code
ADAPT: Action-aware Driving Caption Transformer
DetailsADAPT proposes an end-to-end transformer-based architecture that jointly trains a driving captioning task and a vehicular control prediction task through a shared video representation, aiming for user-friendly narration and reasoning.
ICRA 2023 Code
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
DetailsThis work focuses on closed-loop end-to-end driving specifically with large language models, indicating a direct application of LLMs in the E2E driving pipeline.
CVPR 2024 Code
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multi modal Driver Attention Fusion
DetailsThis research aims to enhance end-to-end autonomous driving by using VLMs to provide attentional cues and fusing multimodal information (BEV and text features) for semantic supervision.
Arxiv 2025
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
DetailsThis method leverages VLMs as teachers to provide reasoning-based text annotations, which serve as supplementary supervisory signals to train end-to-end AD pipelines, extending beyond standard trajectory labels.
Arxiv 2024
GenAD: Generative End-to-End Autonomous Driving
DetailsGenAD models autonomous driving as a trajectory generation problem, adopting an instance-centric scene tokenizer and a variational autoencoder for trajectory prior modeling in an E2E setup.
Arxiv 2024 Code
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Details
Arxiv 2025 Code
Distilling Multi-modal Large Language Models for Autonomous Driving
Details
Arxiv 2025
Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving
Details
Arxiv 2025
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
Details
Arxiv 2025 Code / Project
DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving
Details
Arxiv 2025
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
DetailsThe article presents AutoVLA, an automated framework for training vision-language-action (VLA) models in autonomous driving. It generates training data by combining a rule-based planner, a visual perception model, and GPT-4 for reasoning, eliminating the need for manual annotation. AutoVLA fine-tunes a VLA model using this data and further improves it with a reward model for trajectory quality. The resulting model achieves competitive performance on the DriveLM benchmark. The main contribution is a scalable, automated pipeline for training driving VLAs, emphasizing practicality over architectural novelty.
Arxiv 2025 Code / Project

Interpretability and Explainable AI (XAI)

Focuses on making the decision-making processes of AD systems transparent and understandable to humans. The integration of LLMs and VLMs is pushing XAI in autonomous driving beyond simple attention maps or feature visualizations towards generating natural language explanations and justifications that are genuinely comprehensible to human users, including passengers and regulators. This is crucial for building public trust, facilitating regulatory approval, and enabling more effective human-AI collaboration in the driving context. The challenge, however, lies in ensuring that these generated explanations are faithful to the model's actual decision-making process and are not merely plausible-sounding rationalizations generated post-hoc. Future work will need to concentrate on methods that tightly couple the reasoning and explanation generation with the core decision logic of the AD system.
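One simple way to couple a decision with its justification is to require the model to emit both in a single structured object that is validated before use; the schema below is a hypothetical sketch, not a format used by the papers listed here.

```python
# Minimal sketch of forcing the model to emit the decision and its justification
# together as one structured, validated object. The schema is illustrative only.
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"KEEP_LANE", "CHANGE_LEFT", "CHANGE_RIGHT", "BRAKE"}

@dataclass
class ExplainedDecision:
    action: str
    justification: str
    cited_objects: list  # e.g. track IDs the explanation refers to

def parse_decision(model_output: str) -> ExplainedDecision:
    """Parse and validate a JSON decision; raise if the action or fields are invalid."""
    decision = ExplainedDecision(**json.loads(model_output))
    if decision.action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {decision.action}")
    if not decision.justification:
        raise ValueError("justification must be non-empty")
    return decision

print(parse_decision(
    '{"action": "BRAKE", "justification": "pedestrian crossing ahead", "cited_objects": [12]}'
))
```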

Method Introduction Year Project
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
DetailsThe use of LLM-generated high-level decisions is explicitly stated to improve interpretability in complex autonomous driving scenarios. The system aims to make the "thinking process" visible.
Arxiv 2023
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
DetailsThis VLM framework employs Chain-of-Thought (CoT) prompting. CoT is used to enhance interpretability and facilitate structured reasoning within the VLM-based driving agents, allowing the model to output its reasoning steps.
Arxiv 2025 Code
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
DetailsA primary goal of DriveGPT4 is to develop an interpretable end-to-end autonomous driving solution. The MLLM is designed to interpret vehicle actions, offer pertinent reasoning, and address user queries, thereby making the system's behavior understandable.
RAL 2024 Project
ADAPT: Action-aware Driving Caption Transformer
DetailsThis framework provides user-friendly natural language narrations and reasoning for each decision-making step of autonomous vehicular control and action. For example, it can output "[Action narration:] the car pulls over to the right side of the road, because the car is parking".
ICRA 2023 Code
LingoQA: Visual Question Answering for Autonomous Driving
Details By benchmarking video question answering, LingoQA facilitates the development of models that can justify actions and describe scenes in natural language, directly contributing to the explainability of driving decisions.
ECCV 2024 Code
Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM
DetailsThe reasoning module in this framework is explicitly designed to be interpretable, enhancing transparency in how traffic rules are identified, interpreted, and applied to driving decisions.
Arxiv 2024
Explainable Artificial Intelligence for Autonomous Driving: A Comprehensive Overview and Field Guide for Future Research Directions
DetailsThis overview provides broader context on XAI in autonomous vehicles. It emphasizes that XAI serves to bridge complex technological capabilities with human understanding, addressing safety assurance, regulatory compliance, and public trust. XAI can provide real-time justifications for actions (e.g., sudden braking) and post-hoc explanations (e.g., visual heat maps, natural language descriptions).
Arxiv 2024
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
DetailsThis work aims to provide generalisable driving explanations by employing retrieval-augmented in-context learning within Multi-Modal Large Language Models.
Arxiv 2024 Code / Project

Safety-Critical Decision-Making & Long-Tail Scenarios

Addresses the paramount challenge of ensuring safety, especially in rare, unforeseen (long-tail) situations where traditional, purely data-driven systems often falter due to lack of representative data. Effectively handling these scenarios requires more than just scaling up models; it demands robust reasoning, the integration of explicit and implicit knowledge (including safety rules and commonsense), and rigorous validation methodologies. The integration of LLMs and VLMs offers a promising avenue by leveraging their potential for abstract reasoning and broad knowledge. However, the inherent risk of these models generating incorrect or "hallucinated" outputs in critical situations necessitates a cautious approach. Future progress in this area will likely depend on hybrid architectures that combine the generalization capabilities of foundation models with explicit safety layers or verifiers, methods for effectively injecting structured safety knowledge (like traffic laws or physical constraints) into the decision-making loop, and the development of advanced simulation and testing protocols specifically designed to probe behavior in diverse long-tail scenarios and rigorously evaluate safety.
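A minimal sketch of a rule-based verifier "shield" that vetoes an LLM-proposed maneuver when simple safety rules are violated; the rules, thresholds, and action names are illustrative assumptions rather than any paper's implementation.

```python
# Minimal sketch of a verifier shield over LLM-proposed maneuvers; the rules and
# thresholds here are illustrative only.
from dataclasses import dataclass

@dataclass
class SceneState:
    ttc_front: float          # time-to-collision with lead vehicle, seconds
    traffic_light: str        # "red", "yellow", or "green"
    left_lane_free: bool

MIN_TTC = 2.0  # assumed minimum acceptable time-to-collision

def verify(action: str, state: SceneState) -> str:
    """Return the proposed action if it passes the rules, else a safe override."""
    if state.traffic_light == "red" and action != "BRAKE":
        return "BRAKE"
    if action == "ACCELERATE" and state.ttc_front < MIN_TTC:
        return "BRAKE"
    if action == "CHANGE_LEFT" and not state.left_lane_free:
        return "KEEP_LANE"
    return action

state = SceneState(ttc_front=1.2, traffic_light="green", left_lane_free=False)
print(verify("ACCELERATE", state))  # -> BRAKE
```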

Method Introduction Year Project
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
DetailsAims to improve generalization to rare events by leveraging the commonsense reasoning capabilities of LLMs, which is crucial for safety in unexpected situations.
Arxiv 2023
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsThis VLA model specifically targets the challenges of limited generalization to long-tail scenarios and insufficient understanding of high-level semantics within complex driving scenes, which are critical for safe decision-making.
Arxiv 2025 Code
Generative AI for Autonomous Driving: Frontiers and Opportunities
DetailsThis survey identifies comprehensive generalization across rare cases and the development of robust evaluation and safety checks as key obstacles and future opportunities for GenAI in autonomous driving.
Arxiv 2025 Project
Empowering Autonomous Driving with Large Language Models: A Safety Perspective
DetailsThis work directly focuses on enhancing safety by proposing methodologies that employ LLMs as intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning.
Arxiv 2024
Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving
DetailsThis MM-LLM is designed to address long-tail events by tokenizing the world into object-level knowledge, enabling better utilization of an LLM's reasoning capabilities for enhanced autonomous vehicle planning in such scenarios.
Arxiv 2024
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
DetailsThis method addresses the limitations of E2E models in handling diverse real-world scenarios by using VLMs as teachers to provide reasoning-based supervision, which can help in understanding and reacting to less common situations.
Arxiv 2024

Reinforcement Learning for Decision-Making

Utilizes Reinforcement Learning (RL) techniques, often guided or enhanced by foundation models (LLMs/VLMs), to learn optimal driving policies through interaction with simulated or real-world environments. The synergy between foundation models and RL represents a powerful emerging trend in autonomous driving. Foundation models can address key RL challenges, such as high sample complexity and difficult reward design, by providing high-level guidance, superior state representations, or even by directly shaping the reward function itself. This combination can lead to more data-efficient learning of complex driving behaviors, improved generalization to novel scenarios by leveraging the pre-trained knowledge embedded in foundation models, and more interpretable reward structures, especially if derived from linguistic goals. The primary research focus in this area will likely be on optimizing the structure of this synergy—for example, determining whether the LLM should act as a planner for an RL agent, a reward shaper, or if end-to-end RL fine-tuning of a VLM is the most effective approach.
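As a sketch of the reward-shaping direction (cf. LORD below), the snippet treats a pretrained encoder as a zero-shot reward model by penalizing similarity to an undesired goal such as "collision"; embed() is a hypothetical stand-in for a CLIP-style encoder and the shaping rule is an assumption.

```python
# Minimal sketch of using a pretrained encoder as a zero-shot reward model by
# penalizing similarity to an *undesired* goal. embed() is a placeholder.
import numpy as np

def embed(x: str) -> np.ndarray:
    """Placeholder encoder: returns a unit vector for a text or scene description."""
    rng = np.random.default_rng(abs(hash(x)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

UNDESIRED = embed("the ego vehicle collides with another road user")

def shaped_reward(scene_caption: str, base_reward: float, weight: float = 1.0) -> float:
    """Subtract similarity to the undesired goal from the environment reward."""
    sim = float(np.dot(embed(scene_caption), UNDESIRED))
    return base_reward - weight * max(sim, 0.0)

print(shaped_reward("ego brakes smoothly behind a stopped bus", base_reward=1.0))
```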

Method Introduction Year Project
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning
DetailsThis framework explicitly combines a "teacher" LLM with a "student" Deep Reinforcement Learning (DRL) agent. The LLM, using Chain-of-Thought reasoning and incorporating risk metrics and historical data, guides the DRL policy. This guidance aims to accelerate policy convergence and boost robustness across diverse driving conditions, mitigating DRL's high sample complexity and the LLM's real-time decision-making challenges.
Arxiv 2025 Project
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
DetailsAlphaDrive proposes a VLM-based reinforcement learning and reasoning framework specifically for autonomous driving planning. It introduces four Group Relative Policy Optimization (GRPO)-based RL rewards tailored for planning (planning accuracy, action-weighted, planning diversity, planning format) and employs a two-stage planning reasoning training strategy that combines Supervised Fine-Tuning (SFT) with RL. This approach is shown to significantly improve planning performance and training efficiency.
Arxiv 2025 Project
LORD: Large Models based Opposite Reward Design for Autonomous Driving
DetailsThis work introduces a novel approach to reward design in RL for autonomous driving. Instead of defining desired linguistic goals (e.g., "drive safely"), which can be ambiguous, LORD leverages large pretrained models (LLMs/VLMs) to focus on concrete undesired linguistic goals (e.g., "collision"). This allows for more efficient use of these models as zero-shot reward models, aiming for safer and enhanced autonomous driving with improved generalization.
WACV 2024
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
DetailsWhile the high-level decision-making in NaVILA is handled by a VLA generating linguistic commands, the low-level locomotion policy responsible for executing these commands is trained using RL. This demonstrates a hierarchical approach where RL handles the dynamic execution based on VLA guidance.
RSS 2025 Project
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning
DetailsThe article presents a fast-slow autonomous driving framework that combines an LLM for interpreting high-level user instructions with an RL agent for real-time control. The LLM generates structured directives based on context and memory, while the RL module ensures safe execution under dynamic conditions. Experiments demonstrate improved safety, comfort, and user alignment over baseline methods.
Arxiv 2025
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
DetailsThis research explores fine-tuning VLMs with RL for general multi-step goal-directed tasks. The VLM generates Chain-of-Thought reasoning leading to a text-based action, which is then parsed and executed in an interactive environment to obtain task rewards for RL-based fine-tuning. This general methodology is highly relevant for training decision-making agents in autonomous driving.
NeurIPS 2024 Code / Project
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
DetailsThe article proposes Drive-R1, a domain-specific vision-language model for autonomous driving that integrates structured reasoning and trajectory planning using reinforcement learning. It addresses two key limitations in current VLM-based planning systems: overdependence on textual history over visual input, and poor alignment between reasoning and planning. By combining supervised fine-tuning on a curated chain-of-thought dataset with Group Relative Policy Optimization, Drive-R1 improves trajectory accuracy and safety. Experimental results on nuScenes and DriveLM-nuScenes show marginal gains over strong baselines. While methodologically sound, the contribution is incremental—the core advancement lies more in system integration than in fundamental algorithmic innovation.
Arxiv 2025

World Models for Prediction and Planning

Involves building internal representations of the environment and its dynamics to predict future states and plan actions accordingly. Foundation models, particularly generative models like Diffusion Models, GANs, LLMs, and VLMs, are becoming key enablers for constructing more powerful and versatile world models for autonomous driving. These advanced world models are evolving beyond simple state prediction to encompass rich semantic understanding, the generation of diverse future scenarios (including long-tail events), and even interaction with language-based instructions or goals. This fusion of generative world models with the reasoning capabilities of LLMs/VLMs can lead to AD systems that perform sophisticated "what-if" analyses, anticipate a broader range of future possibilities, and plan more robustly by "imagining" the consequences of actions within a semantically rich, simulated future. This also has profound implications for creating highly realistic and controllable simulation environments for training and testing AD systems.
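A minimal sketch of "what-if" planning with a world model: sample candidate action sequences, roll them out inside the model, and execute the first action of the cheapest rollout. dynamics() and cost() are trivial placeholders for a learned model and a task cost.

```python
# Minimal sketch of rollout-based planning with a (placeholder) world model.
import numpy as np

def dynamics(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Placeholder world model: a trivial linear step instead of a learned network."""
    return state + 0.1 * action

def cost(state: np.ndarray) -> float:
    """Placeholder cost: distance to a goal state at the origin."""
    return float(np.linalg.norm(state))

def plan(state: np.ndarray, horizon: int = 10, samples: int = 64, action_dim: int = 2) -> np.ndarray:
    rng = np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(samples, horizon, action_dim))
    best_seq, best_cost = None, np.inf
    for seq in candidates:
        s, total = state.copy(), 0.0
        for a in seq:                      # imagine the rollout inside the model
            s = dynamics(s, a)
            total += cost(s)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq[0]                     # execute only the first action (MPC-style)

print(plan(np.array([5.0, -3.0])))
```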

Method Introduction Year Project
3D-VLA: A 3D Vision-Language-Action Generative World Model
DetailsThis framework explicitly incorporates a generative world model within its 3D vision-language-action architecture. It allows the model to "imagine" future scenarios by predicting goal images and point clouds, which then guide action planning. The model is built on a 3D-LLM and uses interaction tokens to engage with the environment.
ICML 2024 Code
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
DetailsDriveDreamer utilizes a powerful diffusion model to construct a comprehensive representation of the driving environment. It can generate future driving videos and driving policies, effectively acting as a multimodal world model. DriveDreamer-2 enhances this by incorporating an LLM to generate user-defined driving videos with improved temporal and spatial coherence.
Arxiv 2024 Project
GAIA-1: A Generative World Model for Autonomous Driving
DetailsDeveloped by Wayve, GAIA-1 is a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios. It serves as a valuable neural simulator for autonomous driving.
Arxiv 2023
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
DetailsThis is a multiview world model capable of generating high-quality, controllable, and consistent multiview videos in autonomous driving scenes. It also explores applications in end-to-end planning.
Arxiv 2023 Code / Project
TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction
DetailsThis research aims towards developing world models specifically for autonomous driving simulation and motion prediction.
ICRA 2023 Code
UniWorld: Autonomous Driving Pre-training via World Models
DetailsFocuses on autonomous driving pre-training via world models, suggesting that learning a world model can provide a foundational understanding for downstream AD tasks.
Arxiv 2023
Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models
DetailsWhile not explicitly a "world model" in the generative sense of predicting future full scenes, Occ-LLM uses occupancy representations with LLMs, which is a form of modeling the current state of the world for decision-making.
ICRA 2025

Hierarchical Planning and Control

Decomposes the complex driving task into multiple levels of abstraction, often with foundation models handling higher-level reasoning and planning, and other specialized modules executing lower-level control actions. This approach mirrors human cognitive strategies where high-level goals are broken down into manageable sub-tasks. In autonomous driving, this means LLMs or VLMs might determine a strategic maneuver (e.g., "prepare to change lanes and overtake"), which is then translated into a sequence of tactical actions (e.g., check mirrors, signal, adjust speed, steer) executed by a more traditional planner or controller. This layered approach allows for leveraging the strengths of foundation models in complex reasoning and language understanding, while relying on established methods for precise, real-time vehicle control, potentially offering a more robust and interpretable path to autonomy.
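A minimal sketch of the language interface between levels, assuming mid-level commands of the kind NaVILA emits (e.g., "turn right 30 degrees"); the grammar, regexes, and controller stub are illustrative assumptions.

```python
# Minimal sketch of the hierarchical pattern above: a high-level model emits a
# mid-level command in natural language, which a parser turns into arguments for
# a low-level controller. The command grammar is an assumption.
import re

def parse_command(cmd: str) -> tuple:
    """Parse commands like 'turn right 30 degrees' or 'move forward 0.75 meters'."""
    m = re.match(r"turn (left|right) (\d+(?:\.\d+)?) degrees", cmd)
    if m:
        sign = -1.0 if m.group(1) == "left" else 1.0
        return ("turn", sign * float(m.group(2)))
    m = re.match(r"move forward (\d+(?:\.\d+)?) meters", cmd)
    if m:
        return ("forward", float(m.group(1)))
    return ("stop", 0.0)  # conservative default for unparseable output

def low_level_step(kind: str, value: float) -> None:
    """Placeholder for an RL or classical controller executing the primitive."""
    print(f"executing {kind} with value {value}")

low_level_step(*parse_command("turn right 30 degrees"))
low_level_step(*parse_command("move forward 0.75 meters"))
```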

Method Introduction Year Project
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsThis VLA model employs a hierarchical vision-language alignment process. While aiming for end-to-end action generation, its internal mechanisms likely involve different levels of representation and processing for perception, reasoning, and action generation, contributing to spatially and behaviorally informed trajectory planning.
Arxiv 2025 Code
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
DetailsThis is a prime example of hierarchical control in a VLA framework. The high-level VLM processes visual input and natural language instructions to generate mid-level actions expressed in natural language (e.g., "turn right 30 degrees," "move forward 75cm"). These linguistic commands are then interpreted and executed by a separate low-level visual locomotion policy trained with reinforcement learning. This decouples complex reasoning from real-time motor control.
RSS 2025 Project
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Details This system uses an LLM for high-level decision-making (scenario encoding, action guidance). These high-level textual decisions are then translated into precise mathematical representations and parameters that guide a low-level Model Predictive Controller (MPC) responsible for the actual driving commands. This clearly separates strategic decision-making from operational control.
Arxiv 2023
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning
DetailsHere, a "teacher" LLM produces high-level driving strategies through chain-of-thought reasoning. These strategies then guide an attention-based "student" DRL policy, which handles the final decision-making and action execution. This represents a hierarchical guidance mechanism.
Arxiv 2025 Project
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Details DriveVLM integrates reasoning modules for scene description, scene analysis, and hierarchical planning. The DriveVLM-Dual system further exemplifies this by using low-frequency motion plans from the VLM as initial plans for a faster, traditional refining planner.
Arxiv 2024 Project
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
DetailsThis system incorporates a dual-process decision-making module. The Analytic Process (System-II, using an LLM) performs thorough analysis and reasoning, accumulating experience. This experience is then transferred to a lightweight Heuristic Process (System-I) for swift, empirical decision-making. This can be seen as a cognitive hierarchy.
NeurIPS 2024 Code
Mixing Left and Right-Hand Driving Data in a Hierarchical Framework With LLM Generation
DetailsWhile focused on data compatibility for trajectory prediction, this work proposes a hierarchical framework, suggesting that different levels of processing or adaptation might be needed when dealing with complex data sources, which can inform hierarchical planning approaches.
RAL 2024

Categorization by Application Field in Decision-Making

This section organizes papers based on the specific aspect of the autonomous driving decision-making pipeline they primarily address.

Perception-Informed Decision-Making

These works focus on how enhanced perception, often through LLMs/VLMs, directly informs or enables better downstream decision-making. This involves not just detecting objects but understanding their context, relationships, and potential impact on driving strategy.

Method Introduction Year Project
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
DetailsImproves multi-view driving reasoning by dynamically fusing visual features based on text queries. This enhanced scene understanding directly supports more informed decision-making by providing better contextual accuracy from multiple viewpoints.
Arxiv 2025
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsConditions driving actions on 3D environmental perception, ego vehicle states, and driver commands. The hierarchical vision-language alignment projects 2D and 3D visual tokens into a unified semantic space, enabling perception to directly guide trajectory generation.
Arxiv 2025 Code
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
DetailsProcesses multi-frame video inputs to interpret vehicle actions and offer reasoning. The decision to predict control signals is directly informed by its multimodal understanding of the visual scene and textual queries.
RAL 2024
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
DetailsThe VLM component is crucial for scene understanding, providing descriptions of critical objects that influence driving decisions. This perception output is the direct input to the dual-process decision-making module.
NeurIPS 2024 Code
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DetailsLeverages VLMs for enhanced scene understanding (scene description, scene analysis) which then feeds into its hierarchical planning modules. The perception of long-tail critical objects is a key focus.
Arxiv 2024 Project
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
DetailsCombines VLM recognition with SAM segmentation for open-ended object detection. While primarily a perception method, accurate detection and localization of all objects, including rare ones, is fundamental for safe decision-making.
NeurIPS 2024
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
DetailsIntegrates textual representations (driver attentional cues from VLMs) into Bird's-Eye-View (BEV) features for semantic supervision, enabling the model to learn richer feature representations that explicitly capture driver's attentional semantics, directly impacting driving decisions.
Arxiv 2025
HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving
DetailsFocuses on high-resolution understanding in MLLMs for AD, specifically for identifying, explaining, and localizing risk objects (ROLISP task), which is a critical perceptual input for safe decision-making.
IJCV 2025 Project
Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving
DetailsProvides a language-enhanced interface for BEV maps, allowing natural language queries to interpret complex driving scenes represented in BEV, thus informing situational awareness for decision-making.
ICRA 2024 Code / Project

Behavioral Planning & Prediction

Concerns predicting the future behavior of other road users (vehicles, pedestrians, cyclists) and planning the ego-vehicle's behavior in response, often involving understanding intentions and social interactions.

Method Introduction Year Project
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
DetailsThe LLM performs scenario encoding, which includes predicting future trajectories of other vehicles and selecting the most likely one. This predictive capability informs its high-level action guidance for the ego vehicle.
Arxiv 2023
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsModels dynamic relationships between the ego vehicle, surrounding agents, and static road elements through an autoregressive agent-env-ego interaction process. This explicit modeling of interactions is key for behavioral planning and prediction.
Arxiv 2025 Code
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
DetailsTailored for high-level planning in autonomous driving, which inherently involves deciding on driving behaviors (e.g., lane changes, yielding) based on the current scene and predicted future states. The emergent multimodal planning capabilities also hint at understanding complex interactions.
Arxiv 2025 Code
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DetailsIncludes modules for scene analysis that analyze possible intent-level behavior of critical objects, feeding into its hierarchical planning. This is directly related to predicting other agents' behaviors.
Arxiv 2024 Project
Empowering Autonomous Driving with Large Language Models: A Safety Perspective
Details Focuses on LLMs as intelligent decision-makers in behavioral planning, augmented with safety verifiers. This includes an LLM-enabled interactive behavior planning scheme.
ICLR 2024
A Language Agent for Autonomous Driving
DetailsThe LLM agent performs task planning and motion planning based on perceived and predicted states of the environment, including interactions with other agents.
COLM 2024 Code / Project
LC-LLM: Explainable Lane-Change Intention and Trajectory Predictions with Large Language Models
DetailsSpecifically uses LLMs for predicting lane-change intentions and trajectories, providing explainable predictions for this critical driving behavior.
CTR 2025
Large Language Models Powered Context-aware Motion Prediction in Autonomous Driving
DetailsLeverages GPT-4V to comprehend complex traffic scenarios and combines this contextual information with traditional motion prediction models (like MTR) to improve behavioral prediction.
IROS 2024 Code / Project

Motion Planning & Trajectory Generation

Focuses on generating safe, comfortable, and feasible paths or sequences of waypoints for the autonomous vehicle to follow.

Method Introduction Year Project
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Details While the LLM provides high-level decisions, these are translated to guide a Model Predictive Controller (MPC) which performs the fine-grained trajectory planning and control.
Arxiv 2023
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
DetailsEvaluates VLMs on the nuScenes prediction task, where predicted control actions are numerically integrated to produce a predicted trajectory, which is then compared against ground truth. This is a form of motion planning.
Arxiv 2025 Code
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
DetailsPredicts low-level vehicle control signals (speed, turning angle) in an end-to-end fashion, which implicitly defines a trajectory.
RAL 2024
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
DetailsSpecifically designed for autonomous driving planning, generating high-level plans that would then be refined into detailed trajectories. The rewards are tailored for planning accuracy and action importance.
Arxiv 2025 Code
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DetailsFeatures hierarchical planning modules. The DriveVLM-Dual system uses coarse, low-frequency waypoints from the VLM as an initial plan for a faster refining planner.
Arxiv 2024 Project
GPT-Driver: Learning to Drive with GPT
DetailsModels motion planning as a language modeling problem, leveraging the LLM to generate driving trajectories by aligning its output with human driving behavior.
NeurIPS 2023 Code / Project
GenAD: Generative End-to-End Autonomous Driving
DetailsModels autonomous driving as a trajectory generation problem using an instance-centric scene tokenizer and a variational autoencoder for trajectory prior modeling.
Arxiv 2024 Code
VLP: Vision Language Planning for Autonomous Driving
DetailsProposes a Vision Language Planning model composed of ALP (Action Localization and Prediction) and SLP (Safe Local Planning) components to improve ADS from BEV reasoning and decision-making for planning.
CVPR 2024
FutureSightDrive: Visualizing Trajectory Planning with Spatio-Temporal CoT for Autonomous Driving
DetailsThe paper presents FSDrive, a framework that enables autonomous vehicles to perform visual reasoning for trajectory planning using a spatio-temporal Chain-of-Thought (CoT). Instead of relying on abstract text-based logic, FSDrive uses a visual language model to generate future scene representations as images, capturing spatial and temporal dynamics. It introduces a lightweight pretraining method to activate image generation in existing models and employs these visual predictions as intermediate reasoning steps.
Arxiv 2025 Code

Direct Control Signal Generation

Involves models that directly output low-level control commands for the vehicle, such as steering angle and acceleration/braking.
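For intuition, a minimal sketch of turning predicted (speed, curvature) control actions into waypoints by numerically integrating a unicycle model, similar in spirit to the integration step described for LightEMMA below; the time step and vehicle model are assumptions.

```python
# Minimal sketch of integrating a sequence of predicted (speed, curvature)
# control actions into waypoints; time step and model are illustrative.
import numpy as np

def integrate_controls(actions, dt: float = 0.5):
    """actions: iterable of (speed m/s, curvature 1/m); returns (T, 2) waypoints."""
    x, y, heading = 0.0, 0.0, 0.0
    waypoints = []
    for v, c in actions:
        heading += v * c * dt          # yaw rate = speed * curvature
        x += v * np.cos(heading) * dt
        y += v * np.sin(heading) * dt
        waypoints.append((x, y))
    return np.array(waypoints)

print(integrate_controls([(5.0, 0.0), (5.0, 0.02), (5.0, 0.02)]))
```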

Method Introduction Year Project
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
Details Explicitly states that it predicts low-level vehicle control signals (vehicle speed and turning angle) in an end-to-end fashion.
RAL 2024
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
DetailsThe final stage of its Chain-of-Thought prompting explicitly outputs a sequence of predicted control actions (e.g., [(v1, c1), (v2, c2),...]) which are then numerically integrated to produce the predicted trajectory.
Arxiv 2025 Code
ADAPT: Action-aware Driving Caption Transformer
DetailsJointly trains a vehicular control signal prediction task alongside driving captioning. The CSP head predicts control signals (e.g., speed, acceleration) based on video frames.
ICRA 2023 Code

Human-AI Interaction & Command Understanding

Focuses on enabling autonomous vehicles to understand and respond to human language commands, preferences, or queries, facilitating more natural and intuitive interaction.

Method Introduction Year Project
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning
Details
Arxiv 2025 Project
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
DetailsCapable of processing textual queries from users and providing natural language responses, such as describing vehicle actions or explaining reasoning.
RAL 2024
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
DetailsAllows driving behavior adjustment (e.g., conservative vs. aggressive) based on textual inputs from users or high-precision maps.
Arxiv 2023
Dolphins: Multimodal Language Model for Driving
DetailsDeveloped as a VLM-based conversational driving assistant, capable of understanding and responding to human interaction.
Arxiv 2023 Code / Project
Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving
DetailsProvides a large vision-language model interface for bird’s-eye view maps, allowing users to interpret driving contexts through freeform natural language queries.
Arxiv 2023 Code / Project
Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
DetailsIntroduces an LLM-based framework to process verbal commands from humans and make autonomous driving decisions that satisfy personalized preferences for safety, efficiency, and comfort.
WACVW 2024
Human-Centric Autonomous Systems With LLMs for User Command Reasoning
DetailsProposes leveraging LLMs' reasoning capabilities to infer system requirements from in-cabin users’ commands, using the UCU Dataset.
WACVW 2024 Code
ChatGPT as Your Vehicle Co-Pilot: An Initial Attempt
DetailsDesigns a universal framework embedding LLMs as a vehicle "Co-Pilot" to accomplish specific driving tasks based on human intention and provided information.
TIV 2023
LingoQA: Visual Question Answering for Autonomous Driving
DetailsWhile a benchmark, its focus on video question answering (including action justification and scene description) directly supports the development of systems that can interactively explain their understanding and decisions to humans.
ECCV 2024 Code
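
To make the command-understanding pattern above concrete, here is a minimal, hedged sketch of how a verbal passenger command could be mapped to a structured driving-style adjustment through an LLM prompt, in the spirit of the Drive as You Speak and LanguageMPC entries. The JSON schema, field names, and the `llm_call` wrapper are illustrative assumptions, not interfaces from any of the listed papers.

```python
import json

# Hypothetical prompt template: maps an in-cabin verbal command to a
# structured driving-style adjustment. Field names are illustrative only.
PROMPT = """You are the decision module of an autonomous vehicle.
The passenger said: "{command}"
Return ONLY a JSON object with keys:
  "style": one of ["conservative", "normal", "aggressive"],
  "target_gap_s": desired time gap to the lead vehicle in seconds,
  "max_speed_kph": speed cap in km/h,
  "explanation": one short sentence justifying the choice.
"""

def interpret_command(command: str, llm_call) -> dict:
    """llm_call is any function str -> str, e.g. a wrapper around an LLM API."""
    raw = llm_call(PROMPT.format(command=command))
    return json.loads(raw)

def fake_llm(prompt: str) -> str:
    # Canned response standing in for a real model call (offline example).
    return ('{"style": "conservative", "target_gap_s": 2.5, '
            '"max_speed_kph": 90, "explanation": "Passenger asked for a calmer ride."}')

print(interpret_command("Please drive a bit more gently, I feel carsick.", fake_llm))
```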

Categorization by Technical Route

This section classifies papers based on the core AI techniques or methodologies they employ or innovate upon.

Transformer Architectures & Variants

Many LLMs, VLMs, and VLAs are based on the Transformer architecture, known for its efficacy in handling sequential data and capturing long-range dependencies.

Method Introduction Year Project
ADAPT: Action-aware Driving Caption Transformer
DetailsExplicitly an end-to-end transformer-based architecture. It uses a Video Swin Transformer as the visual encoder and vision-language transformers for text generation and motion prediction.
ICRA 2023 Code
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning
DetailsWhile the core "teacher" is an LLM (typically Transformer-based), it guides an attention-based Student DRL policy. Self-attention mechanisms are integral to Transformers.
Arxiv 2025 Project
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
DetailsLeverages Large Language Models (LLMs), which are predominantly Transformer-based, for high-level decision-making.
Arxiv 2023
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
DetailsWhile focusing on a novel pooling module, it's designed for Vision-Language Models, many of which have Transformer backbones for either vision, language, or fusion. The paper contrasts its pooling with costly attention mechanisms (core to Transformers).
Arxiv 2025
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
DetailsEvaluates various state-of-the-art VLMs, the majority of which (e.g., GPT-4o, Gemini, Claude, LLaMA-3.2-Vision, Qwen2.5-VL) are Transformer-based.
Arxiv 2025 Code
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsBuilds upon open-source pre-trained large Vision-Language Models (VLMs) and language foundation models, which are typically Transformer architectures.
Arxiv 2025 Code
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
Details Based on Multimodal Large Language Models (MLLMs), which are extensions of Transformer-based LLMs.
RAL 2024
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
DetailsThe Analytic Process uses an LLM (Transformer-based), and the Heuristic Process uses a lightweight language model, which could also be Transformer-based.
NeurIPS 2024 Code
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
DetailsProposes a VLM (typically Transformer-based) tailored for high-level planning.
Arxiv 2025 Code
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DetailsUses Vision Transformers (ViT) as the image tokenizer and Qwen (a Transformer-based LLM) as the LLM backbone.
Arxiv 2024 Project
3D-VLA: A 3D Vision-Language-Action Generative World Model
DetailsBuilt on top of a 3D-based Large Language Model (LLM), which implies a Transformer architecture.
Arxiv 2024 Code
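
For readers unfamiliar with the underlying mechanism, here is a minimal NumPy sketch of the scaled dot-product self-attention shared by the Transformer backbones listed above; it omits multi-head projections, masking, and positional encodings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: the core operation inside Transformer blocks."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted mix of values

# Toy example: 4 tokens (e.g., BEV or text embeddings) of dimension 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # (4, 8)
```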

Multimodal Fusion Techniques

These works explore or propose novel ways to combine information from different modalities (e.g., vision, language, LiDAR, radar, vehicle states) for improved decision-making. Effective fusion is critical for VLMs and VLAs. The challenge in multimodal fusion lies in effectively aligning and integrating information from disparate sources, such as pixel-level visual data and semantic language features. This becomes even more complex when dealing with multiple sensor inputs (cameras, LiDAR, radar) and dynamic temporal information inherent in driving.

Method Introduction Year Project
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
Details Introduces Text-Guided SoftSort Pooling (TGSSP) as a novel method for dynamic and query-aware multi-view visual feature aggregation, aiming for more efficient fusion than costly attention mechanisms.
Arxiv 2025
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DetailsProposes a hierarchical vision-language alignment process to project both 2D and 3D structured visual tokens into a unified semantic space, explicitly addressing the modality gap for language-guided trajectory generation.
Arxiv 2025 Code
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
DetailsAs an MLLM, it inherently processes and fuses multi-frame video inputs with textual queries to inform its reasoning and control signal prediction. It tokenizes video sequences and text/control signals.
RAL 2024
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
DetailsFocuses on multimodal driver attention fusion. It integrates textual representations (attentional cues from VLMs) into Bird's-Eye-View (BEV) features for semantic supervision and introduces a BEV-Text learnable weighted fusion strategy to balance contributions from visual and textual modalities.
Arxiv 2025
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DetailsThe VLM architecture takes multiple image frames as input and uses Qwen as the LLM backbone, implying fusion of visual features with the language model's processing. The DriveVLM-Dual system also fuses perception information from the VLM with a traditional AV 3D perception module.
Arxiv 2024 Project
3D-VLA: A 3D Vision-Language-Action Generative World Model
DetailsLinks 3D perception (point clouds, images, depth) with language and action through a generative world model. It uses a projector to efficiently align LLM output features with diffusion models for generating multimodal goal states.
Arxiv 2024 Code
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models
DetailsProposes integrating instruction-aware BEV features with existing MLLMs to improve holistic understanding.
CVPR 2024
COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems
DetailsWhile a general pre-training approach for autonomous systems, COMPASS constructs a multimodal graph to connect signals from different modalities (e.g., camera, LiDAR, IMU, odometry) and maps them into factorized spatio-temporal latent spaces for motion and state representation. This is relevant for learning fused representations for decision-making.
IROS 2022 Code
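
As a lightweight illustration of modality fusion, the sketch below shows a learnable weighted fusion between pooled BEV features and text features, loosely inspired by the BEV-Text weighted fusion idea described for VLM-E2E above. The projection sizes and the single scalar gate are our own simplifications, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedBEVTextFusion(nn.Module):
    """Minimal sketch: learnable balance between BEV and text features."""

    def __init__(self, bev_dim: int, text_dim: int, fused_dim: int):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.gate = nn.Parameter(torch.tensor(0.0))  # sigmoid(0) = 0.5 at init

    def forward(self, bev_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.gate)                 # learnable weight in (0, 1)
        return w * self.bev_proj(bev_feat) + (1 - w) * self.text_proj(text_feat)

# Toy usage: a batch of 2 pooled BEV features (256-d) and text features (512-d).
fusion = GatedBEVTextFusion(bev_dim=256, text_dim=512, fused_dim=128)
fused = fusion(torch.randn(2, 256), torch.randn(2, 512))
print(fused.shape)  # torch.Size([2, 128])
```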

Prompt Engineering

Focuses on designing effective prompts to guide foundation models, especially LLMs and VLMs, to elicit desired reasoning processes and outputs for decision-making tasks. Chain-of-Thought (CoT) prompting, which encourages models to generate intermediate reasoning steps, is a prominent technique in this area. This method simulates human-like reasoning by breaking down complex problems into a sequence of manageable steps, leading to more accurate and transparent outputs, particularly for tasks requiring multi-step reasoning.

Method Introduction Year Project
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning
DetailsExplicitly uses Chain-of-Thought (CoT) reasoning in its Teacher LLM. The LLM incorporates risk metrics, historical scenario retrieval, and domain heuristics into context-rich prompts to produce high-level driving strategies. The CoT approach helps the model iteratively evaluate collision severity, maneuver consequences, and broader traffic implications, reducing logical inconsistencies.
Arxiv 2025 Project
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
DetailsEmploys a Chain-of-Thought (CoT) prompting strategy for its VLM-based autonomous driving agents. This is used to enhance interpretability and facilitate structured reasoning, with the final stage of the CoT explicitly outputting a sequence of predicted control actions.
Arxiv 2025 Code
A Language Agent for Autonomous Driving
DetailsThe reasoning engine of this LLM-based agent is capable of chain-of-thought reasoning, among other capabilities like task planning and motion planning.
COLM 2024 Code / Project
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
DetailsLeverages VLMs as teachers to generate reasoning-based text annotations. The annotation process involves prompting a VLM (GPT-4o) with visual input (front-view image with projected future trajectory) and specific instructions to interpret the scenario, generate reasoning, and identify ego-vehicle actions.
Arxiv 2024
Large Language Models Powered Context-aware Motion Prediction in Autonomous Driving
DetailsDesigns prompt engineering strategies that enable GPT-4V, without fine-tuning, to comprehend complex traffic scenarios for motion prediction.
IROS 2024 Code
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
DetailsThis work prompts the VLM to generate chain-of-thought (CoT) reasoning to enable efficient exploration of intermediate reasoning steps that lead to the final text-based action in an RL framework.
NeurIPS 2024 Code / Project
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
DetailsDesigns prompts for GPT-4 to generate coherent Q&A data for driving tasks by using simulated trajectories for counterfactual reasoning to identify key traffic elements and assess outcomes.
Arxiv 2024 Code
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
DetailsThe paper proposes Drive-R1, a domain-specific vision-language model for autonomous driving that integrates structured reasoning and trajectory planning via reinforcement learning. It targets two limitations of current VLM-based planning systems: over-reliance on textual history rather than visual input, and poor alignment between reasoning and planning. By combining supervised fine-tuning on a curated chain-of-thought dataset with Group Relative Policy Optimization, Drive-R1 improves trajectory accuracy and safety. Experiments on nuScenes and DriveLM-nuScenes report modest gains over strong baselines, with the main contribution lying in system integration rather than fundamentally new algorithms.
Arxiv 2025
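
As a small illustration of CoT prompting for driving decisions, the skeleton below forces intermediate reasoning steps before a final maneuver choice. The step list, the action vocabulary, and the scene-description format are our own illustrative choices; each listed paper uses its own prompt design.

```python
# Illustrative Chain-of-Thought prompt skeleton for a driving decision query.
COT_PROMPT = """You control the ego vehicle. Current scene:
{scene_description}

Reason step by step before answering:
1. List the road users that could interact with the ego vehicle.
2. For each, estimate the risk of the intended maneuver.
3. Consider at least one alternative maneuver and compare consequences.
4. Only then output the final decision.

Final answer format: one of [KEEP_LANE, CHANGE_LEFT, CHANGE_RIGHT, DECELERATE, STOP],
followed by a one-sentence justification.
"""

scene = ("Two-lane highway. Lead vehicle 25 m ahead braking at 2 m/s^2. "
         "Vehicle in the left lane 10 m behind, closing at 3 m/s.")
print(COT_PROMPT.format(scene_description=scene))
```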

Knowledge Distillation & Transfer Learning

Involves transferring knowledge from larger, more capable models (such as powerful proprietary LLMs/VLMs) or from diverse data sources to smaller, more efficient models suitable for deployment in autonomous vehicles. It also covers adapting models trained in one domain (e.g., general web text and images) to the specific domain of autonomous driving.

Method Introduction Year Project
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
DetailsEmploys a two-stage planning reasoning training strategy that explicitly involves knowledge distillation. In the first stage, a large model (e.g., GPT-4o) generates a high-quality dataset of planning reasoning processes, which is then used to fine-tune the AlphaDrive model via Supervised Fine-Tuning (SFT), effectively distilling knowledge from the larger model.
Arxiv 2025 Code
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
DetailsThe Analytic Process (System-II), which uses a powerful LLM, accumulates linguistic driving experience. This experience is then transferred to the lightweight language model of the Heuristic Process (System-I) through supervised fine-tuning. This is a form of knowledge transfer from a more capable reasoning system to a more efficient one.
NeurIPS 2024 Code
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
DetailsThis method explicitly uses VLMs as "teachers" to automatically generate reasoning-based text annotations. These annotations then serve as supplementary supervisory signals to train end-to-end AD models. This process distills driving reasoning knowledge from the VLM to the student E2E model.
Arxiv 2024
Domain Knowledge Distillation from Large Language Model: An Empirical Study in the Autonomous Driving Domain
DetailsThis paper directly investigates distilling domain knowledge from LLMs (specifically ChatGPT) for the autonomous driving domain, developing a web-based distillation assistant.
ITSC 2023
Mixing Left and Right-Hand Driving Data in a Hierarchical Framework With LLM Generation
DetailsWhile focused on data compatibility, this work uses an LLM-based sample generation method and techniques like MMD to reduce the domain gap between datasets from different driving rule domains (left-hand vs. right-hand drive). This can be seen as a form of domain adaptation or transfer learning for trajectory prediction models.
RAL 2024
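
The teacher-to-student pattern used by several entries above (e.g., AlphaDrive and VLM-AD) can be summarized as: query a large teacher model for reasoning annotations, then fine-tune a smaller student on the resulting (scene, reasoning, action) records. The sketch below builds such an SFT corpus; `teacher_annotate` is a stand-in for a real teacher VLM/LLM call, and the record format is an assumption on our part.

```python
import json

def teacher_annotate(scene: str) -> dict:
    # Placeholder: a real pipeline would query a large teacher VLM/LLM here.
    return {"reasoning": "Lead vehicle is braking; keeping distance is safest.",
            "action": "DECELERATE"}

def build_sft_corpus(scenes, path="distilled_sft.jsonl"):
    """Write one prompt/response pair per scene for supervised fine-tuning."""
    with open(path, "w") as f:
        for scene in scenes:
            ann = teacher_annotate(scene)
            record = {
                "prompt": f"Scene: {scene}\nThink step by step, then act.",
                "response": f"{ann['reasoning']}\nAction: {ann['action']}",
            }
            f.write(json.dumps(record) + "\n")

build_sft_corpus(["Highway, lead vehicle braking 25 m ahead."])
```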

📊Datasets and Benchmarks

  • nuScenes : A large-scale multimodal dataset widely used across AD tasks, including 3D object detection, tracking, and prediction. It provides a full sensor suite (cameras, LiDAR, radar) together with map information. Works such as LightEMMA, OpenDriveVLA, and GPT-Driver use nuScenes for evaluation or data generation, and it also underpins tasks like BEV retrieval and dense captioning. A minimal loading sketch follows this list.
  • BDD-X (Berkeley DeepDrive eXplanation) : This dataset provides textual explanations for driving actions, making it particularly relevant for training and evaluating interpretable AD models. DriveGPT4 and ADAPT are evaluated on BDD-X. It contains video sequences with corresponding control signals and natural language narrations/reasoning.
  • Waymo Open Dataset / Waymo Open Motion Dataset (WOMD) : A large and diverse dataset suite with high-resolution sensor data, including LiDAR and camera imagery. Used in works like OmniDrive for Q&A data generation and by Large Language Models Powered Context-aware Motion Prediction, as well as for scene simulation in ChatSim. WOMD-Reasoning is a language dataset built on WOMD that focuses on interaction descriptions and driving intentions.
  • DriveLM : A benchmark and dataset focusing on driving with graph visual question answering. TS-VLM is evaluated on DriveLM. It aims to assess perception, prediction, and planning reasoning through QA pairs in a directed graph, with versions for CARLA and nuScenes.
  • LingoQA : A benchmark and dataset specifically designed for video question answering in autonomous driving. It contains over 419k QA pairs from 28k unique video scenarios, covering driving reasoning, object recognition, action justification, and scene description. It also proposes the Lingo-Judge evaluation metric.
  • DriveAction : Leverages real-world driving data actively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage. It provides high-level discrete behavior labels taken directly from users' actual driving operations and implements a behavior-based, tree-structured evaluation framework that explicitly links vision, language, and behavioral tasks to support comprehensive, task-specific evaluation.
  • CARLA Simulator & Datasets : While a simulator, CARLA is extensively used to generate data and evaluate AD models in closed-loop settings. Works like LeapAD , LMDrive , and LangProp use CARLA for experiments and data collection. DriveLM-Carla is a specific dataset generated using CARLA.
  • Argoverse : A dataset suite with a focus on motion forecasting, 3D tracking, and HD maps. Argoverse 2 is used in challenges like 3D Occupancy Forecasting.
  • KITTI : One of the pioneering datasets for autonomous driving, still used for tasks like 3D object detection and tracking.
  • Cityscapes : Focuses on semantic understanding of urban street scenes, primarily for semantic segmentation.
  • UCU Dataset (In-Cabin User Command Understanding) : Part of the LLVM-AD Workshop, this dataset contains 1,099 labeled user commands for autonomous vehicles, designed for training models to understand human instructions within the vehicle.
  • MAPLM (Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding) : Also from the LLVM-AD Workshop, MAPLM combines point cloud BEV and panoramic images for rich road scenario images and multi-level scene description data, used for QA tasks.
  • NuPrompt : A large-scale language prompt set based on nuScenes for driving scenes, consisting of 3D object-text pairs, used in Prompt4Driving.
  • nuDesign : A large-scale dataset (2300k sentences) constructed upon nuScenes via a rule-based auto-labeling methodology for 3D dense captioning.
  • LaMPilot : An interactive environment and dataset designed for evaluating LLM-based agents in a driving context, containing scenes for command tracking tasks.
  • DRAMA (Joint Risk Localization and Captioning in Driving) : Provides linguistic descriptions (with a focus on reasons) of driving risks associated with important objects.
  • Rank2Tell : A multimodal ego-centric dataset for ranking importance levels of objects/events and generating textual reasons for the importance.
  • HighwayEnv : A collection of environments for autonomous driving and tactical decision-making research, often used for RL-based approaches and LLM decision-making evaluations (e.g., by DiLu, MTD-GPT).
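
As referenced in the nuScenes entry above, here is a minimal loading sketch using the official nuscenes-devkit (pip install nuscenes-devkit). It assumes the v1.0-mini split has been downloaded locally; adjust `dataroot` to your own setup.

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split (assumed local path) and inspect the first keyframe.
nusc = NuScenes(version="v1.0-mini", dataroot="/data/sets/nuscenes", verbose=True)

first_sample = nusc.sample[0]                      # one annotated keyframe
cam_token = first_sample["data"]["CAM_FRONT"]      # front-camera sample_data token
cam_data = nusc.get("sample_data", cam_token)
print(cam_data["filename"])                        # relative path to the image
```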

🧾Other Awesome Lists

These repositories offer broader collections of resources that may overlap with or complement the focus of this list.

Citation

If you find this repository useful, please consider citing this list:

@misc{liu2025lm4decisionofad,
    title = {Awesome-LM-AD-Decision},
    author = {Jiaqi Liu and Chengkai Xu},
    howpublished = {GitHub repository},
    url = {https://github.com/Jiaaqiliu/Awesome-LM-AD-Decision},
    year = {2025},
}
