A comprehensive list of awesome research, resources, and tools for leveraging Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action (VLA) models in autonomous driving decision-making and motion planning. Contributions are welcome!
- Survey papers
- Research Papers
- Datasets and Benchmarks
- Other Awesome Lists
This section further elaborates on papers based on the primary type of foundation model employed for decision-making.
LLMs are primarily leveraged for their reasoning, planning, and natural language understanding/generation capabilities to guide autonomous driving decisions. A dominant trend in this area is the hybridization or agentic use of LLMs. Pure LLM-driven control is rare due to challenges in real-time performance, safety assurance, and precise numerical output. Instead, LLMs often function at a strategic or tactical level, acting as a "supervisor," "planner," or "reasoner" that guides more traditional or specialized modules like DRL agents or MPC controllers. This hierarchical system leverages the LLM's strengths in high-level reasoning and contextual understanding while offloading operational, real-time aspects to other components. This approach aims to augment specific parts of the AD stack, particularly those requiring human-like commonsense, reasoning about novel situations, or providing interpretable justifications for actions.
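As a toy illustration of this hierarchical pattern, the sketch below has an LLM choose a discrete maneuver that is then mapped to cost weights for a downstream controller. The `query_llm` callable, the maneuver vocabulary, and the weight values are illustrative assumptions, not taken from any specific paper listed here.

```python
MANEUVER_TO_MPC_WEIGHTS = {
    # maneuver:           weights for (speed tracking, lane keeping, comfort)
    "FOLLOW_LANE":      dict(w_speed=1.0, w_lane=2.0, w_jerk=0.5),
    "LANE_CHANGE_LEFT": dict(w_speed=0.8, w_lane=0.5, w_jerk=1.0),
    "YIELD_AND_SLOW":   dict(w_speed=0.2, w_lane=2.0, w_jerk=2.0),
}

def decide(scene_description: str, query_llm) -> dict:
    """Ask the LLM for a strategic maneuver, then hand tuned weights to the MPC."""
    prompt = (
        "You are a driving strategist. Given the scene below, answer with exactly "
        f"one of {list(MANEUVER_TO_MPC_WEIGHTS)}.\nScene: {scene_description}"
    )
    maneuver = query_llm(prompt).strip()
    # Fall back to a conservative choice if the LLM returns an unknown maneuver.
    weights = MANEUVER_TO_MPC_WEIGHTS.get(maneuver, MANEUVER_TO_MPC_WEIGHTS["YIELD_AND_SLOW"])
    return {"maneuver": maneuver, "mpc_weights": weights}

# Example with a stubbed LLM:
print(decide("Slow truck ahead, left lane clear.", lambda p: "LANE_CHANGE_LEFT"))
```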
Method | Introduction | Year | Project |
---|---|---|---|
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning | DetailsThis work proposes a hybrid framework where a "teacher" LLM guides an attention-based "student" Deep Reinforcement Learning (DRL) policy. The LLM utilizes Chain-of-Thought (CoT) reasoning, incorporates risk metrics, and retrieves historical scenarios to produce high-level driving strategies. This approach aims to improve the DRL agent's sample complexity and robustness while ensuring real-time feasibility, a common challenge for LLMs when used in isolation for decision-making. |
Arxiv 2025 | Project |
DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning | DetailsDSDrive is a lightweight end-to-end (E2E) autonomous driving framework that distills the reasoning capabilities of large vision-language models (VLMs) into a compact LLM-based multimodal model, and unifies reasoning and planning through a waypoint-driven dual-head coordination module. It significantly reduces computational requirements while maintaining strong performance, offering an efficient and explainable solution for resource-constrained autonomous driving systems. |
Arxiv 2025 | |
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | DetailsDiLu proposes a knowledge-driven framework for autonomous driving that combines Reasoning and Reflection modules within an LLM. This enables the system to make decisions based on common-sense knowledge and to continuously evolve its understanding and strategies through experience. |
ICLR 2024 | Code / Project |
Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles | Details |
MITS 2024 | |
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsThis system employs an LLM as a high-level decision-making component, particularly for complex AD scenarios that demand human commonsense understanding. The LLM processes environmental data through scenario encoding, provides action guidance, and adjusts confidence levels. These high-level decisions are then translated into precise parameters for a low-level Model Predictive Controller (MPC), thereby enhancing interpretability and enabling the system to handle complex maneuvers, including multi-vehicle coordination. |
Arxiv 2023 | Project |
A Language Agent for Autonomous Driving | DetailsThis framework positions an LLM as a cognitive agent. The agent has access to a versatile tool library (for perception and prediction tasks), a cognitive memory storing commonsense knowledge and past driving experiences, and a reasoning engine. The reasoning engine is capable of chain-of-thought reasoning, task planning, motion planning, and self-reflection, showcasing a more integrated and sophisticated agentic approach to autonomous driving. |
COLM 2024 | Project |
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | Details |
ICRA 2024 | Code |
Drive Like a Human: Rethinking Autonomous Driving with Large Language Models | Details |
Arxiv 2023 | Code |
Empowering autonomous driving with large language models: A safety perspective | DetailsThis research explores the integration of LLMs as intelligent decision-makers within the behavioral planning module of AD systems. A key feature is the augmentation of LLMs with a safety verifier shield, which facilitates contextual safety learning. The paper presents studies on an adaptive LLM-conditioned MPC and an LLM-enabled interactive behavior planning scheme using a state machine, demonstrating improved safety metrics. |
ICLR 2024 | |
Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM | DetailsThis framework introduces an interpretable decision-maker that leverages a Traffic Regulation Retrieval (TRR) Agent, built upon Retrieval-Augmented Generation (RAG). This agent automatically retrieves relevant traffic rules and guidelines from extensive documents. An LLM-powered reasoning module then interprets these rules, differentiates between mandatory regulations and safety guidelines, and assesses actions for legal compliance and safety, enhancing transparency. |
Arxiv 2024 | |
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning | Details |
Arxiv 2025 | Project |
PADriver: Towards Personalized Autonomous Driving | Details |
Arxiv 2025 | |
CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting | Details |
Arxiv 2025 | |
LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models | Details |
Arxiv 2025 | |
Distilling Multi-modal Large Language Models for Autonomous Driving | Details |
Arxiv 2025 |
VLMs bring visual understanding to the decision-making process, allowing for richer interpretations of the driving scene and enabling actions based on both visual percepts and linguistic instructions or reasoning. A key challenge for VLMs in AD decision-making is bridging the gap between their often 2D-centric visual-linguistic understanding and the precise, 3D spatio-temporal reasoning essential for safe driving. Many current VLMs are adapted from models pre-trained on large, static 2D image-text datasets, and they can struggle with the dynamic, three-dimensional nature of real-world driving scenarios. As a result, a model that excels at describing a scene may still fall short on actual driving tasks. Effective VLMs for AD decision-making will likely need to incorporate stronger 3D visual backbones, improved mechanisms for temporal modeling beyond simple frame concatenation, and potentially integrate more structured environmental representations like Bird's-Eye-View (BEV) maps or scene graphs directly into their reasoning processes.
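As a rough sketch of how such a model is typically queried at the tactical level, the snippet below asks a VLM for a structured decision and falls back to a conservative action if the free-form output violates the schema. The `vlm` callable and the JSON schema are placeholders, not any specific model's API.

```python
import json

PROMPT = (
    "You see front, left, and right camera views of the ego vehicle. "
    'Reply ONLY with JSON: {"action": one of ["keep_lane", "slow_down", '
    '"change_left", "change_right"], "reason": short string}.'
)

def vlm_decision(camera_frames, vlm):
    raw = vlm(camera_frames, PROMPT)
    try:
        out = json.loads(raw)
        assert out["action"] in {"keep_lane", "slow_down", "change_left", "change_right"}
        return out
    except (ValueError, KeyError, AssertionError):
        # VLM output is free-form text and may violate the schema; default safely.
        return {"action": "slow_down", "reason": "unparseable VLM output"}

# Example with a stubbed VLM:
print(vlm_decision([], lambda imgs, p: '{"action": "keep_lane", "reason": "clear road"}'))
```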
Method | Introduction | Year | Project |
---|---|---|---|
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | DetailsLightEMMA is a lightweight, end-to-end multimodal framework for autonomous driving that integrates and evaluates current commercial and open-source VLMs within a common planning pipeline, studying the role and limitations of VLMs in driving tasks to inform their further development for autonomous driving. |
Arxiv 2025 | Code |
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Details |
Arxiv 2025 | Code |
X-Driver: Explainable Autonomous Driving with Vision-Language Models | Details |
Arxiv 2025 | |
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | DetailsAlphaDrive is a VLM tailored for high-level planning in autonomous driving. It integrates a Group Relative Policy Optimization (GRPO)-based reinforcement learning strategy with a two-stage reasoning training approach (Supervised Fine-Tuning followed by RL) to boost planning performance and training efficiency. |
Arxiv 2025 | Code |
Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | Details |
Arxiv 2025 | |
VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving | Details |
ICML 2025 | |
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion | DetailsThis framework aims to enhance end-to-end autonomous driving by using VLMs to provide driver attentional cues. It integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, enabling the model to learn richer feature representations that capture driver attentional semantics. It also introduces a BEV-Text learnable weighted fusion strategy. |
Arxiv 2025 | |
VLM-Assisted Continual learning for Visual Question Answering in Self-Driving | Details |
Arxiv 2025 | |
WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model | Details |
Arxiv 2024 | Code / Project |
CALMM-Drive: Confidence-Aware Autonomous Driving with Large Multimodal Model | Details |
Arxiv 2024 | |
OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving | Details |
WACV 2025 | Code |
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision | DetailsThis method positions VLMs as "teachers" to generate reasoning-based text annotations and structured action labels. These annotations serve as supplementary supervisory signals for training end-to-end AD models, aiming to improve their understanding beyond simple trajectory labels without requiring the VLM at inference time. |
Arxiv 2024 | |
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning | DetailsThis work introduces a lightweight VLM designed for efficient multi-view reasoning in autonomous driving. It features a novel Text-Guided SoftSort Pooling (TGSSP) module that dynamically ranks and fuses visual features from multiple camera views based on the semantics of input queries. This query-aware aggregation aims to improve contextual accuracy and reduce computational overhead, making it more practical for real-time deployment. |
Arxiv 2025 | |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsThis Multimodal Large Language Model (MLLM) is designed for interpretable end-to-end autonomous driving. It processes multi-frame video inputs and textual queries to interpret vehicle actions, provide relevant reasoning, and predict low-level control signals. A bespoke visual instruction tuning dataset aids its capabilities. |
RAL 2024 | Project |
ADAPT: Action-aware Driving Caption Transformer | DetailsWhile primarily a transformer architecture, ADAPT functions similarly to a VLM by generating natural language narrations and reasoning for driving actions based on video input. It jointly trains the driving captioning task and the vehicular control prediction task, enhancing interpretability in decision-making. |
ICRA 2023 | Code |
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving | DetailsThis system uses a VLM for scene understanding, specifically to provide descriptions of critical objects that may influence driving decisions. This visual-linguistic understanding then feeds into a dual-process decision-making module composed of an Analytic LLM and a Heuristic lightweight language model. |
NeurIPS 2024 | Code |
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | DetailsThis system leverages VLMs for enhanced scene understanding and planning capabilities. It also proposes DriveVLM-Dual, a hybrid system that combines the strengths of DriveVLM with traditional AD pipelines to address VLM limitations in spatial reasoning and computational requirements, particularly for long-tail critical objects. |
Arxiv 2024 | Project |
LingoQA: Visual Question Answering for Autonomous Driving | DetailsThis project introduces a benchmark and a large dataset (419.9k QA pairs from 28K unique video scenarios) for video question answering specifically in the autonomous driving domain. It focuses on evaluating a VLM's ability to perform reasoning, justify actions, and describe scenes, and proposes the Lingo-Judge metric for evaluation. |
ECCV 2024 | Code |
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts | DetailsThis training-free framework combines a VLM (for generalized object recognition, e.g., recognizing rare objects in AD scenarios) with the Segment-Anything Model (SAM, for generalized object localization). It uses attention maps from the VLM as prompts for SAM to address open-ended object detection and segmentation, which is crucial for robust perception feeding into decision-making systems. |
NeurIPS 2024 | |
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation | Details |
Arxiv 2025 | Code |
Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving | Details |
Arxiv 2025 | |
FutureSightDrive: Visualizing Trajectory Planning with Spatio-Temporal CoT for Autonomous Driving | DetailsThe paper presents FSDrive, a framework that enables autonomous vehicles to perform visual reasoning for trajectory planning using a spatio-temporal Chain-of-Thought (CoT). Instead of relying on abstract text-based logic, FSDrive uses a visual language model to generate future scene representations as images, capturing spatial and temporal dynamics. It introduces a lightweight pretraining method to activate image generation in existing models and employs these visual predictions as intermediate reasoning steps. |
Arxiv 2025 | Code |
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning | DetailsThe article proposes Drive-R1, a domain-specific vision-language model for autonomous driving that integrates structured reasoning and trajectory planning using reinforcement learning. It addresses two key limitations in current VLM-based planning systems: overdependence on textual history over visual input, and poor alignment between reasoning and planning. By combining supervised fine-tuning on a curated chain-of-thought dataset with Group Relative Policy Optimization, Drive-R1 improves trajectory accuracy and safety. Experimental results on nuScenes and DriveLM-nuScenes show marginal gains over strong baselines. While methodologically sound, the contribution is incremental—the core advancement lies more in system integration than in fundamental algorithmic innovation. |
Arxiv 2025 |
VLAs aim to create more generalist agents that can perceive, reason, and act, often in an end-to-end fashion. For autonomous driving, this means models that can take raw sensor data and high-level goals to produce driving actions. While VLAs offer the promise of true end-to-end decision-making by unifying perception, reasoning, and action generation, their application in safety-critical autonomous driving faces a significant hurdle: ensuring the reliability and verifiability of actions generated by these complex, often black-box, generative models. The potential for "hallucinated" or unexpected outputs from generative models is a recurring concern. A major research direction for VLAs in AD will involve developing methods for safety validation, uncertainty quantification, and robust fallback mechanisms. This might include hybrid approaches where VLA outputs are monitored or constrained by traditional safety layers, or novel training paradigms that explicitly optimize for safety and predictability.
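A minimal sketch of the "safety layer over a generative planner" idea, assuming a VLA that proposes a trajectory as an array of waypoints; the kinematic and clearance thresholds and the fallback policy are illustrative only.

```python
import numpy as np

DT, A_MAX, MIN_CLEARANCE = 0.1, 4.0, 2.0  # step [s], accel limit [m/s^2], clearance [m]

def verify(traj_xy: np.ndarray, obstacles_xy: np.ndarray) -> bool:
    v = np.diff(traj_xy, axis=0) / DT                 # finite-difference velocities
    a = np.diff(v, axis=0) / DT                       # finite-difference accelerations
    if np.linalg.norm(a, axis=1).max(initial=0.0) > A_MAX:
        return False                                  # dynamically infeasible
    if obstacles_xy.size:
        d = np.linalg.norm(traj_xy[:, None, :] - obstacles_xy[None, :, :], axis=-1)
        if d.min() < MIN_CLEARANCE:
            return False                              # too close to an obstacle
    return True

def safe_plan(vla_traj, obstacles, fallback_traj):
    """Accept the VLA proposal only if it passes the checks; otherwise fall back."""
    return vla_traj if verify(vla_traj, obstacles) else fallback_traj

proposed = np.linspace([0.0, 0.0], [20.0, 0.0], 10)   # VLA-proposed waypoints
fallback = np.linspace([0.0, 0.0], [2.0, 0.0], 10)    # conservative slow-down
print(safe_plan(proposed, np.array([[10.0, 0.5]]), fallback)[:2])
```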
Method | Introduction | Year | Project |
---|---|---|---|
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsThis is an end-to-end VLA model specifically designed for autonomous driving. It generates reliable driving actions conditioned on 3D environmental perception, ego vehicle states, and driver commands. Key methodological contributions include a hierarchical vision-language alignment process to bridge the modality gap between driving visual representations and language embeddings, and an autoregressive agent-env-ego interaction process to ensure spatially and behaviorally informed trajectory planning. |
Arxiv 2025 | Code |
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving | Details |
Arxiv 2025 | Project |
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | Details |
Arxiv 2025 | Code / Project |
DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving | Details |
Arxiv 2025 | |
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning | DetailsThe article presents AutoVLA, an automated framework for training vision-language-action (VLA) models for autonomous driving. It generates training data by combining a rule-based planner, a visual perception model, and GPT-4 for reasoning, eliminating the need for manual annotation. AutoVLA fine-tunes a VLA model using this data and further improves it with a reward model for trajectory quality. The resulting model achieves competitive performance on the DriveLM benchmark. The main contribution is a scalable, automated pipeline for training driving VLAs, emphasizing practicality over architectural novelty. |
Arxiv 2025 | Code / Project |
This section groups papers based on overarching research themes and objectives within AD decision-making.
These models aim to learn a direct mapping from sensor inputs to driving actions or high-level plans, often minimizing handcrafted intermediate representations. The pursuit of end-to-end (E2E) models in autonomous driving is driven by the ambition to reduce error propagation inherent in modular pipelines and to potentially uncover novel, more effective driving strategies that might not emerge from separately optimized components. However, the "black box" nature and significant data requirements of traditional E2E deep learning models have been persistent challenges. The integration of LLMs, VLMs, and VLAs into E2E frameworks represents an effort to mitigate these issues by infusing these models with enhanced reasoning capabilities, better generalization from pre-training, and avenues for interpretability. This suggests a future where E2E AD systems are not purely opaque mappings but incorporate a semantic layer or reasoning backbone provided by foundation models, thus addressing key criticisms of earlier E2E approaches.
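One way foundation models enter E2E training without being needed at inference time (in the spirit of the VLM-AD entry below) is as an auxiliary supervision signal. The sketch below is a hypothetical minimal setup: a tiny planner with a trajectory head plus a text head aligned to a teacher's text embedding; dimensions and loss weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyE2EPlanner(nn.Module):
    def __init__(self, in_dim=128, feat_dim=256, text_dim=512, horizon=6):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.traj_head = nn.Linear(feat_dim, horizon * 2)  # (x, y) waypoints
        self.text_head = nn.Linear(feat_dim, text_dim)     # aligned with a VLM text embedding

    def forward(self, scene_feat):
        h = self.backbone(scene_feat)
        return self.traj_head(h), self.text_head(h)

def loss_fn(pred_traj, gt_traj, pred_text, teacher_text, w_text=0.1):
    l_traj = F.l1_loss(pred_traj, gt_traj)                                       # imitation loss
    l_text = 1.0 - F.cosine_similarity(pred_text, teacher_text, dim=-1).mean()   # semantic alignment
    return l_traj + w_text * l_text

model = TinyE2EPlanner()
traj, txt = model(torch.randn(4, 128))
print(traj.shape, txt.shape)  # torch.Size([4, 12]) torch.Size([4, 512])
```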
Method | Introduction | Year | Project |
---|---|---|---|
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | DetailsThis framework is specifically designed for evaluating various VLMs in an end-to-end fashion for autonomous driving planning tasks. It provides an open-source baseline workflow for integrating VLMs into E2E planning, enabling rapid prototyping. |
Arxiv 2025 | Code |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsDriveGPT4 is presented as an interpretable end-to-end autonomous driving system based on LLMs. It processes multi-frame video inputs and textual queries, predicts low-level vehicle control signals, and offers reasoning for its actions. |
RAL 2024 | Project |
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsThis VLA model is explicitly designed for end-to-end autonomous driving. It generates reliable driving trajectories conditioned on multimodal inputs including 3D environmental perception, ego vehicle state, and driver commands. |
Arxiv 2025 | Code |
ADAPT: Action-aware Driving Caption Transformer | DetailsADAPT proposes an end-to-end transformer-based architecture that jointly trains a driving captioning task and a vehicular control prediction task through a shared video representation, aiming for user-friendly narration and reasoning. |
ICRA 2023 | Code |
LMDrive: Closed-Loop End-to-End Driving with Large Language Models | DetailsThis work focuses on closed-loop end-to-end driving specifically with large language models, indicating a direct application of LLMs in the E2E driving pipeline. |
CVPR 2024 | Code |
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion | DetailsThis research aims to enhance end-to-end autonomous driving by using VLMs to provide attentional cues and fusing multimodal information (BEV and text features) for semantic supervision. |
Arxiv 2025 | |
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision | DetailsThis method leverages VLMs as teachers to provide reasoning-based text annotations, which serve as supplementary supervisory signals to train end-to-end AD pipelines, extending beyond standard trajectory labels. |
Arxiv 2024 | |
GenAD: Generative End-to-End Autonomous Driving | DetailsGenAD models autonomous driving as a trajectory generation problem, adopting an instance-centric scene tokenizer and a variational autoencoder for trajectory prior modeling in an E2E setup. |
Arxiv 2024 | Code |
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation | Details |
Arxiv 2025 | Code |
Distilling Multi-modal Large Language Models for Autonomous Driving | Details |
Arxiv 2025 | |
Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving | Details |
Arxiv 2025 | |
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | Details |
Arxiv 2025 | Code / Project |
DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving | Details |
Arxiv 2025 | |
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning | DetailsThe article presents AutoVLA, an automated framework for training vision-language-action (VLA) models for autonomous driving. It generates training data by combining a rule-based planner, a visual perception model, and GPT-4 for reasoning, eliminating the need for manual annotation. AutoVLA fine-tunes a VLA model using this data and further improves it with a reward model for trajectory quality. The resulting model achieves competitive performance on the DriveLM benchmark. The main contribution is a scalable, automated pipeline for training driving VLAs, emphasizing practicality over architectural novelty. |
Arxiv 2025 | Code / Project |
Focuses on making the decision-making processes of AD systems transparent and understandable to humans. The integration of LLMs and VLMs is pushing XAI in autonomous driving beyond simple attention maps or feature visualizations towards generating natural language explanations and justifications that are genuinely comprehensible to human users, including passengers and regulators. This is crucial for building public trust, facilitating regulatory approval, and enabling more effective human-AI collaboration in the driving context. The challenge, however, lies in ensuring that these generated explanations are faithful to the model's actual decision-making process and are not merely plausible-sounding rationalizations generated post-hoc. Future work will need to concentrate on methods that tightly couple the reasoning and explanation generation with the core decision logic of the AD system.
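The faithfulness concern can be made concrete with a toy example in which the explanation is constructed from exactly the quantities that determined the action, rather than generated separately afterwards; the thresholds below are illustrative.

```python
def brake_decision(ttc_s: float, lead_gap_m: float):
    evidence = {"time_to_collision_s": ttc_s, "lead_gap_m": lead_gap_m}
    action = "BRAKE" if (ttc_s < 2.0 or lead_gap_m < 5.0) else "MAINTAIN_SPEED"
    # The explanation cites only the evidence the rule actually used, so it
    # cannot drift from the decision logic.
    explanation = (f"{action}: time-to-collision={ttc_s:.1f}s, "
                   f"gap to lead vehicle={lead_gap_m:.1f}m")
    return action, explanation, evidence

print(brake_decision(1.4, 8.0))
```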
Method | Introduction | Year | Project |
---|---|---|---|
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsThe use of LLM-generated high-level decisions is explicitly stated to improve interpretability in complex autonomous driving scenarios. The system aims to make the "thinking process" visible. |
Arxiv 2023 | |
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | DetailsThis VLM framework employs Chain-of-Thought (CoT) prompting. CoT is used to enhance interpretability and facilitate structured reasoning within the VLM-based driving agents, allowing the model to output its reasoning steps. |
Arxiv 2025 | Code |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsA primary goal of DriveGPT4 is to develop an interpretable end-to-end autonomous driving solution. The MLLM is designed to interpret vehicle actions, offer pertinent reasoning, and address user queries, thereby making the system's behavior understandable. |
RAL 2024 | Project |
ADAPT: Action-aware Driving Caption Transformer | DetailsThis framework provides user-friendly natural language narrations and reasoning for each decision-making step of autonomous vehicular control and action. For example, it can output "[Action narration:] the car pulls over to the right side of the road, because the car is parking". |
ICRA 2023 | Code |
LingoQA: Visual Question Answering for Autonomous Driving | DetailsBy benchmarking video question answering, LingoQA facilitates the development of models that can justify actions and describe scenes in natural language, directly contributing to the explainability of driving decisions. |
ECCV 2024 | Code |
Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM | DetailsThe reasoning module in this framework is explicitly designed to be interpretable, enhancing transparency in how traffic rules are identified, interpreted, and applied to driving decisions. |
Arxiv 2024 | |
Explainable Artificial Intelligence for Autonomous Driving: A Comprehensive Overview and Field Guide for Future Research Directions | DetailsThis survey provides broader context on XAI in autonomous vehicles. It emphasizes that XAI serves to bridge complex technological capabilities with human understanding, addressing safety assurance, regulatory compliance, and public trust. XAI can provide real-time justifications for actions (e.g., sudden braking) and post-hoc explanations (e.g., visual heat maps, natural language descriptions). |
Arxiv 2024 | |
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model | DetailsThis work aims to provide generalisable driving explanations by employing retrieval-augmented in-context learning within Multi-Modal Large Language Models. |
Arxiv 2024 | Code / Project |
Addresses the paramount challenge of ensuring safety, especially in rare, unforeseen (long-tail) situations where traditional, purely data-driven systems often falter due to lack of representative data. Effectively handling these scenarios requires more than just scaling up models; it demands robust reasoning, the integration of explicit and implicit knowledge (including safety rules and commonsense), and rigorous validation methodologies. The integration of LLMs and VLMs offers a promising avenue by leveraging their potential for abstract reasoning and broad knowledge. However, the inherent risk of these models generating incorrect or "hallucinated" outputs in critical situations necessitates a cautious approach. Future progress in this area will likely depend on hybrid architectures that combine the generalization capabilities of foundation models with explicit safety layers or verifiers, methods for effectively injecting structured safety knowledge (like traffic laws or physical constraints) into the decision-making loop, and the development of advanced simulation and testing protocols specifically designed to probe behavior in diverse long-tail scenarios and rigorously evaluate safety.
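As a simple illustration of injecting explicit rules (in the spirit of the retrieval-augmented regulation work listed in the LLM section above), the sketch below retrieves candidate rules with a toy keyword match and asks an LLM to vet a proposed action against them; the rule snippets, retrieval scheme, and `query_llm` callable are assumptions.

```python
RULES = [
    "Do not overtake on the right on a two-lane road.",
    "Yield to pedestrians waiting at a marked crosswalk.",
    "Keep a minimum following distance of two seconds.",
]

def retrieve(situation: str, k: int = 2):
    overlap = lambda rule: len(set(situation.lower().split()) & set(rule.lower().split()))
    return sorted(RULES, key=overlap, reverse=True)[:k]

def vet_action(situation: str, proposed_action: str, query_llm) -> str:
    rules = "\n".join(retrieve(situation))
    prompt = (f"Relevant rules:\n{rules}\n\nSituation: {situation}\n"
              f"Proposed action: {proposed_action}\n"
              "Answer COMPLIANT or VIOLATION with one short reason.")
    return query_llm(prompt)

# Example with a stubbed LLM:
print(vet_action("pedestrian waiting at a marked crosswalk ahead",
                 "maintain speed", lambda p: "VIOLATION: must yield at the crosswalk"))
```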
Method | Introduction | Year | Project |
---|---|---|---|
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsAims to improve generalization to rare events by leveraging the commonsense reasoning capabilities of LLMs, which is crucial for safety in unexpected situations. |
Arxiv 2023 | |
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsThis VLA model specifically targets the challenges of limited generalization to long-tail scenarios and insufficient understanding of high-level semantics within complex driving scenes, which are critical for safe decision-making. |
Arxiv 2025 | Code |
Generative AI for Autonomous Driving: Frontiers and Opportunities | DetailsThis survey identifies comprehensive generalization across rare cases and the development of robust evaluation and safety checks as key obstacles and future opportunities for GenAI in autonomous driving. |
Arxiv 2025 | Project |
Empowering Autonomous Driving with Large Language Models: A Safety Perspective | DetailsThis work directly focuses on enhancing safety by proposing methodologies that employ LLMs as intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning. |
ICLR 2024 | |
Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving | DetailsThis MM-LLM is designed to address long-tail events by tokenizing the world into object-level knowledge, enabling better utilization of an LLM's reasoning capabilities for enhanced autonomous vehicle planning in such scenarios. |
Arxiv 2024 | |
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision | DetailsThis method addresses the limitations of E2E models in handling diverse real-world scenarios by using VLMs as teachers to provide reasoning-based supervision, which can help in understanding and reacting to less common situations. |
Arxiv 2024 |
Utilizes Reinforcement Learning (RL) techniques, often guided or enhanced by foundation models (LLMs/VLMs), to learn optimal driving policies through interaction with simulated or real-world environments. The synergy between foundation models and RL represents a powerful emerging trend in autonomous driving. Foundation models can address key RL challenges, such as high sample complexity and difficult reward design, by providing high-level guidance, superior state representations, or even by directly shaping the reward function itself. This combination can lead to more data-efficient learning of complex driving behaviors, improved generalization to novel scenarios by leveraging the pre-trained knowledge embedded in foundation models, and more interpretable reward structures, especially if derived from linguistic goals. The primary research focus in this area will likely be on optimizing the structure of this synergy—for example, determining whether the LLM should act as a planner for an RL agent, a reward shaper, or if end-to-end RL fine-tuning of a VLM is the most effective approach.
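One of the synergies mentioned above, using a language model as a zero-shot reward signal over undesired outcomes (as in LORD below), can be sketched as follows; the `score_text` scorer and the weighting are placeholders.

```python
UNDESIRED = "the ego vehicle collides with another road user"

def shaped_reward(base_reward: float, state_description: str, score_text, w: float = 1.0) -> float:
    # score_text(premise, hypothesis) -> match score in [0, 1] from a language model
    p_undesired = score_text(state_description, UNDESIRED)
    return base_reward - w * p_undesired

# Example with a stubbed scorer:
print(shaped_reward(0.1, "ego is 0.3 m behind the lead car and closing fast",
                    lambda state, goal: 0.9))
```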
Method | Introduction | Year | Project |
---|---|---|---|
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning | DetailsThis framework explicitly combines a "teacher" LLM with a "student" Deep Reinforcement Learning (DRL) agent. The LLM, using Chain-of-Thought reasoning and incorporating risk metrics and historical data, guides the DRL policy. This guidance aims to accelerate policy convergence and boost robustness across diverse driving conditions, mitigating DRL's high sample complexity and the LLM's real-time decision-making challenges. |
Arxiv 2025 | Project |
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | DetailsAlphaDrive proposes a VLM-based reinforcement learning and reasoning framework specifically for autonomous driving planning. It introduces four Group Relative Policy Optimization (GRPO)-based RL rewards tailored for planning (planning accuracy, action-weighted, planning diversity, planning format) and employs a two-stage planning reasoning training strategy that combines Supervised Fine-Tuning (SFT) with RL. This approach is shown to significantly improve planning performance and training efficiency. |
Arxiv 2025 | Project |
LORD: Large Models based Opposite Reward Design for Autonomous Driving | DetailsThis work introduces a novel approach to reward design in RL for autonomous driving. Instead of defining desired linguistic goals (e.g., "drive safely"), which can be ambiguous, LORD leverages large pretrained models (LLMs/VLMs) to focus on concrete undesired linguistic goals (e.g., "collision"). This allows for more efficient use of these models as zero-shot reward models, aiming for safer and enhanced autonomous driving with improved generalization. |
WACV 2024 | |
NaVILA: Legged Robot Vision-Language-Action Model for Navigation | DetailsWhile the high-level decision-making in NaVILA is handled by a VLA generating linguistic commands, the low-level locomotion policy responsible for executing these commands is trained using RL. This demonstrates a hierarchical approach where RL handles the dynamic execution based on VLA guidance. |
RSS 2025 | Project |
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning | DetailsThe article presents a fast-slow autonomous driving framework that combines an LLM for interpreting high-level user instructions with an RL agent for real-time control. The LLM generates structured directives based on context and memory, while the RL module ensures safe execution under dynamic conditions. Experiments demonstrate improved safety, comfort, and user alignment over baseline methods. |
Arxiv 2025 | |
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning | DetailsThis research explores fine-tuning VLMs with RL for general multi-step goal-directed tasks. The VLM generates Chain-of-Thought reasoning leading to a text-based action, which is then parsed and executed in an interactive environment to obtain task rewards for RL-based fine-tuning. This general methodology is highly relevant for training decision-making agents in autonomous driving. |
NeurIPS 2024 | Code / Project |
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning | DetailsThe article proposes Drive-R1, a domain-specific vision-language model for autonomous driving that integrates structured reasoning and trajectory planning using reinforcement learning. It addresses two key limitations in current VLM-based planning systems: overdependence on textual history over visual input, and poor alignment between reasoning and planning. By combining supervised fine-tuning on a curated chain-of-thought dataset with Group Relative Policy Optimization, Drive-R1 improves trajectory accuracy and safety. Experimental results on nuScenes and DriveLM-nuScenes show marginal gains over strong baselines. While methodologically sound, the contribution is incremental—the core advancement lies more in system integration than in fundamental algorithmic innovation. |
Arxiv 2025 |
Involves building internal representations of the environment and its dynamics to predict future states and plan actions accordingly. Foundation models, particularly generative models like Diffusion Models, GANs, LLMs, and VLMs, are becoming key enablers for constructing more powerful and versatile world models for autonomous driving. These advanced world models are evolving beyond simple state prediction to encompass rich semantic understanding, the generation of diverse future scenarios (including long-tail events), and even interaction with language-based instructions or goals. This fusion of generative world models with the reasoning capabilities of LLMs/VLMs can lead to AD systems that perform sophisticated "what-if" analyses, anticipate a broader range of future possibilities, and plan more robustly by "imagining" the consequences of actions within a semantically rich, simulated future. This also has profound implications for creating highly realistic and controllable simulation environments for training and testing AD systems.
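A minimal sketch of planning "by imagination" inside a learned world model: candidate action sequences are rolled out through a one-step dynamics model, imagined futures are scored, and the first action of the best sequence is executed. The dynamics model, cost, and sampling scheme are illustrative, not any listed paper's architecture.

```python
import numpy as np

def plan_by_imagination(state, dynamics, cost, horizon=8, n_candidates=64, seed=0):
    rng = np.random.default_rng(seed)
    # Candidate action sequences: (n_candidates, horizon, action_dim)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))
    best_action, best_cost = None, np.inf
    for seq in candidates:
        s, total = state, 0.0
        for a in seq:              # roll the sequence out in imagination
            s = dynamics(s, a)     # predicted next (latent) state
            total += cost(s, a)
        if total < best_cost:
            best_cost, best_action = total, seq[0]
    return best_action             # receding horizon: execute only the first action

# Toy dynamics and cost just to make the sketch executable:
print(plan_by_imagination(np.zeros(2),
                          dynamics=lambda s, a: s + 0.1 * a,
                          cost=lambda s, a: float(np.sum((s - 1.0) ** 2))))
```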
Method | Introduction | Year | Project |
---|---|---|---|
3D-VLA: A 3D Vision-Language-Action Generative World Model | DetailsThis framework explicitly incorporates a generative world model within its 3D vision-language-action architecture. It allows the model to "imagine" future scenarios by predicting goal images and point clouds, which then guide action planning. The model is built on a 3D-LLM and uses interaction tokens to engage with the environment. |
ICML 2024 | Code |
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation | DetailsDriveDreamer utilizes a powerful diffusion model to construct a comprehensive representation of the driving environment. It can generate future driving videos and driving policies, effectively acting as a multimodal world model. DriveDreamer-2 enhances this by incorporating an LLM to generate user-defined driving videos with improved temporal and spatial coherence. |
Arxiv 2024 | Project |
GAIA-1: A Generative World Model for Autonomous Driving | DetailsDeveloped by Wayve, GAIA-1 is a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios. It serves as a valuable neural simulator for autonomous driving. |
Arxiv 2023 | |
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving | DetailsThis is a multiview world model capable of generating high-quality, controllable, and consistent multiview videos in autonomous driving scenes. It also explores applications in end-to-end planning. |
Arxiv 2023 | Code / Project |
TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction | DetailsThis research aims towards developing world models specifically for autonomous driving simulation and motion prediction. |
ICRA 2023 | Code |
UniWorld: Autonomous Driving Pre-training via World Models | DetailsFocuses on autonomous driving pre-training via world models, suggesting that learning a world model can provide a foundational understanding for downstream AD tasks. |
Arxiv 2023 | |
Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models | DetailsWhile not explicitly a "world model" in the generative sense of predicting future full scenes, Occ-LLM uses occupancy representations with LLMs, which is a form of modeling the current state of the world for decision-making. |
ICRA 2025 |
Decomposes the complex driving task into multiple levels of abstraction, often with foundation models handling higher-level reasoning and planning, and other specialized modules executing lower-level control actions. This approach mirrors human cognitive strategies where high-level goals are broken down into manageable sub-tasks. In autonomous driving, this means LLMs or VLMs might determine a strategic maneuver (e.g., "prepare to change lanes and overtake"), which is then translated into a sequence of tactical actions (e.g., check mirrors, signal, adjust speed, steer) executed by a more traditional planner or controller. This layered approach allows for leveraging the strengths of foundation models in complex reasoning and language understanding, while relying on established methods for precise, real-time vehicle control, potentially offering a more robust and interpretable path to autonomy.
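A toy sketch of the interface between levels, in the spirit of the NaVILA entry below: a high-level model emits a mid-level command in natural language, and a small parser turns it into setpoints for a low-level controller. The command grammar is an assumption for illustration.

```python
import re

def parse_command(cmd: str):
    """Map e.g. 'turn right 30 degrees' or 'move forward 75 cm' to controller setpoints."""
    if m := re.match(r"turn (left|right) (\d+) degrees", cmd):
        sign = -1.0 if m.group(1) == "left" else 1.0
        return {"yaw_delta_deg": sign * float(m.group(2)), "distance_m": 0.0}
    if m := re.match(r"move forward (\d+) ?cm", cmd):
        return {"yaw_delta_deg": 0.0, "distance_m": float(m.group(1)) / 100.0}
    return None  # unknown command: let the caller fall back to a safe stop

print(parse_command("turn right 30 degrees"))
print(parse_command("move forward 75 cm"))
```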
Method | Introduction | Year | Project |
---|---|---|---|
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsThis VLA model employs a hierarchical vision-language alignment process. While aiming for end-to-end action generation, its internal mechanisms likely involve different levels of representation and processing for perception, reasoning, and action generation, contributing to spatially and behaviorally informed trajectory planning. |
Arxiv 2025 | Code |
NaVILA: Legged Robot Vision-Language-Action Model for Navigation | DetailsThis is a prime example of hierarchical control in a VLA framework. The high-level VLM processes visual input and natural language instructions to generate mid-level actions expressed in natural language (e.g., "turn right 30 degrees," "move forward 75cm"). These linguistic commands are then interpreted and executed by a separate low-level visual locomotion policy trained with reinforcement learning. This decouples complex reasoning from real-time motor control. |
RSS 2025 | Project |
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsThis system uses an LLM for high-level decision-making (scenario encoding, action guidance). These high-level textual decisions are then translated into precise mathematical representations and parameters that guide a low-level Model Predictive Controller (MPC) responsible for the actual driving commands. This clearly separates strategic decision-making from operational control. |
Arxiv 2023 | |
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning | DetailsHere, a "teacher" LLM produces high-level driving strategies through chain-of-thought reasoning. These strategies then guide an attention-based "student" DRL policy, which handles the final decision-making and action execution. This represents a hierarchical guidance mechanism. |
Arxiv 2025 | Project |
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | DetailsDriveVLM integrates reasoning modules for scene description, scene analysis, and hierarchical planning. The DriveVLM-Dual system further exemplifies this by using low-frequency motion plans from the VLM as initial plans for a faster, traditional refining planner. |
Arxiv 2024 | Project |
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving | DetailsThis system incorporates a dual-process decision-making module. The Analytic Process (System-II, using an LLM) performs thorough analysis and reasoning, accumulating experience. This experience is then transferred to a lightweight Heuristic Process (System-I) for swift, empirical decision-making. This can be seen as a cognitive hierarchy. |
NeurIPS 2024 | Code |
Mixing Left and Right-Hand Driving Data in a Hierarchical Framework With LLM Generation | DetailsWhile focused on data compatibility for trajectory prediction, this work proposes a hierarchical framework, suggesting that different levels of processing or adaptation might be needed when dealing with complex data sources, which can inform hierarchical planning approaches. |
RAL 2024 |
This section organizes papers based on the specific aspect of the autonomous driving decision-making pipeline they primarily address.
These works focus on how enhanced perception, often through LLMs/VLMs, directly informs or enables better downstream decision-making. This involves not just detecting objects but understanding their context, relationships, and potential impact on driving strategy.
Method | Introduction | Year | Project |
---|---|---|---|
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning | DetailsImproves multi-view driving reasoning by dynamically fusing visual features based on text queries. This enhanced scene understanding directly supports more informed decision-making by providing better contextual accuracy from multiple viewpoints. |
Arxiv 2025 | |
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsConditions driving actions on 3D environmental perception, ego vehicle states, and driver commands. The hierarchical vision-language alignment projects 2D and 3D visual tokens into a unified semantic space, enabling perception to directly guide trajectory generation. |
Arxiv 2025 | Code |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsProcesses multi-frame video inputs to interpret vehicle actions and offer reasoning. The decision to predict control signals is directly informed by its multimodal understanding of the visual scene and textual queries. |
RAL 2024 | |
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving | DetailsThe VLM component is crucial for scene understanding, providing descriptions of critical objects that influence driving decisions. This perception output is the direct input to the dual-process decision-making module. |
NeurIPS 2024 | Code |
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | DetailsLeverages VLMs for enhanced scene understanding (scene description, scene analysis) which then feeds into its hierarchical planning modules. The perception of long-tail critical objects is a key focus. |
Arxiv 2024 | Project |
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts | DetailsCombines VLM recognition with SAM segmentation for open-ended object detection. While primarily a perception method, accurate detection and localization of all objects, including rare ones, is fundamental for safe decision-making. |
NeurIPS 2024 | |
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion | DetailsIntegrates textual representations (driver attentional cues from VLMs) into Bird's-Eye-View (BEV) features for semantic supervision, enabling the model to learn richer feature representations that explicitly capture driver's attentional semantics, directly impacting driving decisions. |
Arxiv 2025 | |
HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving | DetailsFocuses on high-resolution understanding in MLLMs for AD, specifically for identifying, explaining, and localizing risk objects (ROLISP task), which is a critical perceptual input for safe decision-making. |
IJCV 2025 | Project |
Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving | DetailsProvides a language-enhanced interface for BEV maps, allowing natural language queries to interpret complex driving scenes represented in BEV, thus informing situational awareness for decision-making. |
ICRA 2024 | Code / Project |
Concerns predicting the future behavior of other road users (vehicles, pedestrians, cyclists) and planning the ego-vehicle's behavior in response, often involving understanding intentions and social interactions.
Method | Introduction | Year | Project |
---|---|---|---|
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsThe LLM performs scenario encoding, which includes predicting future trajectories of other vehicles and selecting the most likely one. This predictive capability informs its high-level action guidance for the ego vehicle. |
Arxiv 2023 | |
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsModels dynamic relationships between the ego vehicle, surrounding agents, and static road elements through an autoregressive agent-env-ego interaction process. This explicit modeling of interactions is key for behavioral planning and prediction. |
Arxiv 2025 | Code |
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | DetailsTailored for high-level planning in autonomous driving, which inherently involves deciding on driving behaviors (e.g., lane changes, yielding) based on the current scene and predicted future states. The emergent multimodal planning capabilities also hint at understanding complex interactions. |
Arxiv 2025 | Code |
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | DetailsIncludes modules for scene analysis that analyze possible intent-level behavior of critical objects, feeding into its hierarchical planning. This is directly related to predicting other agents' behaviors. |
Arxiv 2024 | Project |
Empowering Autonomous Driving with Large Language Models: A Safety Perspective | DetailsFocuses on LLMs as intelligent decision-makers in behavioral planning, augmented with safety verifiers. This includes an LLM-enabled interactive behavior planning scheme. |
ICLR 2024 | |
A Language Agent for Autonomous Driving | DetailsThe LLM agent performs task planning and motion planning based on perceived and predicted states of the environment, including interactions with other agents. |
COLM 2024 | Code / Project |
LC-LLM: Explainable Lane-Change Intention and Trajectory Predictions with Large Language Models | DetailsSpecifically uses LLMs for predicting lane-change intentions and trajectories, providing explainable predictions for this critical driving behavior. |
CTR 2025 | |
Large Language Models Powered Context-aware Motion Prediction in Autonomous Driving | DetailsLeverages GPT-4V to comprehend complex traffic scenarios and combines this contextual information with traditional motion prediction models (like MTR) to improve behavioral prediction. |
IROS 2024 | Code / Project |
Focuses on generating safe, comfortable, and feasible paths or sequences of waypoints for the autonomous vehicle to follow.
Method | Introduction | Year | Project |
---|---|---|---|
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsWhile the LLM provides high-level decisions, these are translated to guide a Model Predictive Controller (MPC) which performs the fine-grained trajectory planning and control. |
Arxiv 2023 | |
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | DetailsEvaluates VLMs on the nuScenes prediction task, where predicted control actions are numerically integrated to produce a predicted trajectory, which is then compared against ground truth. This is a form of motion planning. |
Arxiv 2025 | Code |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsPredicts low-level vehicle control signals (speed, turning angle) in an end-to-end fashion, which implicitly defines a trajectory. |
RAL 2024 | |
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | DetailsSpecifically designed for autonomous driving planning, generating high-level plans that would then be refined into detailed trajectories. The rewards are tailored for planning accuracy and action importance. |
Arxiv 2025 | Code |
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | DetailsFeatures hierarchical planning modules. The DriveVLM-Dual system uses coarse, low-frequency waypoints from the VLM as an initial plan for a faster refining planner. |
Arxiv 2024 | Project |
GPT-Driver: Learning to Drive with GPT | DetailsModels motion planning as a language modeling problem, leveraging the LLM to generate driving trajectories by aligning its output with human driving behavior. |
NeurIPS 2023 | Code / Project |
GenAD: Generative End-to-End Autonomous Driving | DetailsModels autonomous driving as a trajectory generation problem using an instance-centric scene tokenizer and a variational autoencoder for trajectory prior modeling. |
Arxiv 2024 | Code |
VLP: Vision Language Planning for Autonomous Driving | DetailsProposes a Vision Language Planning model composed of ALP (Action Localization and Prediction) and SLP (Safe Local Planning) components to improve ADS from BEV reasoning and decision-making for planning. |
CVPR 2024 | |
FutureSightDrive: Visualizing Trajectory Planning with Spatio-Temporal CoT for Autonomous Driving | DetailsThe paper presents FSDrive, a framework that enables autonomous vehicles to perform visual reasoning for trajectory planning using a spatio-temporal Chain-of-Thought (CoT). Instead of relying on abstract text-based logic, FSDrive uses a visual language model to generate future scene representations as images, capturing spatial and temporal dynamics. It introduces a lightweight pretraining method to activate image generation in existing models and employs these visual predictions as intermediate reasoning steps. |
Arxiv 2025 | Code |
Involves models that directly output low-level control commands for the vehicle, such as steering angle and acceleration/braking.
Method | Introduction | Year | Project |
---|---|---|---|
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsExplicitly states that it predicts low-level vehicle control signals (vehicle speed and turning angle) in an end-to-end fashion. |
RAL 2024 | |
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | DetailsThe final stage of its Chain-of-Thought prompting explicitly outputs a sequence of predicted control actions (e.g., [(v1, c1), (v2, c2),...]) which are then numerically integrated to produce the predicted trajectory. |
Arxiv 2025 | Code |
ADAPT: Action-aware Driving Caption Transformer | DetailsJointly trains a vehicular control signal prediction task alongside driving captioning. The CSP head predicts control signals (e.g., speed, acceleration) based on video frames. |
ICRA 2023 | Code |
Focuses on enabling autonomous vehicles to understand and respond to human language commands, preferences, or queries, facilitating more natural and intuitive interaction.
Method | Introduction | Year | Project |
---|---|---|---|
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning | Details |
Arxiv 2025 | Project |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsCapable of processing textual queries from users and providing natural language responses, such as describing vehicle actions or explaining reasoning. |
RAL 2024 | |
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsAllows driving behavior adjustment (e.g., conservative vs. aggressive) based on textual inputs from users or high-precision maps. |
Arxiv 2023 | |
Dolphins: Multimodal Language Model for Driving | DetailsDeveloped as a VLM-based conversational driving assistant, capable of understanding and responding to human interaction. |
Arxiv 2023 | Code / Project |
Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving | DetailsProvides a large vision-language model interface for bird’s-eye view maps, allowing users to interpret driving contexts through freeform natural language queries. |
ICRA 2024 | Code / Project |
Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles | DetailsIntroduces an LLM-based framework to process verbal commands from humans and make autonomous driving decisions that satisfy personalized preferences for safety, efficiency, and comfort. |
WACVW 2024 | |
Human-Centric Autonomous Systems With LLMs for User Command Reasoning | DetailsProposes leveraging LLMs' reasoning capabilities to infer system requirements from in-cabin users’ commands, using the UCU Dataset. |
WACVW 2024 | Code |
ChatGPT as Your Vehicle Co-Pilot: An Initial Attempt | DetailsDesigns a universal framework embedding LLMs as a vehicle "Co-Pilot" to accomplish specific driving tasks based on human intention and provided information. |
TIV 2023 | |
LingoQA: Visual Question Answering for Autonomous Driving | DetailsWhile a benchmark, its focus on video question answering (including action justification and scene description) directly supports the development of systems that can interactively explain their understanding and decisions to humans. |
ECCV 2024 | Code |
This section classifies papers based on the core AI techniques or methodologies they employ or innovate upon.
Many LLMs, VLMs, and VLAs are based on the Transformer architecture, known for its efficacy in handling sequential data and capturing long-range dependencies.
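For readers less familiar with the shared backbone, the snippet below applies a standard Transformer encoder to a sequence of scene tokens (e.g., per-agent or per-patch embeddings), letting self-attention capture long-range dependencies; shapes and sizes are arbitrary.

```python
import torch
import torch.nn as nn

d_model = 64
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

scene_tokens = torch.randn(1, 20, d_model)  # (batch, num_tokens, d_model)
context = encoder(scene_tokens)             # same shape; tokens are now contextualized
print(context.shape)                        # torch.Size([1, 20, 64])
```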
Method | Introduction | Year | Project |
---|---|---|---|
ADAPT: Action-aware Driving Caption Transformer | DetailsExplicitly an end-to-end transformer-based architecture. It uses a Video Swin Transformer as the visual encoder and vision-language transformers for text generation and motion prediction. |
ICRA 2023 | Code |
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning | DetailsWhile the core "teacher" is an LLM (typically Transformer-based), it guides an attention-based Student DRL policy. Self-attention mechanisms are integral to Transformers. |
Arxiv 2025 | Project |
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | DetailsLeverages Large Language Models (LLMs), which are predominantly Transformer-based, for high-level decision-making. |
Arxiv 2023 | |
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning | DetailsWhile focusing on a novel pooling module, it's designed for Vision-Language Models, many of which have Transformer backbones for either vision, language, or fusion. The paper contrasts its pooling with costly attention mechanisms (core to Transformers). |
Arxiv 2025 | |
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | DetailsEvaluates various state-of-the-art VLMs, the majority of which (e.g., GPT-4o, Gemini, Claude, LLaMA-3.2-Vision, Qwen2.5-VL) are Transformer-based. |
Arxiv 2025 | Code |
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsBuilds upon open-source pre-trained large Vision-Language Models (VLMs) and language foundation models, which are typically Transformer architectures. |
Arxiv 2025 | Code |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsBased on Multimodal Large Language Models (MLLMs), which are extensions of Transformer-based LLMs. |
RAL 2024 | |
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving | DetailsThe Analytic Process uses an LLM (Transformer-based), and the Heuristic Process uses a lightweight language model, which could also be Transformer-based. |
NeurIPS 2024 | Code |
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | DetailsProposes a VLM (typically Transformer-based) tailored for high-level planning. |
Arxiv 2025 | Code |
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | DetailsUses Vision Transformers (ViT) as the image tokenizer and Qwen (a Transformer-based LLM) as the LLM backbone. |
Arxiv 2024 | Project |
3D-VLA: A 3D Vision-Language-Action Generative World Model | DetailsBuilt on top of a 3D-based Large Language Model (LLM), which implies a Transformer architecture. |
Arxiv 2024 | Code |
These works explore or propose novel ways to combine information from different modalities (e.g., vision, language, LiDAR, radar, vehicle states) for improved decision-making. Effective fusion is critical for VLMs and VLAs. The challenge in multimodal fusion lies in effectively aligning and integrating information from disparate sources, such as pixel-level visual data and semantic language features. This becomes even more complex when dealing with multiple sensor inputs (cameras, LiDAR, radar) and dynamic temporal information inherent in driving.
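As one deliberately simplified fusion pattern, the sketch below gates Bird's-Eye-View (BEV) features against a projected text embedding with a single learnable weight, loosely in the spirit of the weighted BEV-text fusion mentioned for VLM-E2E below. The module name, tensor shapes, and gating scheme are assumptions for illustration; the listed papers use richer mechanisms such as cross-attention or query-aware pooling.

```python
import torch
import torch.nn as nn

class WeightedBEVTextFusion(nn.Module):
    """Illustrative learnable weighted fusion of BEV and text features.

    Assumes BEV features of shape (B, C, H, W) and a per-sample text
    embedding of shape (B, C); a learned scalar gate balances the two
    modalities.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.text_proj = nn.Linear(channels, channels)
        self.gate = nn.Parameter(torch.tensor(0.5))  # learnable balance weight

    def forward(self, bev: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        text_map = self.text_proj(text)[:, :, None, None]  # broadcast over H, W
        alpha = torch.sigmoid(self.gate)
        return alpha * bev + (1.0 - alpha) * text_map

fusion = WeightedBEVTextFusion(channels=64)
fused = fusion(torch.randn(2, 64, 50, 50), torch.randn(2, 64))
print(fused.shape)  # torch.Size([2, 64, 50, 50])
```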
Method | Introduction | Year | Project |
---|---|---|---|
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning | DetailsIntroduces Text-Guided SoftSort Pooling (TGSSP) as a novel method for dynamic and query-aware multi-view visual feature aggregation, aiming for more efficient fusion than costly attention mechanisms. |
Arxiv 2025 | |
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | DetailsProposes a hierarchical vision-language alignment process to project both 2D and 3D structured visual tokens into a unified semantic space, explicitly addressing the modality gap for language-guided trajectory generation. |
Arxiv 2025 | Code |
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | DetailsAs an MLLM, it inherently processes and fuses multi-frame video inputs with textual queries to inform its reasoning and control signal prediction. It tokenizes video sequences and text/control signals. |
RAL 2024 | |
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion | DetailsFocuses on multimodal driver attention fusion. It integrates textual representations (attentional cues from VLMs) into Bird's-Eye-View (BEV) features for semantic supervision and introduces a BEV-Text learnable weighted fusion strategy to balance contributions from visual and textual modalities. |
Arxiv 2025 | |
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | DetailsThe VLM architecture takes multiple image frames as input and uses Qwen as the LLM backbone, implying fusion of visual features with the language model's processing. The DriveVLM-Dual system also fuses perception information from the VLM with a traditional AV 3D perception module. |
Arxiv 2024 | Project |
3D-VLA: A 3D Vision-Language-Action Generative World Model | DetailsLinks 3D perception (point clouds, images, depth) with language and action through a generative world model. It uses a projector to efficiently align LLM output features with diffusion models for generating multimodal goal states. |
Arxiv 2024 | Code |
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models | DetailsProposes integrating instruction-aware BEV features with existing MLLMs to improve holistic understanding. |
CVPR 2024 | |
COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems | DetailsWhile a general pre-training approach for autonomous systems, COMPASS constructs a multimodal graph to connect signals from different modalities (e.g., camera, LiDAR, IMU, odometry) and maps them into factorized spatio-temporal latent spaces for motion and state representation. This is relevant for learning fused representations for decision-making. |
IROS 2022 | Code |
Focuses on designing effective prompts to guide foundation models, especially LLMs and VLMs, to elicit desired reasoning processes and outputs for decision-making tasks. Chain-of-Thought (CoT) prompting, which encourages models to generate intermediate reasoning steps, is a prominent technique in this area. This method simulates human-like reasoning by breaking down complex problems into a sequence of manageable steps, leading to more accurate and transparent outputs, particularly for tasks requiring multi-step reasoning.
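The snippet below sketches a hypothetical CoT prompt skeleton for a tactical driving decision: a structured scene description, explicit intermediate reasoning steps, and a constrained action vocabulary. The wording, fields, and action set are illustrative assumptions and are not taken from any specific paper in the table that follows.

```python
# Hypothetical Chain-of-Thought prompt template for a high-level driving decision.
COT_PROMPT_TEMPLATE = """You are the decision module of an autonomous vehicle.

Scenario:
{scene_description}

Reason step by step:
1. Describe the immediate risks and their severity.
2. List the feasible maneuvers (keep lane, change left, change right, brake).
3. Evaluate each maneuver for safety, efficiency, and comfort.
4. Output the chosen maneuver as a single token from
   {{KEEP_LANE, CHANGE_LEFT, CHANGE_RIGHT, BRAKE}}.
"""

# In practice the scenario block would be built from perception outputs;
# here it is a fixed placeholder.
prompt = COT_PROMPT_TEMPLATE.format(
    scene_description=(
        "- Ego: lane 2 of 3, speed 22 m/s.\n"
        "- Lead vehicle: 35 m ahead in lane 2, speed 15 m/s.\n"
        "- Left lane: 60 m gap behind a vehicle travelling at 25 m/s."
    )
)
print(prompt)
```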
Method | Introduction | Year | Project |
---|---|---|---|
TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning | DetailsExplicitly uses Chain-of-Thought (CoT) reasoning in its Teacher LLM. The LLM incorporates risk metrics, historical scenario retrieval, and domain heuristics into context-rich prompts to produce high-level driving strategies. The CoT approach helps the model iteratively evaluate collision severity, maneuver consequences, and broader traffic implications, reducing logical inconsistencies. |
Arxiv 2025 | Project |
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | DetailsEmploys a Chain-of-Thought (CoT) prompting strategy for its VLM-based autonomous driving agents. This is used to enhance interpretability and facilitate structured reasoning, with the final stage of the CoT explicitly outputting a sequence of predicted control actions. |
Arxiv 2025 | Code |
A Language Agent for Autonomous Driving | DetailsThe reasoning engine of this LLM-based agent is capable of chain-of-thought reasoning, among other capabilities like task planning and motion planning. |
COLM 2024 | Code / Project |
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision | DetailsLeverages VLMs as teachers to generate reasoning-based text annotations. The annotation process involves prompting a VLM (GPT-4o) with visual input (front-view image with projected future trajectory) and specific instructions to interpret the scenario, generate reasoning, and identify ego-vehicle actions. |
Arxiv 2024 | |
Large Language Models Powered Context-aware Motion Prediction in Autonomous Driving | DetailsDesigns prompt engineering strategies that enable GPT-4V, without any fine-tuning, to comprehend complex traffic scenarios for motion prediction. |
IROS 2024 | Code |
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning | DetailsThis work prompts the VLM to generate chain-of-thought (CoT) reasoning to enable efficient exploration of intermediate reasoning steps that lead to the final text-based action in an RL framework. |
NeurIPS 2024 | Code / Project |
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | DetailsDesigns prompts for GPT-4 to generate coherent Q&A data for driving tasks by using simulated trajectories for counterfactual reasoning to identify key traffic elements and assess outcomes. |
Arxiv 2024 | Code |
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning | DetailsProposes Drive-R1, a domain-specific vision-language model for autonomous driving that integrates structured reasoning and trajectory planning via reinforcement learning. It addresses two key limitations of current VLM-based planning systems: over-reliance on textual history at the expense of visual input, and poor alignment between reasoning and planning. By combining supervised fine-tuning on a curated chain-of-thought dataset with Group Relative Policy Optimization, Drive-R1 improves trajectory accuracy and safety. Experiments on nuScenes and DriveLM-nuScenes show marginal gains over strong baselines; the contribution is incremental, lying more in system integration than in fundamentally new algorithms. |
Arxiv 2025 | |
Involves transferring knowledge from larger, more capable models (like powerful proprietary LLMs/VLMs) or from diverse data sources to smaller, more efficient models suitable for deployment in autonomous vehicles, or adapting models trained in one domain (e.g., general web text/images) to the specific domain of autonomous driving.
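For reference, the sketch below shows classic logit-based knowledge distillation with a temperature-scaled KL term blended with a hard-label loss. Note that several of the works listed here distill differently, for example by fine-tuning a student on teacher-generated text annotations (SFT), so this is a generic illustration rather than any paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft KL term against the teacher's tempered distribution
    with a hard cross-entropy term against ground-truth labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2           # standard temperature scaling
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
print(loss.item())
```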
Method | Introduction | Year | Project |
---|---|---|---|
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | DetailsEmploys a two-stage planning reasoning training strategy that explicitly involves knowledge distillation. In the first stage, a large model (e.g., GPT-4o) generates a high-quality dataset of planning reasoning processes, which is then used to fine-tune the AlphaDrive model via Supervised Fine-Tuning (SFT), effectively distilling knowledge from the larger model. |
Arxiv 2025 | Code |
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving | DetailsThe Analytic Process (System-II), which uses a powerful LLM, accumulates linguistic driving experience. This experience is then transferred to the lightweight language model of the Heuristic Process (System-I) through supervised fine-tuning. This is a form of knowledge transfer from a more capable reasoning system to a more efficient one. |
NeurIPS 2024 | Code |
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision | DetailsThis method explicitly uses VLMs as "teachers" to automatically generate reasoning-based text annotations. These annotations then serve as supplementary supervisory signals to train end-to-end AD models. This process distills driving reasoning knowledge from the VLM to the student E2E model. |
Arxiv 2024 | |
Domain Knowledge Distillation from Large Language Model: An Empirical Study in the Autonomous Driving Domain | DetailsThis paper directly investigates distilling domain knowledge from LLMs (specifically ChatGPT) for the autonomous driving domain, developing a web-based distillation assistant. |
ITSC 2023 | |
Mixing Left and Right-Hand Driving Data in a Hierarchical Framework With LLM Generation | DetailsWhile focused on data compatibility, this work uses an LLM-based sample generation method and techniques such as Maximum Mean Discrepancy (MMD) to reduce the domain gap between datasets collected under different driving rules (left-hand vs. right-hand traffic). This can be seen as a form of domain adaptation or transfer learning for trajectory prediction models. |
RAL 2024 | |
- nuScenes : A large-scale multimodal dataset widely used for various AD tasks, including 3D object detection, tracking, and prediction. It features data from cameras, LiDAR, and radar, along with full sensor suites and map information. Several works like LightEMMA, OpenDriveVLA, and GPT-Driver utilize nuScenes for evaluation or data generation. It is also used for tasks like BEV retrieval and dense captioning.
- BDD-X (Berkeley DeepDrive eXplanation) : This dataset provides textual explanations for driving actions, making it particularly relevant for training and evaluating interpretable AD models. DriveGPT4 and ADAPT are evaluated on BDD-X. It contains video sequences with corresponding control signals and natural language narrations/reasoning.
- Waymo Open Dataset (WOD) / Waymo Open Motion Dataset (WOMD) : A large and diverse dataset suite; WOD provides high-resolution sensor data, including LiDAR and camera imagery, while WOMD targets motion forecasting. Used in works like OmniDrive for Q&A data generation and by LLMs Powered Context-aware Motion Prediction. Also used for scene simulation in ChatSim. WOMD-Reasoning is a language dataset built upon WOMD focusing on interaction descriptions and driving intentions.
- DriveLM : A benchmark and dataset focusing on driving with graph visual question answering. TS-VLM is evaluated on DriveLM. It aims to assess perception, prediction, and planning reasoning through QA pairs in a directed graph, with versions for CARLA and nuScenes.
- LingoQA : A benchmark and dataset specifically designed for video question answering in autonomous driving. It contains over 419k QA pairs from 28k unique video scenarios, covering driving reasoning, object recognition, action justification, and scene description. It also proposes the Lingo-Judge evaluation metric.
- DriveAction : Built from real-world driving data actively collected by users of production-level autonomous vehicles, ensuring broad and representative scenario coverage. It provides high-level discrete behavior labels collected directly from users' actual driving operations, and implements a behavior-based, tree-structured evaluation framework that explicitly links vision, language, and behavioral tasks to support comprehensive, task-specific evaluation.
- CARLA Simulator & Datasets : While a simulator, CARLA is extensively used to generate data and evaluate AD models in closed-loop settings. Works like LeapAD, LMDrive, and LangProp use CARLA for experiments and data collection. DriveLM-Carla is a specific dataset generated using CARLA.
- Argoverse : A dataset suite with a focus on motion forecasting, 3D tracking, and HD maps. Argoverse 2 is used in challenges like 3D Occupancy Forecasting.
- KITTI : One of the pioneering datasets for autonomous driving, still used for tasks like 3D object detection and tracking.
- Cityscapes : Focuses on semantic understanding of urban street scenes, primarily for semantic segmentation.
- UCU Dataset (In-Cabin User Command Understanding) : Part of the LLVM-AD Workshop, this dataset contains 1,099 labeled user commands for autonomous vehicles, designed for training models to understand human instructions within the vehicle.
- MAPLM (Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding) : Also from the LLVM-AD Workshop, MAPLM combines point cloud BEV and panoramic images for rich road scenario images and multi-level scene description data, used for QA tasks.
- NuPrompt : A large-scale language prompt set based on nuScenes for driving scenes, consisting of 3D object-text pairs, used in Prompt4Driving.
- nuDesign : A large-scale dataset (2.3M sentences) constructed upon nuScenes via a rule-based auto-labeling methodology for 3D dense captioning.
- LaMPilot : An interactive environment and dataset designed for evaluating LLM-based agents in a driving context, containing scenes for command tracking tasks.
- DRAMA (Joint Risk Localization and Captioning in Driving) : Provides linguistic descriptions (with a focus on reasons) of driving risks associated with important objects.
- Rank2Tell : A multimodal ego-centric dataset for ranking importance levels of objects/events and generating textual reasons for the importance.
- HighwayEnv : A collection of environments for autonomous driving and tactical decision-making research, often used for RL-based approaches and LLM decision-making evaluations (e.g., by DiLu, MTD-GPT); a minimal usage sketch follows this list.
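As referenced in the HighwayEnv entry above, the sketch below rolls out a random policy in a highway-env environment. The environment id, gymnasium-style API, and 5-tuple step signature reflect recent highway-env releases and may differ in older versions; the random action is a placeholder for an LLM- or DRL-based policy.

```python
# Minimal HighwayEnv rollout with a random policy (API may vary by version).
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway-* environments)

env = gym.make("highway-v0")
obs, info = env.reset(seed=0)
terminated = truncated = False
total_reward = 0.0
while not (terminated or truncated):
    action = env.action_space.sample()  # placeholder for an LLM/DRL policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
env.close()
print(f"episode return: {total_reward:.2f}")
```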
These repositories offer broader collections of resources that may overlap with or complement the focus of this list.
- Awesome-LLM4AD (LLM for Autonomous Driving):
- GitHub: https://github.com/Thinklab-SJTU/Awesome-LLM4AD
- Alternative Link: http://codesandbox.io/p/github/sorokinvld/Awesome-LLM4AD
- Description: A curated list of research papers about LLM-for-Autonomous-Driving, categorized by planning, perception, question answering, and generation. Continuously updated.
- Awesome-VLLMs (Vision Large Language Models):
- GitHub: https://github.com/JackYFL/awesome-VLLMs
- Description: Collects papers on Visual Large Language Models, with a dedicated section for "Vision-to-action" including "Autonomous driving" (Perception, Planning, Prediction).
- Awesome-Data-Centric-Autonomous-Driving:
- GitHub: https://github.com/LincanLi98/Awesome-Data-Centric-Autonomous-Driving
- Description: Focuses on data-driven AD solutions, including datasets, data mining, and closed-loop technologies. Mentions the role of LLMs/VLMs in scene understanding and decision-making.
- Awesome-World-Model (for Autonomous Driving and Robotics):
- GitHub: https://github.com/LMD0311/Awesome-World-Model
- Description: Records, tracks, and benchmarks recent World Models for AD or Robotics, supplementary to a survey paper. Includes many VLA-related and generative model papers.
- Awesome-Multimodal-LLM-Autonomous-Driving:
- GitHub: https://github.com/IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving
- Description: A systematic investigation of Multimodal LLMs in autonomous driving, covering background, tools, frameworks, datasets, and future directions.
- Awesome VLM Architectures:
- GitHub: https://github.com/gokayfem/awesome-vlm-architectures
- Description: Contains information on famous Vision Language Models (VLMs), including details about their architectures, training procedures, and datasets.
If you find this repository useful, please consider citing this list:
@misc{liu2025lm4decisionofad,
  title        = {Awesome-LM-AD-Decision},
  author       = {Jiaqi Liu and Chengkai Xu},
  howpublished = {GitHub repository},
  url          = {https://github.com/Jiaaqiliu/Awesome-LM-AD-Decision},
  year         = {2025},
}