Drive4C: A Closed-Loop Benchmark on What Foundation Models Really Need to Be Capable of for Language-Guided Autonomous Driving

arXiv PDF YouTube

🧭 Overview

Drive4C is a capability-driven, closed-loop benchmark designed to evaluate multimodal large language models (MLLMs) in the context of language-guided autonomous driving. It decomposes the evaluation process into core understanding capabilities to identify specific model limitations and areas for targeted improvement.

πŸ“– Abstract

Language-guided autonomous driving has emerged as a promising paradigm in autonomous systems development, leveraging the open-context description, reasoning, and interpretation capabilities of multimodal large language models (MLLMs). However, existing benchmarks provide only overall scores and fail to assess the core capabilities required for language-guided driving; they do not reveal why models struggle with autonomous navigation, which limits targeted improvements.

We present Drive4C, a novel closed-loop benchmark for systematically evaluating MLLMs based on four core capabilities derived from human driver requirements: semantic, spatial, temporal, and physical understanding. Drive4C separates the evaluation into scenario description, scenario anticipation, and language-guided motion, allowing for fine-grained capability assessment. A two-step evaluation process of question-answering and instruction-based driving tasks enables a modular, capability-specific performance analysis.

Experimental results show that state-of-the-art models perform well in semantic understanding and scenario anticipation, but struggle with spatial, temporal, and physical understanding, uncovering the potential for targeted model improvements.
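Since the benchmark code has not been released yet (see News below), the following is a minimal, purely illustrative sketch of how the capability-wise, two-step evaluation described above could be organized. All names (`Capability`, `ScenarioResult`, `evaluate_model`, `run_qa_task`, `run_driving_task`) are hypothetical placeholders and not part of the Drive4C codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Capability(Enum):
    """Core capabilities and composite tasks evaluated by Drive4C."""
    SEMANTIC = "SEM"
    SPATIAL = "SPA"
    TEMPORAL = "TEM"
    PHYSICAL = "PHY"
    ANTICIPATION = "ANT"
    LANGUAGE_GUIDED_MOTION = "LGM"


@dataclass
class ScenarioResult:
    """Scores a single model obtains on one driving scenario."""
    qa_score: float        # step 1: question-answering on the scenario
    driving_score: float   # step 2: instruction-based closed-loop driving


def evaluate_model(
    model_name: str,
    scenarios: Dict[Capability, List[dict]],
    run_qa_task: Callable[[str, dict], float],       # hypothetical: query the MLLM with a QA prompt
    run_driving_task: Callable[[str, dict], float],  # hypothetical: closed-loop rollout in the simulator
) -> Dict[Capability, float]:
    """Aggregate per-capability scores by averaging over each capability's scenarios."""
    scores: Dict[Capability, float] = {}
    for capability, scenario_list in scenarios.items():
        results = [
            ScenarioResult(
                qa_score=run_qa_task(model_name, scenario),
                driving_score=run_driving_task(model_name, scenario),
            )
            for scenario in scenario_list
        ]
        # Simple unweighted mean of both steps; the actual Drive4C aggregation may differ.
        scores[capability] = sum(
            0.5 * (r.qa_score + r.driving_score) for r in results
        ) / max(len(results), 1)
    return scores
```

The aggregation shown here is simplified; please refer to the paper for the exact per-capability scoring and how the overall score reported below is computed.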

πŸ“Ή Demo Video

Watch the video

πŸ”₯ News

  • [04/25] Drive4C accepted at CVPR WDFM-AD 2025 πŸŽ‰
  • [TBD] Benchmark code to be released soon

🧠 Evaluated Models

| Model | Semantic (SEM) | Spatial (SPA) | Temporal (TEM) | Physical (PHY) | Anticipation (ANT) | Language-Guided Motion (LGM) | Score |
|---|---|---|---|---|---|---|---|
| Dolphins | 0.4241 | 0.0587 | 0.2182 | 0.0162 | 0.4720 | 0.0448 | 0.1413 |
| Llama-3.2-11B-Vision | 0.4820 | 0.1461 | 0.1994 | 0.0802 | 0.5769 | 0.0268 | 0.1619 |
| Phi-4-Multimodal | 0.7482 | 0.1959 | 0.2217 | 0.0367 | 0.4428 | 0.0388 | 0.1839 |
| SmolVLM | 0.7256 | 0.3153 | 0.2223 | 0.0813 | 0.5772 | 0.0186 | 0.2015 |
| DriveMM | 0.8059 | 0.2776 | 0.2937 | 0.0367 | 0.4376 | 0.0970 | 0.2337 |
| Gemma 3-27B-it | 0.8445 | 0.3076 | 0.2542 | 0.1726 | 0.6540 | 0.1049 | 0.2757 |
| GPT-4o | 0.8422 | 0.3587 | 0.3498 | 0.1703 | 0.6421 | 0.1298 | 0.3012 |

πŸ“ Paper and Citation

If you find our work useful, please consider citing us!

@InProceedings{Sohn_2025_CVPR,
    author    = {Sohn, Tin Stribor and Dillitzer, Maximilian and Bach, Johannes and Corso, Jason J. and Br\"uhl, Tim and Schwager, Robin and Eberhardt, Tim Dieter and Sax, Eric},
    title     = {Drive4C: A Closed-Loop Benchmark on What Foundation Models Really Need to Be Capable of for Language-Guided Autonomous Driving},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3859-3869}
}
@misc{sohn2025frameworkcapabilitydrivenevaluationscenario,
      title={A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving}, 
      author={Tin Stribor Sohn and Philipp Reis and Maximilian Dillitzer and Johannes Bach and Jason J. Corso and Eric Sax},
      year={2025},
      eprint={2503.11400},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.11400}, 
}

(back to top)
