Drive4C is a capability-driven, closed-loop benchmark designed to evaluate multimodal large language models (MLLMs) in the context of language-guided autonomous driving. It decomposes the evaluation process into core understanding capabilities to identify specific model limitations and areas for targeted improvement.
Language-guided autonomous driving has emerged as a promising paradigm in autonomous systems development, leveraging the open-context description, reasoning, and interpretation capabilities of MLLMs. However, existing benchmarks provide only overall scores and fail to assess the core capabilities required for language-guided driving, so they cannot reveal why models struggle with autonomous navigation, which limits targeted improvements.
We present Drive4C, a novel closed-loop benchmark for systematically evaluating MLLMs on four core capabilities derived from human driver requirements: semantic, spatial, temporal, and physical understanding. Drive4C separates the evaluation into scenario description, scenario anticipation, and language-guided motion, allowing for fine-grained capability assessment. The two-step evaluation process of question answering and instruction-based driving tasks enables a modular, capability-specific performance analysis.
Experimental results show that state-of-the-art models perform well in semantic understanding and scenario anticipation but struggle with spatial, temporal, and physical understanding, revealing clear opportunities for targeted model improvement.
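
To illustrate the capability-specific side of the two-step evaluation, below is a minimal sketch of how per-capability scores could be aggregated from question-answering results tagged by capability. The record format, field names, and equal per-question weighting are assumptions for illustration only, not the released benchmark code (the actual evaluation additionally includes the closed-loop, instruction-based driving tasks).

```python
from collections import defaultdict

# Illustrative QA records; in practice each question from the benchmark's
# question-answering step would be tagged with the capability it probes
# and assigned a correctness score in [0, 1]. (Assumed format.)
qa_results = [
    {"capability": "semantic", "score": 1.0},
    {"capability": "spatial",  "score": 0.0},
    {"capability": "temporal", "score": 0.5},
    {"capability": "physical", "score": 0.0},
]

def capability_scores(results):
    """Average per-question scores within each capability."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in results:
        totals[record["capability"]] += record["score"]
        counts[record["capability"]] += 1
    return {cap: totals[cap] / counts[cap] for cap in totals}

print(capability_scores(qa_results))
# e.g. {'semantic': 1.0, 'spatial': 0.0, 'temporal': 0.5, 'physical': 0.0}
```
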
- [04/25] Drive4C accepted at CVPR WDFM-AD 2025
- [TBD] Benchmark code to be released soon
Model | SEM (Semantic) | SPA (Spatial) | TEM (Temporal) | PHY (Physical) | ANT (Anticipation) | LGM (Lang.-Guided Motion) | Score |
---|---|---|---|---|---|---|---|
Dolphins | 0.4241 | 0.0587 | 0.2182 | 0.0162 | 0.4720 | 0.0448 | 0.1413 |
Llama-3.2-11B-Vision | 0.4820 | 0.1461 | 0.1994 | 0.0802 | 0.5769 | 0.0268 | 0.1619 |
Phi-4-Multimodal | 0.7482 | 0.1959 | 0.2217 | 0.0367 | 0.4428 | 0.0388 | 0.1839 |
SmolVLM | 0.7256 | 0.3153 | 0.2223 | 0.0813 | 0.5772 | 0.0186 | 0.2015 |
DriveMM | 0.8059 | 0.2776 | 0.2937 | 0.0367 | 0.4376 | 0.0970 | 0.2337 |
Gemma 3-27B-it | 0.8445 | 0.3076 | 0.2542 | 0.1726 | 0.6540 | 0.1049 | 0.2757 |
GPT-4o | 0.8422 | 0.3587 | 0.3498 | 0.1703 | 0.6421 | 0.1298 | 0.3012 |
If you find our work useful, please consider citing us!
@InProceedings{Sohn_2025_CVPR,
author = {Sohn, Tin Stribor and Dillitzer, Maximilian and Bach, Johannes and Corso, Jason J. and Br\"uhl, Tim and Schwager, Robin and Eberhardt, Tim Dieter and Sax, Eric},
title = {Drive4C: A Closed-Loop Benchmark on What Foundation Models Really Need to Be Capable of for Language-Guided Autonomous Driving},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
month = {June},
year = {2025},
pages = {3859-3869}
}
@misc{sohn2025frameworkcapabilitydrivenevaluationscenario,
  title = {A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving},
  author = {Tin Stribor Sohn and Philipp Reis and Maximilian Dillitzer and Johannes Bach and Jason J. Corso and Eric Sax},
  year = {2025},
  eprint = {2503.11400},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2503.11400},
}