OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
🌐 Homepage | 🤗 Hugging Face | 🏆 Leaderboard | 📑 Paper
Recent advances in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities on text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks, which annotate only the final answers, OCR-Reasoning also annotates the reasoning process. With both the annotated reasoning processes and the final answers, OCR-Reasoning evaluates not only the answers generated by models but also their reasoning, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results reveal the limitations of existing methods: even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy above 50% on OCR-Reasoning, indicating that text-rich image reasoning remains an urgent challenge.
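For intuition, each example in OCR-Reasoning pairs a text-rich image and a question with both a reference reasoning process and a final answer. A hypothetical record might look like the following sketch (written in Python; all field names and values are illustrative only, not the released schema):

# Illustrative record structure only; see the Hugging Face dataset for the actual fields.
example = {
    "image": "receipt_0042.png",                 # text-rich input image (hypothetical filename)
    "question": "How much was spent on beverages in total?",
    "ability": "numerical analysis reasoning",   # one of the 6 core reasoning abilities
    "reasoning": "The coffee costs 4.50 and the tea costs 3.00, so the total is 4.50 + 3.00 = 7.50.",
    "answer": "7.50",
}

Because the reasoning process is annotated, evaluation can score the intermediate steps as well as the final answer.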
- [05/24/2025]: Evaluation of OCR-Reasoning is now supported in VLMEvalKit.
- [05/22/2025]: Our paper is now available on arXiv.
- [05/18/2025]: Released the dataset and evaluation script.
- We concretely define the core sub-abilities of text-rich image reasoning. OCR-Reasoning comprises 1,069 meticulously collected, human-annotated examples spanning 6 core reasoning abilities: spatial reasoning, numerical analysis reasoning, mathematical reasoning, enumerative reasoning, logical reasoning, and multidisciplinary knowledge reasoning.
- The visual information in images is crucial for text-rich image reasoning. When we replace images with their OCR results and feed only the text into LLMs, we observe relatively low accuracy, indicating that text alone is insufficient for solving text-rich image reasoning tasks (a minimal sketch of this OCR-only baseline is shown after this list).
- Existing models still have considerable room for improvement on OCR reasoning tasks. Even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy above 50% on OCR-Reasoning.
- Existing reinforcement learning methods perform poorly on text-rich image reasoning tasks. Designing reinforcement learning approaches tailored to text-rich image reasoning is a promising direction for improving these capabilities.
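A minimal sketch of the OCR-only baseline mentioned above, assuming pytesseract for OCR and an OpenAI-compatible chat API as the text-only LLM (both are illustrative choices for this sketch, not necessarily the setup used in the paper):

# OCR-only baseline sketch: replace the image with its OCR text and query a text-only LLM.
# pytesseract and the OpenAI client are assumptions of this sketch.
from PIL import Image
import pytesseract
from openai import OpenAI

def ocr_only_answer(image_path: str, question: str, model: str = "gpt-4o") -> str:
    ocr_text = pytesseract.image_to_string(Image.open(image_path))  # extract raw text from the image
    prompt = f"Text extracted from the image by OCR:\n{ocr_text}\n\nQuestion: {question}"
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Because layout, tables, and figure content are lost in the plain OCR text, such a baseline illustrates why text alone is insufficient for these tasks.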
We have integrated OCR-Reasoning into the VLMEvalKit framework. For environment configuration and API usage, please refer to the VLMEvalKit documentation. Clone this repo and run the evaluation script; the code will automatically download the images and annotations from Hugging Face.
git clone https://github.com/SCUT-DLVCLab/OCR-Reasoning
cd OCR-Reasoning
python run.py --data OCR_Reasoning --model Qwen2.5-VL-7B-Instruct --verbose
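Other models supported by VLMEvalKit can be evaluated by changing the --model argument; for API-based models, configure the keys as described in the VLMEvalKit documentation (typically via a .env file). The model name below is only an example and depends on your VLMEvalKit version:

python run.py --data OCR_Reasoning --model GPT4o --verbose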
OCR-Reasoning is licensed under CC BY-NC-SA 4.0.
If you find OCR-Reasoning helpful, please consider giving this repo a ⭐ and citing:
@article{huang2025ocreasoning,
  title={OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning},
  author={Mingxin Huang and Yongxin Shi and Dezhi Peng and Songxuan Lai and Zecheng Xie and Lianwen Jin},
  journal={arXiv preprint arXiv:2505.17163},
  year={2025},
}
Thanks for your support!