OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
🌐 Homepage | 🤗 Hugging Face | 🏆 Leaderboard | 📑 Paper
Recent advances in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities on text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks, which annotate only the final answers, OCR-Reasoning also annotates the reasoning process. With both the annotated reasoning processes and the final answers, OCR-Reasoning evaluates not only the answers generated by models but also their reasoning, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results reveal the limitations of existing methods: even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy above 50% on OCR-Reasoning, indicating that text-rich image reasoning remains an urgent challenge.
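For intuition, each example in OCR-Reasoning pairs a text-rich image and a question with both a reference reasoning process and a final answer. A hypothetical record might look like the following sketch (written in Python; all field names and values are illustrative only, not the released schema):

# Illustrative record structure only; see the Hugging Face dataset for the actual fields.
example = {
    "image": "receipt_0042.png",                 # text-rich input image (hypothetical filename)
    "question": "How much was spent on beverages in total?",
    "ability": "numerical analysis reasoning",   # one of the 6 core reasoning abilities
    "reasoning": "The coffee costs 4.50 and the tea costs 3.00, so the total is 4.50 + 3.00 = 7.50.",
    "answer": "7.50",
}

Because the reasoning process is annotated, evaluation can score the intermediate steps as well as the final answer.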
- [05/24/2025]: Evaluation of OCR-Reasoning is now supported in VLMEvalKit.
- [05/22/2025]: Our paper is now available on arXiv.
- [05/18/2025]: Released the dataset and evaluation script.
- We concretely define the core sub-abilities of text-rich image reasoning. OCR-Reasoning comprises 1,069 meticulously collected, human-annotated examples spanning 6 core reasoning abilities: spatial reasoning, numerical analysis reasoning, mathematical reasoning, enumerative reasoning, logical reasoning, and multidisciplinary knowledge reasoning.
- The visual information in images is crucial for text-rich image reasoning. When we replace images with their OCR results and feed only the text into LLMs, we observe relatively low accuracy, indicating that text alone is insufficient for solving text-rich image reasoning tasks (a minimal sketch of this OCR-only baseline is shown after this list).
- Existing models still have considerable room for improvement on OCR reasoning tasks. Even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy above 50% on OCR-Reasoning.
- Existing reinforcement learning methods perform poorly on text-rich image reasoning tasks. Designing reinforcement learning approaches tailored to text-rich image reasoning is a promising direction for improving these capabilities.
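A minimal sketch of the OCR-only baseline mentioned above, assuming pytesseract for OCR and an OpenAI-compatible chat API as the text-only LLM (both are illustrative choices for this sketch, not necessarily the setup used in the paper):

# OCR-only baseline sketch: replace the image with its OCR text and query a text-only LLM.
# pytesseract and the OpenAI client are assumptions of this sketch.
from PIL import Image
import pytesseract
from openai import OpenAI

def ocr_only_answer(image_path: str, question: str, model: str = "gpt-4o") -> str:
    ocr_text = pytesseract.image_to_string(Image.open(image_path))  # extract raw text from the image
    prompt = f"Text extracted from the image by OCR:\n{ocr_text}\n\nQuestion: {question}"
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Because layout, tables, and figure content are lost in the plain OCR text, such a baseline illustrates why text alone is insufficient for these tasks.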
We have integrated OCR-Reasoning into the VLMEvalKit framework. For environment configuration and API usage, please refer to the VLMEvalKit documentation. Clone this repo and run the evaluation script; the code will automatically download the images and annotations from Hugging Face.
git clone https://github.com/SCUT-DLVCLab/OCR-Reasoning
cd OCR-Reasoning
python run.py --data OCR_Reasoning --model Qwen2.5-VL-7B-Instruct --verbose
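Other models supported by VLMEvalKit can be evaluated by changing the --model argument; for API-based models, configure the keys as described in the VLMEvalKit documentation (typically via a .env file). The model name below is only an example and depends on your VLMEvalKit version:

python run.py --data OCR_Reasoning --model GPT4o --verbose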
OCR-Reasoning is licensed under CC BY-NC-SA 4.0.
If you find OCR-Reasoning helpful, please consider giving this repo a ⭐ and citing:
@article{huang2025ocreasoning,
  title={OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning},
  author={Mingxin Huang and Yongxin Shi and Dezhi Peng and Songxuan Lai and Zecheng Xie and Lianwen Jin},
  journal={arXiv preprint arXiv:2505.17163},
  year={2025},
}
Thanks for your support!