
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

🤖 Homepage | 🤗 Dataset | 📖 Paper

Shengyuan Liu¹*, Boyun Zheng¹*, Wenting Chen²*, Zhihao Peng¹, Zhenfei Yin³, Jing Shao⁴, Jiancong Hu⁵, Yixuan Yuan¹✉

¹Chinese University of Hong Kong   ²City University of Hong Kong   ³University of Oxford

⁴Shanghai AI Laboratory   ⁵The Sixth Affiliated Hospital, Sun Yat-sen University

* Equal contributions. ✉ Corresponding author.

This repository is the official implementation of the paper EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis.

🚀 Overview

In this paper, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice along multiple capability dimensions.

EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual-prompting granularity, resulting in 6,832 rigorously validated VQA pairs drawn from 21 diverse datasets.

EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning.

[Figure: EndoBench overview]
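The VQA pairs are also released on Hugging Face (see the 🤗 Dataset link above). Below is a minimal loading sketch using the datasets library; the repository ID, split name, and field names are assumptions for illustration, so check the dataset card for the actual schema:

from datasets import load_dataset

# Hypothetical repo ID, split, and field names -- consult the dataset card for the real schema.
ds = load_dataset("CUHK-AIM-Group/EndoBench", split="test")
sample = ds[0]
print(sample["question"])  # question text of one VQA pair
print(sample["answer"])    # ground-truth option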

📦 Evaluation

We provide a comprehensive evaluation of existing MLLMs on EndoBench. To reproduce the evaluation:

  1. This project is built upon VLMEvalKit. Visit the VLMEvalKit Quickstart Guide for installation instructions, or run the following commands for a quick start:

git clone https://github.com/CUHK-AIM-Group/EndoBench.git
cd EndoBench
pip install -e .
  2. You can evaluate your model with the following command:
python run.py --data EndoBench --model Your_model_name 
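For large open-source models, VLMEvalKit also supports data-parallel inference across multiple GPUs via torchrun (one model instance per GPU); a sketch assuming a 2-GPU node:

torchrun --nproc-per-node=2 run.py --data EndoBench --model Your_model_name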

Demo: Qwen2.5-VL-7B-Instruct on EndoBench, inference only:

python run.py --data EndoBench --model Qwen2.5-VL-7B-Instruct --mode infer
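To benchmark proprietary models (e.g., the Gemini and GPT series evaluated in the paper), VLMEvalKit reads API keys from a .env file in the repository root. A minimal sketch, assuming VLMEvalKit's standard key names:

# .env in the repository root (keep it out of version control)
OPENAI_API_KEY=sk-...   # GPT-series models and GPT-based answer extraction
GOOGLE_API_KEY=...      # Gemini-series models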

We provide a performance comparison of existing MLLMs across the 4 major categories in EndoBench:

[Figure: performance comparison across 4 major categories]

For more details, please see our paper.

🔍 Insights

  1. Endoscopy remains a challenging domain for MLLMs, with a significant gap between models and human expertise. Human experts achieve an average accuracy of 74.12% on endoscopy tasks, while the top-performing model, Gemini-2.5-Pro, reaches only 49.53%, a gap of roughly 25 percentage points.

  2. Medical domain-specific supervised fine-tuning markedly boosts model performance. Medical models that underwent domain-specific supervised fine-tuning, such as MedDr and HuatuoGPT-Vision-34B, perform exceptionally well on tasks like landmark identification and organ recognition, even outperforming all proprietary models.

  3. Model performance varies with visual-prompt format, exposing a gap between visual perception and medical comprehension: a model's ability to understand spatial information varies significantly with how the visual prompt is presented.

  4. Polyp counting exposes dual challenges in lesion identification and numerical reasoning. Our findings highlight the importance of incorporating domain-specific medical knowledge into MLLMs to enhance their performance in tasks that combine visual analysis with clinical expertise.

🎈 Acknowledgements

We greatly appreciate the tremendous effort behind the following projects!

Note: This dataset is built upon multiple public datasets, whose sources are clearly indicated in the paper. Users should abide by the relevant licenses and terms of use of the original datasets: Kvasir, HyperKvasir, Kvasir-Capsule, GastroVision, KID, WCEBleedGen, SEE-AI, Kvasir-Seg, CVC-ColonDB, ETIS-Larib, CVC-ClinicDB, CVC-300, EDD2020, SUN-Database, LDPolypVideo, PolypGen, Cholec80, EndoVis-17, EndoVis-18, and PSI-AVA.

We greatly appreciate all the authors of these datasets for their contributions to the field of endoscopy analysis.

📜 Citation

If you find this work helpful for your project, please consider citing our paper.

@article{liu2025endobench,
  author={Shengyuan Liu and Boyun Zheng and Wenting Chen and Zhihao Peng and Zhenfei Yin and Jing Shao and Jiancong Hu and Yixuan Yuan},
  title={EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis},
  journal={arXiv preprint arXiv:2505.23601},
  year={2025}
}
