
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

🤖 Homepage | 🤗 Dataset | 📖 Paper

Shengyuan Liu¹*, Boyun Zheng¹*, Wenting Chen²*, Zhihao Peng¹, Zhenfei Yin³, Jing Shao⁴, Jiancong Hu⁵, Yixuan Yuan¹✉

¹Chinese University of Hong Kong   ²City University of Hong Kong   ³University of Oxford

⁴Shanghai AI Laboratory   ⁵The Sixth Affiliated Hospital, Sun Yat-sen University

* Equal contributions. ✉ Corresponding author.

This repository is the official implementation of the paper EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis.

🚀 Overview

In this paper, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice along multiple capability dimensions.

EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual-prompting granularity, resulting in 6,832 rigorously validated VQA pairs drawn from 21 diverse datasets.

EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning.

[Figure: EndoBench overview]
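The VQA pairs are also released on Hugging Face (see the 🤗 Dataset link above). Below is a minimal loading sketch using the datasets library; the repository ID, split name, and field names are assumptions for illustration, so check the dataset card for the actual schema:

from datasets import load_dataset

# Hypothetical repo ID, split, and field names -- consult the dataset card for the real schema.
ds = load_dataset("CUHK-AIM-Group/EndoBench", split="test")
sample = ds[0]
print(sample["question"])  # question text of one VQA pair
print(sample["answer"])    # ground-truth option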

📦 Evaluation

We provide a comprehensive evaluation of existing MLLMs on EndoBench. To reproduce the evaluation:

  1. This project is built upon VLMEvalKit. Visit the VLMEvalKit Quickstart Guide for installation instructions, or run the following commands for a quick start:

git clone https://github.com/CUHK-AIM-Group/EndoBench.git
cd EndoBench
pip install -e .
  2. You can evaluate your model with the following command:
python run.py --data EndoBench --model Your_model_name 
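For large open-source models, VLMEvalKit also supports data-parallel inference across multiple GPUs via torchrun (one model instance per GPU); a sketch assuming a 2-GPU node:

torchrun --nproc-per-node=2 run.py --data EndoBench --model Your_model_name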

Demo: Qwen2.5-VL-7B-Instruct on EndoBench, inference only:

python run.py --data EndoBench --model Qwen2.5-VL-7B-Instruct --mode infer
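To benchmark proprietary models (e.g., the Gemini and GPT series evaluated in the paper), VLMEvalKit reads API keys from a .env file in the repository root. A minimal sketch, assuming VLMEvalKit's standard key names:

# .env in the repository root (keep it out of version control)
OPENAI_API_KEY=sk-...   # GPT-series models and GPT-based answer extraction
GOOGLE_API_KEY=...      # Gemini-series models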

We provide a performance comparison of existing MLLMs across the 4 major categories in EndoBench:

[Figure: performance comparison across 4 major categories]

For more details, please see our paper.

🔍 Insights

  1. Endoscopy remains a challenging domain for MLLMs, with a significant gap between models and human expertise. Human experts achieve an average accuracy of 74.12% on endoscopy tasks, while the top-performing model, Gemini-2.5-Pro, reaches only 49.53%, a gap of roughly 25 percentage points.

  2. Medical domain-specific supervised fine-tuning markedly boosts model performance. Medical models that underwent domain-specific supervised fine-tuning, such as MedDr and HuatuoGPT-Vision-34B, perform exceptionally well on tasks like landmark identification and organ recognition, even outperforming all proprietary models.

  3. Model performance varies with visual-prompt format, exposing a gap between visual perception and medical comprehension: a model's ability to understand spatial information varies significantly with how the visual prompt is presented.

  4. Polyp counting exposes dual challenges in lesion identification and numerical reasoning. Our findings highlight the importance of incorporating domain-specific medical knowledge into MLLMs to enhance their performance in tasks that combine visual analysis with clinical expertise.

🎈 Acknowledgements

We greatly appreciate the tremendous effort behind the following projects!

Note: This dataset is built upon multiple public datasets, whose sources are clearly indicated in the paper. Users should abide by the relevant licenses and terms of use of the original datasets: Kvasir, HyperKvasir, Kvasir-Capsule, GastroVision, KID, WCEBleedGen, SEE-AI, Kvasir-Seg, CVC-ColonDB, ETIS-Larib, CVC-ClinicDB, CVC-300, EDD2020, SUN-Database, LDPolypVideo, PolypGen, Cholec80, EndoVis-17, EndoVis-18, and PSI-AVA.

We greatly appreciate all the authors of these datasets for their contributions to the field of endoscopy analysis.

📜 Citation

If you find this work helpful for your project, please consider citing our paper.

@article{liu2025endobench,
  author={Shengyuan Liu and Boyun Zheng and Wenting Chen and Zhihao Peng and Zhenfei Yin and Jing Shao and Jiancong Hu and Yixuan Yuan},
  title={EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis},
  journal={arXiv preprint arXiv:2505.23601},
  year={2025}
}
