🩺 MedEvalKit: A Unified Medical Evaluation Framework

📖 arXiv Paper · 🤗 Lingshu Models · 🌐 Lingshu Project Page



📌 Introduction

A comprehensive evaluation framework for large language and multimodal models (LLMs/LMMs) in the healthcare domain.
We welcome contributions of new models, benchmarks, or enhanced evaluation metrics!


🔥 Latest News

  • 2025-06-12 - Initial release of MedEvalKit v1.0!

🧪 Supported Benchmarks

| Multimodal Medical Benchmarks | Text-Only Medical Benchmarks |
| --- | --- |
| MMMU-Medical-test | MedQA-USMLE |
| MMMU-Medical-val | MedMCQA |
| PMC_VQA | PubMedQA |
| OmniMedVQA | Medbullets-op4 |
| IU XRAY | Medbullets-op5 |
| MedXpertQA-Multimodal | MedXpertQA-Text |
| CheXpert Plus | SuperGPQA |
| MIMIC-CXR | HealthBench |
| VQA-RAD | CMB |
| SLAKE | CMExam |
| PATH-VQA | CMMLU |
| MedFrameQA | MedQA-MCMLE |

🤖 Supported Models

HuggingFace Exclusive

  • BiMediX2
  • BiomedGPT
  • HealthGPT
  • Janus
  • Med_Flamingo
  • MedDr
  • MedGemma
  • NVILA
  • VILA_M3

HF + vLLM Compatible

  • HuatuoGPT-vision
  • InternVL
  • Llama_3.2-vision
  • LLava
  • LLava_Med
  • Qwen2_5_VL
  • Qwen2_VL

🛠️ Installation

# Clone repository
git clone https://github.com/DAMO-NLP-SG/MedEvalKit
cd MedEvalKit

# Install dependencies
pip install -r requirements.txt
pip install 'open_clip_torch[training]'
pip install flash-attn --no-build-isolation

# For LLaVA-like models
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
cd LLaVA-NeXT && pip install -e .

📂 Dataset Preparation

HuggingFace Datasets (Direct Access)

# Set DATASETS_PATH='hf'
VQA-RAD: flaviagiammarino/vqa-rad
SuperGPQA: m-a-p/SuperGPQA
PubMedQA: openlifescienceai/pubmedqa
PATHVQA: flaviagiammarino/path-vqa
MMMU: MMMU/MMMU
MedQA-USMLE: GBaker/MedQA-USMLE-4-options
MedQA-MCMLE: shuyuej/MedQA-MCMLE-Benchmark
Medbullets_op4: tuenguyen/Medical-Eval-MedBullets_op4
Medbullets_op5: LangAGI-Lab/medbullets_op5
CMMLU: haonan-li/cmmlu
CMExam: fzkuji/CMExam
CMB: FreedomIntelligence/CMB
MedFrameQA: SuhaoYu1020/MedFrameQA
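The HF-hosted benchmarks above can be captured in a small lookup table. The sketch below (the `resolve_dataset` helper is hypothetical, not part of MedEvalKit) shows how a script might map a benchmark name to its Hugging Face hub id before downloading:

```python
# Map benchmark names to the Hugging Face dataset ids listed above.
# This table and helper are illustrative, not MedEvalKit's actual code.
HF_DATASET_IDS = {
    "VQA-RAD": "flaviagiammarino/vqa-rad",
    "SuperGPQA": "m-a-p/SuperGPQA",
    "PubMedQA": "openlifescienceai/pubmedqa",
    "PATHVQA": "flaviagiammarino/path-vqa",
    "MMMU": "MMMU/MMMU",
    "MedQA-USMLE": "GBaker/MedQA-USMLE-4-options",
    "MedQA-MCMLE": "shuyuej/MedQA-MCMLE-Benchmark",
    "Medbullets_op4": "tuenguyen/Medical-Eval-MedBullets_op4",
    "Medbullets_op5": "LangAGI-Lab/medbullets_op5",
    "CMMLU": "haonan-li/cmmlu",
    "CMExam": "fzkuji/CMExam",
    "CMB": "FreedomIntelligence/CMB",
    "MedFrameQA": "SuhaoYu1020/MedFrameQA",
}

def resolve_dataset(name: str) -> str:
    """Return the HF hub id for a benchmark, or raise a clear error."""
    try:
        return HF_DATASET_IDS[name]
    except KeyError:
        raise ValueError(
            f"{name!r} is not HF-hosted; see the local-download table below."
        )
```

With the `datasets` library installed, `load_dataset(resolve_dataset("PubMedQA"))` would then fetch the benchmark directly from the hub.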

Local Datasets (Manual Download Required)

| Dataset | Source |
| --- | --- |
| MedXpertQA | TsinghuaC3I |
| SLAKE | BoKelvin |
| PMCVQA | RadGenome |
| OmniMedVQA | foreverbeliever |
| MIMIC_CXR | MIMIC_CXR |
| IU_Xray | IU_Xray |
| CheXpert Plus | CheXpert Plus |
| HealthBench | Normal, Hard, Consensus |
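Since these benchmarks must be downloaded by hand, a quick pre-flight check can catch a missing directory before a long evaluation run. This is a hypothetical helper; the assumption of one directory per benchmark under a common root is illustrative, not MedEvalKit's required layout:

```python
from pathlib import Path

# Benchmarks from the table above that require a manual local download.
LOCAL_BENCHMARKS = [
    "MedXpertQA", "SLAKE", "PMCVQA", "OmniMedVQA",
    "MIMIC_CXR", "IU_Xray", "CheXpert_Plus", "HealthBench",
]

def missing_benchmarks(root: str) -> list[str]:
    """Return the benchmarks with no matching directory under `root`."""
    base = Path(root)
    return [name for name in LOCAL_BENCHMARKS if not (base / name).is_dir()]
```

Running `missing_benchmarks("datasets")` before launching `eval.sh` would list anything still to download.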

🚀 Quick Start

1. Configure eval.sh

#!/bin/bash
export HF_ENDPOINT=https://hf-mirror.com
# MMMU-Medical-test,MMMU-Medical-val,PMC_VQA,MedQA_USMLE,MedMCQA,PubMedQA,OmniMedVQA,Medbullets_op4,Medbullets_op5,MedXpertQA-Text,MedXpertQA-MM,SuperGPQA,HealthBench,IU_XRAY,CheXpert_Plus,MIMIC_CXR,CMB,CMExam,CMMLU,MedQA_MCMLE,VQA_RAD,SLAKE,PATH_VQA,MedFrameQA
EVAL_DATASETS="Medbullets_op4" 
DATASETS_PATH="hf"
OUTPUT_PATH="eval_results/{}"
# TestModel,Qwen2-VL,Qwen2.5-VL,BiMediX2,LLava_Med,Huatuo,InternVL,Llama-3.2,LLava,Janus,HealthGPT,BiomedGPT,Vllm_Text,MedGemma,Med_Flamingo,MedDr
MODEL_NAME="Qwen2.5-VL"
MODEL_PATH="Qwen2.5-VL-7B-Instruct"

#vllm setting
CUDA_VISIBLE_DEVICES="0"
TENSOR_PARALLEL_SIZE="1"
USE_VLLM="False"

#Eval setting
SEED=42
REASONING="False"
TEST_TIMES=1


# Eval LLM setting
MAX_NEW_TOKENS=8192
MAX_IMAGE_NUM=6
TEMPERATURE=0
TOP_P=0.0001
REPETITION_PENALTY=1

# LLM judge setting
USE_LLM_JUDGE="True"
# gpt api model name
GPT_MODEL="gpt-4.1-2025-04-14"
OPENAI_API_KEY=""


# pass hyperparameters and run the Python script
python eval.py \
    --eval_datasets "$EVAL_DATASETS" \
    --datasets_path "$DATASETS_PATH" \
    --output_path "$OUTPUT_PATH" \
    --model_name "$MODEL_NAME" \
    --model_path "$MODEL_PATH" \
    --seed $SEED \
    --cuda_visible_devices "$CUDA_VISIBLE_DEVICES" \
    --tensor_parallel_size "$TENSOR_PARALLEL_SIZE" \
    --use_vllm "$USE_VLLM" \
    --max_new_tokens "$MAX_NEW_TOKENS" \
    --max_image_num "$MAX_IMAGE_NUM" \
    --temperature "$TEMPERATURE"  \
    --top_p "$TOP_P" \
    --repetition_penalty "$REPETITION_PENALTY" \
    --reasoning "$REASONING" \
    --use_llm_judge "$USE_LLM_JUDGE" \
    --judge_gpt_model "$GPT_MODEL" \
    --openai_api_key "$OPENAI_API_KEY" \
    --test_times "$TEST_TIMES" 
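Note that boolean options such as `USE_VLLM`, `REASONING`, and `USE_LLM_JUDGE` reach `eval.py` as the literal strings `"True"`/`"False"`, which are both truthy in Python. A minimal sketch of the kind of converter such a script presumably wires into `argparse` (the `str2bool` helper is an assumption for illustration, not MedEvalKit's actual code):

```python
import argparse

def str2bool(v: str) -> bool:
    """Convert shell-style boolean strings into real Python bools."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("true", "1", "yes"):
        return True
    if v.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {v!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--use_vllm", type=str2bool, default=False)
args = parser.parse_args(["--use_vllm", "False"])
# args.use_vllm is the bool False here, not the truthy string "False"
```

Without a converter like this, `bool("False")` evaluates to `True`, silently enabling every flag.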

2. Run Evaluation

chmod +x eval.sh  # Add execute permission
./eval.sh

📜 Citation

@misc{lasateam2025lingshugeneralistfoundationmodel,
      title={Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning}, 
      author={LASA Team and Weiwen Xu and Hou Pong Chan and Long Li and Mahani Aljunied and Ruifeng Yuan and Jianyu Wang and Chenghao Xiao and Guizhen Chen and Chaoqun Liu and Zhaodonghui Li and Yu Sun and Junao Shen and Chaojun Wang and Jie Tan and Deli Zhao and Tingyang Xu and Hao Zhang and Yu Rong},
      year={2025},
      eprint={2506.07044},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.07044}, 
}
Built with ❤️ by the DAMO Academy Medical AI Team
