MedM-VL is a modular, LLaVA-based codebase for medical LVLMs, supporting flexible customization of encoders, connectors, and LLMs.
MedM-VL focuses on small-scale medical LVLMs, designed for direct deployment in real-world medical scenarios or efficient fine-tuning on downstream tasks.
- [2025.04.10]: The model weights (v1.0) have been uploaded to Hugging Face.
- [2025.04.06]: The technical report has been released on arXiv.
- [2024.12.19]: The complete code has been released on GitHub.
MedM-VL (v1.0, single-image input; more details on Hugging Face):
- shiym2000/MedM-VL-2D-3B-en: trained on 2D medical images and English medical texts.
- shiym2000/MedM-VL-CT-Chest-3B-en: trained on 3D chest CT volumes and English medical texts.
```shell
# 1. clone and navigate
git clone https://github.com/MSIIP/MedM-VL.git
cd MedM-VL

# 2. create a conda environment, activate it, and install packages
conda create -n medm python=3.10
conda activate medm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
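After installation, a quick sanity check can confirm that the activated environment resolves the expected interpreter (the `nvcc` check is only relevant if the `flash-attn` build fails; its absence is normal on CPU-only machines):

```shell
# confirm the conda env's Python is the one on PATH (expected: 3.10.x)
python -c "import sys; print('python', sys.version.split()[0])"
# flash-attn compiles against CUDA at build time; check for nvcc if needed
command -v nvcc >/dev/null && nvcc --version || echo "nvcc not found"
```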
If any parameters are unclear during usage, please refer to Parameter Interpretation.
```shell
# For 2D medical LVLMs
# 1. pre-train (annotation format: docs/example_2d_pretrain.json)
bash scripts/train/MedM-VL-2D/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_2d_finetune.json)
bash scripts/train/MedM-VL-2D/finetune_en.sh

# For 3D medical LVLMs
# 1. pre-train (annotation format: docs/example_3d_pretrain.json)
bash scripts/train/MedM-VL-CT-Chest/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_3d_finetune.json)
bash scripts/train/MedM-VL-CT-Chest/finetune_en.sh

# Note: the annotation file format is identical for pre-training and
# fine-tuning; the former uses image-text pairs, while the latter uses
# instruction-tuning data.
```
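The annotation files follow a LLaVA-style conversation schema. The sketch below is a hypothetical example: the field names and the image path are assumptions for illustration, and the authoritative format is the one in docs/example_2d_finetune.json.

```shell
# Hypothetical LLaVA-style annotation entry; the real schema is defined by
# docs/example_2d_finetune.json, and the image path is a placeholder.
cat > example_finetune_sketch.json <<'EOF'
[
  {
    "image": "images/chest_xray_0001.png",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the findings in this chest X-ray."},
      {"from": "gpt", "value": "The lungs are clear, with no focal consolidation or effusion."}
    ]
  }
]
EOF
# check that the sketch is well-formed JSON
python -m json.tool example_finetune_sketch.json > /dev/null && echo "well-formed JSON"
```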
```shell
# For 2D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-2D-3B-en --local-dir work_dirs/MedM-VL-2D-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_2d_finetune.json)
bash scripts/train/finetune_2d.sh

# For 3D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-CT-Chest-3B-en --local-dir work_dirs/MedM-VL-CT-Chest-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_3d_finetune.json)
bash scripts/train/finetune_3d.sh

# You can choose full or LoRA fine-tuning based on available GPU memory.
```
```shell
# For 2D medical LVLMs
# inference (annotation format: docs/example_2d_inference.json)
bash scripts/eval/inference_2d.sh

# For 3D medical LVLMs
# inference (annotation format: docs/example_3d_inference.json)
bash scripts/eval/inference_3d.sh

# Compared to `finetune.json`, `conversations` in `inference.json` lacks
# the final response, which will be generated by the model.
```
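Concretely, an inference entry mirrors a fine-tuning entry but drops the closing model turn. The sketch below is hypothetical (field names and path are assumptions; docs/example_2d_inference.json defines the real format):

```shell
# Hypothetical inference annotation entry: identical to the fine-tuning
# format except the final "gpt" turn is omitted and generated by the model.
cat > example_inference_sketch.json <<'EOF'
[
  {
    "image": "images/chest_xray_0001.png",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the findings in this chest X-ray."}
    ]
  }
]
EOF
```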
```shell
# Launch a Gradio demo locally.
bash scripts/playground.sh
```
Supported module types: Encoder | Connector | LLM.
@article{shi2025medm,
title={MedM-VL: What Makes a Good Medical LVLM?},
author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
journal={arXiv preprint arXiv:2504.04323},
year={2025}
}
We would like to express our gratitude to the following resources:
- TinyLLaVA_Factory - An open-source modular codebase for small-scale large multimodal models (LMMs).