MedM-VL: What Makes a Good Medical LVLM?


(Figure: MedM-VL architecture overview)

MedM-VL is a modular, LLaVA-based codebase for medical LVLMs, supporting flexible customization of encoders, connectors, and LLMs.

MedM-VL focuses on small-scale medical LVLMs, which can be deployed directly in real-world medical scenarios or fine-tuned efficiently on downstream tasks.

📰 News

✨ Features

• MedM-VL (v1.0: single image input; more details on Hugging Face)

📦 Installation

# 1. clone and navigate
git clone https://github.com/MSIIP/MedM-VL.git
cd MedM-VL

# 2. create a conda environment, activate it and install packages
conda create -n medm python=3.10
conda activate medm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
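After installing, a quick import check confirms the environment is usable. Note that flash-attn only builds on CUDA machines, so this check (a generic sketch, not part of the repo) also reports GPU visibility:

```python
# Post-install sanity check: torch must import, and CUDA should be visible
# if you intend to use flash-attn.
import torch

print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```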

🚀 Getting Started

If any parameters are unclear during usage, please refer to Parameter Interpretation.

1. Train a general medical LVLM from scratch

# For 2D medical LVLMs
# 1. pre-train (annotation format: docs/example_2d_pretrain.json)
bash scripts/train/MedM-VL-2D/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_2d_finetune.json)
bash scripts/train/MedM-VL-2D/finetune_en.sh

# For 3D medical LVLMs
# 1. pre-train (annotation format: docs/example_3d_pretrain.json)
bash scripts/train/MedM-VL-CT-Chest/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_3d_finetune.json)
bash scripts/train/MedM-VL-CT-Chest/finetune_en.sh

# In fact, the annotation file format is identical for pre-training and
# fine-tuning; pre-training data comes from image-text pairs, while
# fine-tuning data consists of instruction-tuning examples.
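Since MedM-VL is LLaVA-based, the annotations plausibly follow the LLaVA "conversations" convention. The field names below are assumptions for illustration only; `docs/example_2d_finetune.json` in the repo is the authoritative schema:

```python
import json

# Hypothetical annotation entry in the LLaVA "conversations" style; the field
# names are assumptions -- see docs/example_2d_finetune.json for the real schema.
entries = [
    {
        "image": "images/chest_xray_0001.png",
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the findings in this chest X-ray."},
            {"from": "gpt", "value": "No focal consolidation, effusion, or pneumothorax."},
        ],
    }
]

with open("example_finetune.json", "w") as f:
    json.dump(entries, f, indent=2)

# Round-trip to confirm the file is valid JSON.
with open("example_finetune.json") as f:
    loaded = json.load(f)
print(len(loaded), loaded[0]["conversations"][-1]["from"])
```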

2. Fine-tune a specialized medical LVLM with pre-trained weights

# For 2D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-2D-3B-en --local-dir work_dirs/MedM-VL-2D-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_2d_finetune.json)
bash scripts/train/finetune_2d.sh

# For 3D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-CT-Chest-3B-en --local-dir work_dirs/MedM-VL-CT-Chest-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_3d_finetune.json)
bash scripts/train/finetune_3d.sh

# You can choose full or LoRA fine-tuning based on available GPU memory.
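The memory trade-off the comment mentions can be made concrete: LoRA freezes the base weights and trains only a low-rank update, so gradients and optimizer state shrink drastically. A minimal PyTorch sketch of the idea (illustrative only; the repo's scripts do their own LoRA wiring):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear plus a trainable low-rank update (illustration only)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # low-rank update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(2048, 2048), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")   # under 1% of parameters train
```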

3. Inference

# For 2D medical LVLMs
# inference (annotation format: docs/example_2d_inference.json)
bash scripts/eval/inference_2d.sh

# For 3D medical LVLMs
# inference (annotation format: docs/example_3d_inference.json)
bash scripts/eval/inference_3d.sh

# Compared to `finetune.json`, the `conversations` in `inference.json` lack
# the final response, which will be generated by the model.
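Concretely, an inference entry ends on a user turn. The field names below are assumed from the LLaVA convention; `docs/example_2d_inference.json` in the repo is authoritative:

```python
import json

# Hypothetical inference entry: same assumed schema as the fine-tuning
# annotations, minus the final model response, which the model generates.
inference_entry = {
    "image": "images/ct_slice_0001.png",
    "conversations": [
        {"from": "human", "value": "<image>\nIs there evidence of pleural effusion?"}
    ],
}
print(json.dumps(inference_entry, indent=2))
```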

4. Demo

# Launch a Gradio demo locally.
bash scripts/playground.sh

🤖 Model Zoo

Encoder
  • CLIP (2021)
  • SigLIP (2023)
  • M3D-CLIP (2023)
  • MedM-CLIP

Connector
  • MLP
  • Spatial Pooling
  • Attention Pooling

LLM
  • Phi-2 (2023)
  • Phi-3 (2024)
  • Qwen2.5 (2024)
  • Llama-3.2 (2024)
📖 Citation

@article{shi2025medm,
  title={MedM-VL: What Makes a Good Medical LVLM?},
  author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
  journal={arXiv preprint arXiv:2504.04323},
  year={2025}
}

❤️ Acknowledgements

We would like to express our gratitude to the following resources:

• TinyLLaVA_Factory - An open-source modular codebase for small-scale large multimodal models (LMMs).
