Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
In clinical practice, physicians routinely operate in highly multimodal environments, where medical imaging plays a central role in diagnosis, treatment planning, and surgical decision-making. Accurate interpretation of imaging data is indispensable, as it provides critical evidence that complements textual reports, laboratory results, and patient history. Consequently, any artificial intelligence system intended for clinical deployment must be capable of integrating visual and textual information at a fine-grained, pixel-level resolution while supporting structured reasoning and clinically grounded decision-making.
Existing medical imaging models are largely designed as expert systems specialized for narrow tasks such as lesion detection, segmentation, classification, or report generation. These models often require multiple specialized networks to cover different organs, disease types, or diagnostic tasks, and they rarely generalize effectively across diverse clinical scenarios. While large-scale language and multimodal models have demonstrated remarkable progress, including strong reasoning capabilities and multi-task generalization, applying them to real-world clinical settings remains challenging.
Clinical tasks demand not only multimodal understanding but also precise visual grounding and integrated chain-of-thought reasoning to interpret complex medical data, support decision-making workflows, and provide reliable second opinions with explainability and clinical fidelity. Existing multimodal medical approaches often fail to provide pixel-level, fine-grained visual insights or to integrate heterogeneous data modalities effectively, which limits their utility in comprehensive diagnostic reasoning.
Building upon our prior work, Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support, which introduced a language-based medical foundation model incorporating expert-inspired reasoning pathways, we now present Citrus-V. This upgraded multimodal medical foundation model addresses the critical need for integrating medical images into clinical decision support systems.
Citrus-V makes the following key contributions to the field of medical AI:
- Unified Integration of Visual and Reasoning Capabilities: We construct a unified model that integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference within a single model.
- Comprehensive Open-Source Data Suite: To facilitate reproducibility and support the research community, we release Citrus-V along with a curated open-source data suite, including:
- A multimodal chain-of-thought reasoning dataset for report generation
- A refined detection and segmentation benchmark with corrected labels
- A medical document understanding benchmark with graded difficulty levels
- Novel Multimodal Training Paradigm: We design a novel multimodal training paradigm to accelerate convergence and enhance generalization across diverse imaging and reasoning tasks.
Extensive experiments demonstrate that Citrus-V surpasses existing open-source medical foundation models and expert-level imaging systems across multiple benchmarks, establishing new state-of-the-art performance in both visual and multimodal tasks. By providing a complete pipeline from visual grounding to clinical reasoning, Citrus-V offers critical support for precise lesion quantification, automated radiology reporting, and reliable second opinions, marking a significant step toward general-purpose medical foundation models and the broader adoption of AI in clinical practice.
- Unified Medical Image Grounding: Advanced techniques for precise localization and understanding of medical images at the pixel level
- Comprehensive Clinical Reasoning: Integration of medical knowledge graphs and clinical guidelines with multimodal chain-of-thought reasoning
- Multi-modal Medical Understanding: Seamless processing of images, text, and structured data from electronic health records
- Medical Image Analysis: Support for various medical imaging modalities (CT, MRI, X-ray, ultrasound, etc.) with detection and segmentation capabilities
- Medical OCR: Specialized optical character recognition for medical documents and reports
- Fine-grained Control: Adjustable parameters for different medical specialties and use cases
- Efficient Training Pipeline: Optimized for medical datasets with packing and streaming capabilities
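The packing optimization mentioned above can be sketched as greedy bin-packing of tokenized samples into fixed-length training sequences, so fewer padding tokens are wasted per step. This is an illustrative sketch under assumed behavior, not the actual Citrus-V packing implementation; the function name and the `max_length` value are ours:

```python
# Greedy sequence packing: group consecutive samples into bins whose
# total token count stays within max_length, reducing padding waste.
# Illustrative sketch only; not the actual Citrus-V packing code.

def pack_sequences(lengths, max_length):
    """Group sample lengths into bins whose totals stay <= max_length."""
    bins = []  # each bin is a list of sample indices
    current, current_total = [], 0
    for idx, n in enumerate(lengths):
        if n > max_length:
            raise ValueError(f"sample {idx} exceeds max_length")
        if current_total + n > max_length:
            bins.append(current)
            current, current_total = [], 0
        current.append(idx)
        current_total += n
    if current:
        bins.append(current)
    return bins

# Example: pack token counts into bins of at most 12288 tokens.
bins = pack_sequences([5000, 6000, 4000, 12000, 300], max_length=12288)
```

A streaming loader would apply the same logic on the fly instead of over a precomputed length list.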
To install Citrus-V:
# Clone the repository
cd citrus-V
# Install dependencies
pip install -r requirements/requirements.txt
# Install the package
pip install -e .
Here's a quick example to get started with Citrus-V for medical image analysis:
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
--model path/to/citrus-v-model \
--image_path path/to/medical_image.jpg \
--task grounding \
--output_dir ./results
CUDA_VISIBLE_DEVICES=0 \
python inference_seg.py \
--model path/to/citrus-v-seg-model \
--image_path path/to/medical_image.jpg \
--output_dir ./segmentation_results
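The grounding task localizes regions in the input image; a minimal post-processing sketch is shown below, assuming the results are JSON records with normalized `[x1, y1, x2, y2]` boxes. This output schema is hypothetical; check the actual files written by `inference.py` for the real format:

```python
import json

# Convert a normalized [x1, y1, x2, y2] box to pixel coordinates.
# The JSON schema here is hypothetical; verify against the actual
# output of inference.py before relying on it.

def bbox_to_pixels(bbox, width, height):
    x1, y1, x2, y2 = bbox
    return [round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height)]

record = json.loads('{"label": "lesion", "bbox": [0.25, 0.40, 0.50, 0.60]}')
pixel_box = bbox_to_pixels(record["bbox"], width=512, height=512)
```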
Citrus-V supports various medical AI tasks through a unified interface. Below are some common usage examples:
# Multi-node training with medical image datasets
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
--model path/to/base-model \
--dataset medsam2d_seg_wooverlap_29k_20250901 medseg_bench_10k_oversample \
--train_type full \
--torch_dtype bfloat16 \
--max_length 12288 \
--num_train_epochs 3 \
--learning_rate 1e-4 \
--per_device_train_batch_size 1 \
--output_dir ./citrus-v-medical-checkpoints
# Processing medical documents with OCR and understanding
CUDA_VISIBLE_DEVICES=0 \
python app.py \
--model path/to/citrus-v-model \
--doc_path path/to/medical_report.pdf \
--task medical_doc_qa
Citrus-V supports various medical datasets for training and evaluation:
- Medical Image Datasets: MedSAM, MedSegBench, and custom medical image collections
- Medical Document Datasets: OCR-processed medical reports, prescriptions, and test results
- Clinical Question Answering: Medical Q&A pairs with clinical reasoning chains
- Grounding Datasets: Medical images with detailed annotations and region-specific descriptions
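For illustration, a single grounding training record might look like the JSONL sketch below. The field names (`image`, `query`, `bbox`) are assumptions chosen for readability, not the actual schema of the released datasets:

```python
import json

# One hypothetical grounding annotation: an image path, a region
# description, and a normalized bounding box. Field names are
# illustrative, not the actual Citrus-V dataset schema.
record = {
    "image": "images/chest_xray_0001.png",
    "query": "Locate the nodule in the right upper lobe.",
    "bbox": [0.62, 0.18, 0.74, 0.31],
}

line = json.dumps(record)    # serialize one JSONL line
restored = json.loads(line)  # round-trip check
```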
Citrus-V can be evaluated on various medical AI benchmarks:
# Evaluate on medical image grounding benchmarks
CUDA_VISIBLE_DEVICES=0 \
swift eval \
--model path/to/citrus-v-model \
--eval_dataset medical_grounding_benchmark \
--eval_backend evalscope \
--output_dir ./evaluation_results
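Grounding benchmarks typically match predicted boxes to ground truth by intersection-over-union; the sketch below shows the standard computation on `[x1, y1, x2, y2]` boxes. This is a generic metric sketch, and the evalscope backend may score predictions differently:

```python
# Intersection-over-Union between two [x1, y1, x2, y2] boxes, the
# usual matching criterion for grounding benchmarks. Generic sketch;
# the evalscope backend may compute its metrics differently.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

score = iou([0, 0, 100, 100], [50, 50, 150, 150])  # 50x50 overlap
```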
This project is licensed under the Apache License, Version 2.0. For models and datasets, please refer to their original resource pages and follow the corresponding licenses.
If you use Citrus-V in your research, please cite our work:
@article{citrusv2024,
  title={Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning},
  author={Your Name and Contributors},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
This repository hosts the homepage of Citrus-V.