Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
In clinical practice, physicians routinely operate in highly multimodal environments, where medical imaging plays a central role in diagnosis, treatment planning, and surgical decision-making. Accurate interpretation of imaging data is indispensable, as it provides critical evidence that complements textual reports, laboratory results, and patient history. Consequently, any artificial intelligence system intended for clinical deployment must be capable of integrating visual and textual information at a fine-grained, pixel-level resolution while supporting structured reasoning and clinically grounded decision-making.
Existing medical imaging models are largely designed as expert systems specialized for narrow tasks such as lesion detection, segmentation, classification, or report generation. These models often require multiple specialized networks to cover different organs, disease types, or diagnostic tasks, and they rarely generalize effectively across diverse clinical scenarios. While large-scale language and multimodal models have demonstrated remarkable progress, including strong reasoning capabilities and multi-task generalization, applying them to real-world clinical settings remains challenging.
Clinical tasks demand not only multimodal understanding but also precise visual grounding and integrated chain-of-thought reasoning to interpret complex medical data, support decision-making workflows, and provide reliable second opinions with explainability and clinical fidelity. Existing multimodal medical approaches often fail to provide pixel-level, fine-grained visual insights or to integrate heterogeneous data modalities effectively, which limits their utility in comprehensive diagnostic reasoning.
Building upon our prior work, Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support, which introduced a language-based medical foundation model incorporating expert-inspired reasoning pathways, we now present Citrus-V. This upgraded multimodal medical foundation model addresses the critical need for integrating medical images into clinical decision support systems.
Citrus-V makes the following key contributions to the field of medical AI:
- Unified Integration of Visual and Reasoning Capabilities: We construct a unified model that integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference within a single model.
- Comprehensive Open-Source Data Suite: To facilitate reproducibility and support the research community, we release Citrus-V along with a curated open-source data suite, including:
- A multimodal chain-of-thought reasoning dataset for report generation
- A refined detection and segmentation benchmark with corrected labels
- A medical document understanding benchmark with graded difficulty levels
- Novel Multimodal Training Paradigm: We design a novel multimodal training paradigm to accelerate convergence and enhance generalization across diverse imaging and reasoning tasks.
Extensive experiments demonstrate that Citrus-V surpasses existing open-source medical foundation models and expert-level imaging systems across multiple benchmarks, establishing new state-of-the-art performance in both visual and multimodal tasks. By providing a complete pipeline from visual grounding to clinical reasoning, Citrus-V offers critical support for precise lesion quantification, automated radiology reporting, and reliable second opinions, marking a significant step toward general-purpose medical foundation models and the broader adoption of AI in clinical practice.
- Unified Medical Image Grounding: Advanced techniques for precise localization and understanding of medical images at the pixel level
- Comprehensive Clinical Reasoning: Integration of medical knowledge graphs and clinical guidelines with multimodal chain-of-thought reasoning
- Multi-modal Medical Understanding: Seamless processing of images, text, and structured data from electronic health records
- Medical Image Analysis: Support for various medical imaging modalities (CT, MRI, X-ray, ultrasound, etc.) with detection and segmentation capabilities
- Medical OCR: Specialized optical character recognition for medical documents and reports
- Fine-grained Control: Adjustable parameters for different medical specialties and use cases
- Efficient Training Pipeline: Optimized for medical datasets with packing and streaming capabilities
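The packing optimization mentioned above can be sketched as greedy bin-packing of tokenized samples into fixed-length training sequences, so fewer padding tokens are wasted per step. This is an illustrative sketch under assumed behavior, not the actual Citrus-V packing implementation; the function name and the `max_length` value are ours:

```python
# Greedy sequence packing: group consecutive samples into bins whose
# total token count stays within max_length, reducing padding waste.
# Illustrative sketch only; not the actual Citrus-V packing code.

def pack_sequences(lengths, max_length):
    """Group sample lengths into bins whose totals stay <= max_length."""
    bins = []  # each bin is a list of sample indices
    current, current_total = [], 0
    for idx, n in enumerate(lengths):
        if n > max_length:
            raise ValueError(f"sample {idx} exceeds max_length")
        if current_total + n > max_length:
            bins.append(current)
            current, current_total = [], 0
        current.append(idx)
        current_total += n
    if current:
        bins.append(current)
    return bins

# Example: pack token counts into bins of at most 12288 tokens.
bins = pack_sequences([5000, 6000, 4000, 12000, 300], max_length=12288)
```

A streaming loader would apply the same logic on the fly instead of over a precomputed length list.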
To install Citrus-V:
# Clone the repository
cd citrus-V
# Install dependencies
pip install -r requirements/requirements.txt
# Install the package
pip install -e .
Here's a quick example to get started with Citrus-V for medical image analysis:
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
--model path/to/citrus-v-model \
--image_path path/to/medical_image.jpg \
--task grounding \
--output_dir ./results
CUDA_VISIBLE_DEVICES=0 \
python inference_seg.py \
--model path/to/citrus-v-seg-model \
--image_path path/to/medical_image.jpg \
--output_dir ./segmentation_results
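The grounding task localizes regions in the input image; a minimal post-processing sketch is shown below, assuming the results are JSON records with normalized `[x1, y1, x2, y2]` boxes. This output schema is hypothetical; check the actual files written by `inference.py` for the real format:

```python
import json

# Convert a normalized [x1, y1, x2, y2] box to pixel coordinates.
# The JSON schema here is hypothetical; verify against the actual
# output of inference.py before relying on it.

def bbox_to_pixels(bbox, width, height):
    x1, y1, x2, y2 = bbox
    return [round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height)]

record = json.loads('{"label": "lesion", "bbox": [0.25, 0.40, 0.50, 0.60]}')
pixel_box = bbox_to_pixels(record["bbox"], width=512, height=512)
```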
Citrus-V supports various medical AI tasks through a unified interface. Below are some common usage examples:
# Multi-node training with medical image datasets
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
--model path/to/base-model \
--dataset medsam2d_seg_wooverlap_29k_20250901 medseg_bench_10k_oversample \
--train_type full \
--torch_dtype bfloat16 \
--max_length 12288 \
--num_train_epochs 3 \
--learning_rate 1e-4 \
--per_device_train_batch_size 1 \
--output_dir ./citrus-v-medical-checkpoints
# Processing medical documents with OCR and understanding
CUDA_VISIBLE_DEVICES=0 \
python app.py \
--model path/to/citrus-v-model \
--doc_path path/to/medical_report.pdf \
--task medical_doc_qa
Citrus-V supports various medical datasets for training and evaluation:
- Medical Image Datasets: MedSAM, MedSegBench, and custom medical image collections
- Medical Document Datasets: OCR-processed medical reports, prescriptions, and test results
- Clinical Question Answering: Medical Q&A pairs with clinical reasoning chains
- Grounding Datasets: Medical images with detailed annotations and region-specific descriptions
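For illustration, a single grounding training record might look like the JSONL sketch below. The field names (`image`, `query`, `bbox`) are assumptions chosen for readability, not the actual schema of the released datasets:

```python
import json

# One hypothetical grounding annotation: an image path, a region
# description, and a normalized bounding box. Field names are
# illustrative, not the actual Citrus-V dataset schema.
record = {
    "image": "images/chest_xray_0001.png",
    "query": "Locate the nodule in the right upper lobe.",
    "bbox": [0.62, 0.18, 0.74, 0.31],
}

line = json.dumps(record)    # serialize one JSONL line
restored = json.loads(line)  # round-trip check
```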
Citrus-V can be evaluated on various medical AI benchmarks:
# Evaluate on medical image grounding benchmarks
CUDA_VISIBLE_DEVICES=0 \
swift eval \
--model path/to/citrus-v-model \
--eval_dataset medical_grounding_benchmark \
--eval_backend evalscope \
--output_dir ./evaluation_results
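Grounding benchmarks typically match predicted boxes to ground truth by intersection-over-union; the sketch below shows the standard computation on `[x1, y1, x2, y2]` boxes. This is a generic metric sketch, and the evalscope backend may score predictions differently:

```python
# Intersection-over-Union between two [x1, y1, x2, y2] boxes, the
# usual matching criterion for grounding benchmarks. Generic sketch;
# the evalscope backend may compute its metrics differently.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

score = iou([0, 0, 100, 100], [50, 50, 150, 150])  # 50x50 overlap
```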
This project is licensed under the Apache License, Version 2.0. For models and datasets, please refer to their original resource pages and follow the corresponding licenses.
If you use Citrus-V in your research, please cite our work:
@article{citrusv2024,
  title={Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning},
  author={Your Name and Contributors},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
This repository hosts the homepage of Citrus-V.