Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning




📝 Introduction

In clinical practice, physicians routinely operate in highly multimodal environments, where medical imaging plays a central role in diagnosis, treatment planning, and surgical decision-making. Accurate interpretation of imaging data is indispensable, as it provides critical evidence that complements textual reports, laboratory results, and patient history. Consequently, any artificial intelligence system intended for clinical deployment must be capable of integrating visual and textual information at a fine-grained, pixel-level resolution while supporting structured reasoning and clinically grounded decision-making.

Existing medical imaging models are largely designed as expert systems specialized for narrow tasks such as lesion detection, segmentation, classification, or report generation. These models often require multiple specialized networks to cover different organs, disease types, or diagnostic tasks, and they rarely generalize effectively across diverse clinical scenarios. While large-scale language and multimodal models have demonstrated remarkable progress, including strong reasoning capabilities and multi-task generalization, applying them to real-world clinical settings remains challenging.

Clinical tasks demand not only multimodal understanding but also precise visual grounding and integrated chain-of-thought reasoning to interpret complex medical data, support decision-making workflows, and provide reliable second opinions with explainability and clinical fidelity. Existing multimodal medical approaches often fail to provide pixel-level, fine-grained visual insights or to integrate heterogeneous data modalities effectively, which limits their utility in comprehensive diagnostic reasoning.

Building upon our prior work, Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support, which introduced a language-based medical foundation model incorporating expert-inspired reasoning pathways, we now present Citrus-V. This upgraded multimodal medical foundation model addresses the critical need for integrating medical images into clinical decision support systems.

✨ Key Contributions

Citrus-V makes the following key contributions to the field of medical AI:

  1. Unified Integration of Visual and Reasoning Capabilities: We construct a unified model that integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference within a single model.

  2. Comprehensive Open-Source Data Suite: To facilitate reproducibility and support the research community, we release Citrus-V along with a curated open-source data suite, including:

    • A multimodal chain-of-thought reasoning dataset for report generation
    • A refined detection and segmentation benchmark with corrected labels
    • A medical document understanding benchmark with graded difficulty levels
  3. Novel Multimodal Training Paradigm: We design a novel multimodal training paradigm to accelerate convergence and enhance generalization across diverse imaging and reasoning tasks.

Extensive experiments demonstrate that Citrus-V surpasses existing open-source medical foundation models and expert-level imaging systems across multiple benchmarks, establishing new state-of-the-art performance in both visual and multimodal tasks. By providing a complete pipeline from visual grounding to clinical reasoning, Citrus-V offers critical support for precise lesion quantification, automated radiology reporting, and reliable second opinions, marking a significant step toward general-purpose medical foundation models and the broader adoption of AI in clinical practice.

🔍 Key Features

  • Unified Medical Image Grounding: Advanced techniques for precise localization and understanding of medical images at the pixel level
  • Comprehensive Clinical Reasoning: Integration of medical knowledge graphs and clinical guidelines with multimodal chain-of-thought reasoning
  • Multi-modal Medical Understanding: Seamlessly process images, text, and structured data from electronic health records
  • Medical Image Analysis: Support for various medical imaging modalities (CT, MRI, X-ray, ultrasound, etc.) with detection and segmentation capabilities
  • Medical OCR: Specialized optical character recognition for medical documents and reports
  • Fine-grained Control: Adjustable parameters for different medical specialties and use cases
  • Efficient Training Pipeline: Optimized for medical datasets with packing and streaming capabilities

🛠️ Installation

To install Citrus-V:

# Clone the repository
git clone https://github.com/jdh-algo/Citrus-V.git
cd Citrus-V

# Install dependencies
pip install -r requirements/requirements.txt

# Install the package
pip install -e .

🚀 Quick Start

Here's a quick example to get started with Citrus-V for medical image analysis:

Medical Image Grounding

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
    --model path/to/citrus-v-model \
    --image_path path/to/medical_image.jpg \
    --task grounding \
    --output_dir ./results
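The grounding task returns lesion locations as bounding boxes. Assuming the model emits boxes in normalized `[x1, y1, x2, y2]` form (a common convention for multimodal grounding models, not confirmed by this repository), a minimal sketch for mapping them back to pixel coordinates might look like:

```python
# Convert normalized [x1, y1, x2, y2] boxes to integer pixel coordinates.
# The normalized-box output format here is an assumption for illustration,
# not the repository's documented schema.

def denormalize_box(box, width, height):
    """Map a normalized box (values in [0, 1]) onto a width x height image."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * width), round(y1 * height),
        round(x2 * width), round(y2 * height),
    )

# Example: a box covering the central quarter of a 512 x 512 image.
print(denormalize_box([0.25, 0.25, 0.75, 0.75], 512, 512))
# -> (128, 128, 384, 384)
```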

Medical Image Segmentation

CUDA_VISIBLE_DEVICES=0 \
python inference_seg.py \
    --model path/to/citrus-v-seg-model \
    --image_path path/to/medical_image.jpg \
    --output_dir ./segmentation_results
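Segmentation quality is commonly reported with overlap metrics such as the Dice coefficient. As an illustration (the helper below is not part of this repository), a predicted binary mask can be scored against a reference mask like so:

```python
# Dice coefficient between two binary masks given as flat 0/1 sequences.
# Dice = 2 * |A ∩ B| / (|A| + |B|); 1.0 means perfect overlap.

def dice_coefficient(pred, ref):
    intersection = sum(p * r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    if total == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return 2.0 * intersection / total

pred = [1, 1, 0, 0]
ref  = [1, 0, 0, 0]
print(dice_coefficient(pred, ref))  # -> 0.6666666666666666
```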

📊 Usage

Citrus-V supports various medical AI tasks through a unified interface. Below are some common usage examples:

Training on Medical Datasets

# Multi-node training with medical image datasets
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model path/to/base-model \
    --dataset medsam2d_seg_wooverlap_29k_20250901 medseg_bench_10k_oversample \
    --train_type full \
    --torch_dtype bfloat16 \
    --max_length 12288 \
    --num_train_epochs 3 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 1 \
    --output_dir ./citrus-v-medical-checkpoints

Medical Document Understanding

# Processing medical documents with OCR and understanding
CUDA_VISIBLE_DEVICES=0 \
python app.py \
    --model path/to/citrus-v-model \
    --doc_path path/to/medical_report.pdf \
    --task medical_doc_qa

📁 Datasets

Citrus-V supports various medical datasets for training and evaluation:

  • Medical Image Datasets: MedSAM, MedSegBench, and custom medical image collections
  • Medical Document Datasets: OCR-processed medical reports, prescriptions, and test results
  • Clinical Question Answering: Medical Q&A pairs with clinical reasoning chains
  • Grounding Datasets: Medical images with detailed annotations and region-specific descriptions

📈 Evaluation

Citrus-V can be evaluated on various medical AI benchmarks:

# Evaluate on medical image grounding benchmarks
CUDA_VISIBLE_DEVICES=0 \
swift eval \
    --model path/to/citrus-v-model \
    --eval_dataset medical_grounding_benchmark \
    --eval_backend evalscope \
    --output_dir ./evaluation_results

🏛 License

This project is licensed under the Apache License (Version 2.0). For models and datasets, please refer to the original resource pages and follow their corresponding licenses.

📎 Citation

If you use Citrus-V in your research, please cite our work:

@article{citrusv2024,
  title={Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning},
  author={Your Name and Contributors},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}

This repository hosts the homepage of Citrus-V.
