This repo contains code for the paper "The Optimization Paradox in Clinical AI Multi-Agent Systems". It demonstrates how optimizing individual components can catastrophically undermine overall system performance in multi-agent clinical AI systems. The framework evaluates both single-agent and multi-agent workflows on real patient cases from the MIMIC-CDM dataset, currently supports 8 LLM families, and reports comprehensive metrics covering diagnostic accuracy, process adherence, and cost efficiency.
## 📖 Table of Contents
- 🚀 Quick Start
- 📊 What This Does
- 🏥 Key Finding
- 📈 Results & Evaluation
- 🔧 Supported Models
- 📋 Requirements
- 📚 Citation
- 📧 Issues
## 🚀 Quick Start

1. **Install dependencies**

```bash
conda env create -f environment.yaml
conda activate clinagent_env
```

2. **Configure APIs**

```bash
cp config.example.yaml config.yaml
# Edit config.yaml with your API keys
```

3. **Run evaluation**

```bash
# Single agent
python3 run_single_agent.py --model_id_main gpt --dataset_type val

# Multi-agent
python3 run_multi_agent.py --model_id_info gemini --model_id_diagnosis gpt --dataset_type val
```
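Before launching a run, it can help to sanity-check that `config.yaml` parsed correctly. The snippet below is a minimal sketch, assuming PyYAML is available in the environment and that each provider entry carries an `api_key` field; both the field name and the per-provider layout are assumptions, so match them against `config.example.yaml`.

```python
# Sketch: sanity-check config.yaml before a run. The "api_key" field name
# and the provider-keyed layout are assumptions; see config.example.yaml
# for the actual schema used by this repo.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

for provider, settings in config.items():
    if not isinstance(settings, dict) or not settings.get("api_key"):
        print(f"Warning: no api_key configured for {provider!r}")
```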
## 📊 What This Does

The framework tests clinical reasoning on 2,400 real patient cases spanning 4 abdominal conditions, in three configurations:

- **Single-agent**: one model handles everything
- **Multi-agent**: specialized models for information gathering, interpretation, and diagnosis (sketched below)
- **Best-of-Breed**: the top-performing components combined (spoiler: performs worst!)
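To make the division of labor concrete, here is a conceptual sketch of the three-stage multi-agent flow. The function and type names are hypothetical, not the repo's API; `run_multi_agent.py` is the actual entry point.

```python
# Conceptual sketch of the multi-agent flow; names are hypothetical,
# not this repo's API (see run_multi_agent.py for the real entry point).
from typing import Callable

# An "agent" here is any text-in/text-out LLM call.
Agent = Callable[[str], str]

def run_multi_agent_case(case: str, info_agent: Agent, diagnosis_agent: Agent) -> str:
    """Route one patient case through specialized agents."""
    # Stage 1: information gathering -- request the relevant history/tests.
    findings = info_agent(f"Gather the relevant findings for this case:\n{case}")
    # Stage 2: interpretation -- turn raw findings into a clinical summary.
    summary = info_agent(f"Interpret these findings clinically:\n{findings}")
    # Stage 3: diagnosis -- a separate model commits to a final diagnosis.
    return diagnosis_agent(f"Give the most likely diagnosis:\n{summary}")
```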
## 🏥 Key Finding

The Best-of-Breed system built from individually optimal components achieved only 67.7% accuracy, versus 77.4% for a well-integrated multi-agent system, despite superior process metrics.
## 📈 Results & Evaluation

```bash
python3 run_evals.py --log_dir logs/<experiment_name>
```

Results include diagnostic accuracy, process adherence, and cost metrics.
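For intuition, here is a sketch of how a diagnostic-accuracy figure could be derived from per-case logs. The log layout (one JSON file per case) and the `prediction`/`gold_diagnosis` field names are assumptions for illustration; `run_evals.py` is the authoritative tool.

```python
# Sketch only: the one-JSON-per-case layout and the "prediction" /
# "gold_diagnosis" field names are hypothetical. Use run_evals.py for
# the real metrics.
import json
from pathlib import Path

def diagnostic_accuracy(log_dir: str) -> float:
    """Fraction of cases whose prediction contains the gold diagnosis."""
    cases = [json.loads(p.read_text()) for p in Path(log_dir).glob("*.json")]
    hits = sum(c["gold_diagnosis"].lower() in c["prediction"].lower() for c in cases)
    return hits / len(cases) if cases else 0.0
```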
## 🔧 Supported Models

Azure OpenAI, Claude, Gemini, Llama, o3-mini, DeepSeek
## 📋 Requirements

- Python 3.10+
- API keys for your chosen models
- MIMIC-CDM dataset access
## 📚 Citation

A citation will be added once the paper is published.
## 📧 Issues

Please report bugs or problems by opening an issue on this GitHub repository.