This repository contains the code and resources for a study investigating the effectiveness of a low-cost prompt engineering strategy for Emergency Severity Index (ESI) classification using Large Language Models (LLMs).
The goal of this project is to provide a transparent and reproducible framework for evaluating the performance of LLMs in clinical triage tasks, with a rigorous focus on preventing data leakage to simulate a realistic decision-making scenario.
Overcrowding in emergency departments is a global challenge. This study explores how prompt engineering, a low-cost alternative to fine-tuning, can be used to guide LLMs to achieve high performance in patient classification, aligning with the clinical reasoning of experts.
The main script (`llm-esi-triage.py`) runs an experiment on a validated subset of the MIETIC dataset, generates ESI predictions, and produces a complete set of performance reports and artifacts for analysis.
Our experiments demonstrate strong performance using prompt engineering for ESI classification:
| Model | Accuracy | Quadratic Kappa | F1-Score (Weighted) |
|---|---|---|---|
| GPT-4.1 | 86.1% | 0.948 | 0.857 |
| GPT-5 | 88.9% | 0.961 | 0.890 |
*Results based on 36 validated cases from the MIETIC dataset.*
Key Findings:
- Both models achieved high quadratic kappa scores (>0.94), indicating excellent agreement with expert classifications
- GPT-5 shows modest improvement over GPT-4.1 across all metrics
- Perfect classification achieved for ESI-4 and ESI-5 categories with GPT-5
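The quadratic-weighted kappa reported above can be computed with scikit-learn. The sketch below uses invented toy labels, not the study's actual predictions, purely to illustrate the metric:

```python
# Quadratic-weighted Cohen's kappa, as used for the agreement scores above.
# The ESI labels here are toy data for illustration only.
from sklearn.metrics import cohen_kappa_score

# Hypothetical expert labels and model predictions (ESI levels 1-5)
y_true = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
y_pred = [1, 2, 3, 3, 3, 2, 4, 4, 5, 5]

# weights="quadratic" penalizes large ordinal disagreements more heavily,
# which suits ESI's ordered severity scale
kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"Quadratic weighted kappa: {kappa:.3f}")
```

Quadratic weighting is the conventional choice for ordinal scales like ESI, since misclassifying an ESI-1 patient as ESI-5 is far more costly than confusing adjacent levels.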
A draft of the accompanying paper for this research, "High-Performance Emergency Triage Classification Using Cost-Effective Prompt Engineering," is available for viewing. As a work-in-progress, feedback is welcome.
Read the Paper Draft on Google Docs
- Methodological Rigor: Implements a comprehensive exclusion list to prevent data leakage of clinical outcomes, ensuring a fair evaluation.
- Full Reproducibility: Uses random seeds and saves all configurations, metrics, and prompts in metadata files for each run.
- Comprehensive Reporting: Generates multiple artifacts for each experiment, including detailed logs, raw predictions, metrics in JSON format, a human-readable summary report, and a confusion matrix visualization.
- Robust Code: Structured with software best practices, including dataclass configuration, professional logging, and error handling.
This experiment uses the MIMIC-IV-Ext Triage Instruction Corpus (MIETIC), publicly available on the PhysioNet platform.
- Source: https://physionet.org/content/mietic/1.0.0/
- File Used: `MIETIC-validate-samples.csv`
The script automatically filters the dataset to use only the 36 cases where the `Final Decision` field was validated as 'RETAIN' by experts, ensuring a high-quality ground truth.
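That filtering step can be sketched in pandas as follows. This assumes the CSV exposes a `Final Decision` column as described above; the exact column name in the released file may differ:

```python
# Sketch of the expert-validation filter, assuming a "Final Decision"
# column in MIETIC-validate-samples.csv. Column naming is an assumption.
import pandas as pd

def load_validated_cases(csv_path: str) -> pd.DataFrame:
    """Keep only the rows experts validated as 'RETAIN'."""
    df = pd.read_csv(csv_path)
    return df[df["Final Decision"] == "RETAIN"].reset_index(drop=True)
```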
Follow the steps below to replicate the experiment.
- Python 3.8 or higher
- An OpenAI account with API access
1. Clone the repository:

   ```bash
   git clone https://github.com/DiegoZoracKy/research-llm-esi-triage.git
   cd research-llm-esi-triage
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up your OpenAI API key: the script requires `OPENAI_API_KEY` to be set as an environment variable.

   ```bash
   export OPENAI_API_KEY="your_key_here"
   ```
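A script that depends on this variable typically checks for it at startup and fails fast with a clear message. A minimal sketch of such a check (the actual script's startup logic may differ):

```python
# Minimal fail-fast check for the API key environment variable.
# This is an illustrative sketch, not the script's actual implementation.
import os
import sys

def require_api_key() -> str:
    """Return OPENAI_API_KEY, or exit with a helpful message if unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        sys.exit("OPENAI_API_KEY is not set; export it before running the script.")
    return key
```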
With everything set up, run the script from your terminal:

```bash
python llm-esi-triage.py
```
The script will create a new directory inside the `results/` folder for each run, containing all the generated artifacts.
For each run, a new folder will be created, for example: `results/20250730_231450_gpt-4.1_ESI_vv4.5/`. Inside it, you will find:
- `predictions.csv`: The most detailed file, with a row for each patient, including the exact prompt sent, the raw LLM response, the extracted prediction, and the actual value.
- `metrics.json`: All performance metrics (accuracy, Kappa, F1-score, etc.) in a structured JSON format.
- `metadata.json`: The complete "recipe" for the experiment, including the configuration used and the prompt templates.
- `summary_report.txt`: A human-readable summary of the results, ideal for a quick analysis.
- `confusion_matrix.png`: A visualization of the confusion matrix, ready to be used in presentations or the paper.
- `confusion_matrix.csv`: The data for the confusion matrix in CSV format.
- `experiment.log`: A detailed log of the entire script execution, useful for debugging.
- `error_cases.csv`: If any errors occur, this file lists the cases that failed, to facilitate analysis.
This project is licensed under the MIT License. See the `LICENSE` file for more details.