This is the repository for the paper "Generative Large Language Models Trained for Detecting Errors in Radiology Reports".
Figure: the overall workflow of the large language models (LLMs) in this study.
Our work consists of three phases:
1. Dataset Construction
2. Model Development
3. Evaluation
We constructed a dataset consisting of two parts.
The first part includes 1,656 synthetic radiology reports generated by GPT-4 with specified prompts: 828 error-free synthetic reports and 828 synthetic reports with errors.
Please refer to Prompts_for_Synthetic.txt for these prompts.
The second part comprises 614 reports: 307 error-free reports from the MIMIC-CXR database, and 307 corresponding synthetic reports with errors, generated by GPT-4 from these MIMIC-CXR reports and specified prompts.
Please refer to Prompts_for_MIMIC.txt for these prompts.
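As a minimal illustration (not the exact pipeline used in the paper), the sketch below shows how an error-containing report could be generated from an error-free one with the OpenAI API. The prompt wording, sampling settings, and example report here are placeholders; the actual prompts are in the two files above.

```python
# Hedged sketch: ask GPT-4 to inject an error into a radiology report.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
# The prompt text below is a placeholder; the real prompts are in
# Prompts_for_Synthetic.txt and Prompts_for_MIMIC.txt.
from openai import OpenAI

client = OpenAI()

def inject_error(report: str) -> str:
    """Rewrite a report so that it contains exactly one error."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": (
                    "Rewrite the following radiology report so that it "
                    "contains exactly one clinically plausible error. "
                    "Return only the rewritten report.\n\n" + report
                ),
            },
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    clean_report = "FINDINGS: The lungs are clear. No pleural effusion or pneumothorax."
    print(inject_error(clean_report))
```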
We fine-tuned our models with the [Firefly](https://github.com/yangjianxin1/Firefly) codebase.
Llama-3-8B-Instruct and Llama-3-70B-Instruct were fine-tuned on the training set with the following hyperparameters:
| Hyperparameter | Llama-3-8B-Instruct | Llama-3-70B-Instruct |
|---|---|---|
| Batch size | 1 | 1 |
| Learning rate | 3e-4 | 3e-4 |
| Epochs | 3 | 3 |
| Max length | 512 | 512 |
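Training itself is driven by Firefly's own scripts and configuration files; as a rough, hedged equivalent only, the sketch below reproduces the hyperparameters in the table with Hugging Face transformers + peft (LoRA). The dataset path, LoRA settings, and output directory are assumptions, not the paper's exact configuration.

```python
# Illustrative sketch only: the paper trains with the Firefly codebase.
# This shows roughly equivalent LoRA fine-tuning with transformers + peft,
# using the hyperparameters from the table above.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA adapter settings (assumed; Firefly sets these in its own config files).
model = get_peft_model(model, LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM"))

# "train.jsonl" with a "text" field is a placeholder for the actual training set.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),  # Max length 512
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="output",
        per_device_train_batch_size=1,  # Batch size 1
        learning_rate=3e-4,             # Learning rate 3e-4
        num_train_epochs=3,             # Epochs 3
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```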
We evaluated the performance of models including the fine-tuned Llama-3 models and GPT-4 on the test set.
Please refer to demo.ipynb for the evaluation code.
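For a self-contained flavor of the evaluation step (demo.ipynb is the authoritative code), here is a hedged inference sketch; the checkpoint path, instruction wording, and example report are assumptions.

```python
# Hedged inference sketch: ask a fine-tuned model whether a report has an error.
# "output" is a placeholder path to a fine-tuned checkpoint, not the released one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "output"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

report = "FINDINGS: The lungs are clear. IMPRESSION: Large right pleural effusion."
messages = [
    {
        "role": "user",
        "content": (
            "Does the following radiology report contain an error? "
            "Answer yes or no, and identify the error if present.\n\n" + report
        ),
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    inputs, max_new_tokens=128, do_sample=False, pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```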
Please cite the following paper if you use the data or code in this repository:
```
@article{sun2025generative,
  title={Generative large language models trained for detecting errors in radiology reports},
  author={Sun, Cong and Teichman, Kurt and Zhou, Yiliang and Critelli, Brian and Nauheim, David and Keir, Graham and Wang, Xindi and Zhong, Judy and Flanders, Adam E and Shih, George and others},
  journal={Radiology},
  volume={315},
  number={2},
  pages={e242575},
  year={2025},
  publisher={Radiological Society of North America}
}
```