Research collaboration between the University of Aberdeen and NHS Grampian
This project investigates the use of Natural Language Processing (NLP) models to predict patient surgical selection from pre-operative radiology reports. The aim is to evaluate the performance of advanced transformer models (e.g., GatorTron, RoBERTa) on this classification task using a rigorous cross-validation approach to ensure reproducibility and robustness.
The project framework originates from Dr. Luke Farrow’s GatorTron model, developed within the SHAIP secure environment on real clinical data. To enable open experimentation and sharing, this repository adapts that pipeline to synthetic radiology report data and extends it with additional models such as RoBERTa variants. We also thank the developers of PyTorch, Hugging Face, scikit-learn, and the other open-source libraries that made this work possible.
The repository is structured to reflect the project's workflow:
- `src/`: Contains all refactored Python scripts for data loading, modeling, and evaluation. Includes a `private_data_processing/` subfolder with R scripts used for initial data linkage (no patient identifiers included).
- `notebooks/`: Stores Jupyter notebooks used for initial model development and exploratory analysis.
- `data/`: Placeholder for the dataset. The original SHAIP dataset cannot be shared; a synthetic dataset that mimics its statistical properties is provided for reproducibility.
- `models/`: Directory for saved model weights (not committed to GitHub).
- `images/`: Contains exported figures such as dashboards, confusion matrices, and ROC curves for the README.
```bash
pip install -r requirements.txt
python train.py --csv data/hip_radiology_reports_finalised_SYNTH.csv --model UFNLP/gatortron-base --folds 3
```
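The `--folds 3` option controls the cross-validation loop. To illustrate the idea, here is a minimal pure-Python sketch of stratified fold assignment — a simplified, hypothetical stand-in for scikit-learn's `StratifiedKFold`, which the actual pipeline more likely uses:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=3, seed=42):
    """Assign each sample index to a fold while keeping the class
    balance roughly equal in every fold (sketch of stratified CV)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            folds[i % n_splits].append(idx)  # round-robin within each class
    return folds
```

Each fold then serves once as the held-out evaluation set while the model trains on the remaining folds.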
This project builds upon the original work to create a more robust and comprehensive machine learning pipeline. The key modifications include:
- Synthetic Data Generation: A Python script was developed to generate a realistic, synthetic dataset of clinical text and patient outcomes, enabling a fully reproducible analysis while protecting patient privacy.
- Enhanced Code Structure: The code was refactored into distinct, logical sections for data generation, model setup, training, and evaluation.
- Improved Evaluation Workflow: The script now calculates and reports overall average metrics and their 95% confidence intervals across all cross-validation folds.
- Comprehensive Visualization: New plotting functions were added to generate an aggregated ROC curve, average confusion matrix, and a calibration plot.
- Increased Robustness: The script now includes error handling for critical operations like file loading.
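The fold-level aggregation described above (average metrics with 95% confidence intervals) can be sketched as follows. This is an illustration, not the project's exact code; the `t_crit` default assumes 3 folds (df = 2), and other fold counts would need the corresponding Student-t value (e.g. via `scipy.stats.t.ppf`):

```python
import statistics

def mean_with_ci(fold_scores, t_crit=4.303):
    """Mean and 95% confidence interval of a metric across CV folds.

    t_crit = 4.303 is the two-tailed 95% Student-t critical value for
    df = 2 (i.e. 3 folds); replace it for other fold counts.
    """
    n = len(fold_scores)
    mean = statistics.mean(fold_scores)
    half_width = t_crit * statistics.stdev(fold_scores) / n ** 0.5
    return mean, mean - half_width, mean + half_width
```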
The most effective strategy was combining the RoBERTa model with data augmentation, as demonstrated by the metrics from a 3-fold cross-validation run.
| Model Variant | F1 Score | Accuracy | Recall | Precision |
|---|---|---|---|---|
| RoBERTa + Data Augmentation | 0.6917 | 0.5485 | 0.9779 | 0.5355 |
| RoBERTa + Data Augmentation + Weight Decay | 0.6886 | 0.5462 | 0.9691 | 0.5351 |
| GatorTron + Class Weighting | 0.5249 | 0.4452 | 0.8762 | 0.3771 |
| GatorTron Baseline | 0.5244 | 0.4593 | 0.8564 | 0.3779 |
| RoBERTa + Class Weighting | 0.5199 | 0.4365 | 0.8746 | 0.3703 |
The analysis revealed that while data augmentation significantly improved the model's ability to identify positive cases (high recall), it did so with a notable trade-off in precision. The model frequently produced false positives, which is a key area for future improvement.
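For context, the "random insertion" augmentation used here can be sketched as below. This is a simplified, hypothetical variant that duplicates words already present in the report (classic EDA-style random insertion typically inserts synonyms instead); the project's actual augmentation code may differ:

```python
import random

def random_insertion(text, n_insert=2, seed=None):
    """Augment a report by inserting n copies of randomly chosen
    existing words at random positions (simplified sketch)."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_insert):
        word = rng.choice(words)          # pick a word already in the text
        pos = rng.randint(0, len(words))  # pick an insertion point
        words.insert(pos, word)
    return " ".join(words)
```

Because the augmented reports only duplicate positive-class phrasing, a sketch like this helps explain the recall/precision trade-off seen in the table: the model sees more positive patterns but also learns to over-predict them.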
Performance Dashboard Example:
This project is fully reproducible. Please follow these steps in order:
Step 1: Clone the Repository and Install Dependencies

```bash
git clone https://github.com/somaia-e/Predicting-Surgical-Selection-for-Arthroplasty.git
cd Predicting-Surgical-Selection-for-Arthroplasty
pip install -r requirements.txt
```

Step 2: Run the Scripts

Navigate to the `src` directory and execute the main training and evaluation script:

```bash
cd src
python train_and_evaluate.py
```
5️⃣ Future Work

- Advanced Data Augmentation: Explore more sophisticated techniques beyond random insertion to improve model generalization.
- Hyperparameter Tuning: Conduct a more extensive search to find the optimal combination of learning rates, batch sizes, and epochs for the best-performing models.
- Error Analysis: Perform a deeper analysis of false positive predictions to understand their root causes and develop strategies to mitigate them.
This project is available under the MIT License. See the LICENSE file for details.