ScottishLabData - A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses
If you are using this software, please reference the following paper:
Gao, C., Mumtaz, S., McCall, S., O’Sullivan, K., McGilchrist, M., Morales, D.R., Hall, C., Wilde, K., Mayor, C., Linksted, P. and Harrison, K., 2025. A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses. Journal of Biomedical Informatics, p.104771. https://doi.org/10.1016/j.jbi.2024.104771
In Scotland, laboratory data is captured, stored and reported from local health board systems with significant heterogeneity. For researchers or other users, working on laboratory datasets across regional cohorts requires effort and time.
We completed the required data governance applications and extracted laboratory tests (biochemistry, haematology, immunology, virology, pathology and microbiology) from all registered SHARE participants who were residents from 01/01/2015 to 31/12/2021 (or until death) in the region of each of the four regional Safe Havens.
The data was provided as flat files from the Safe Havens to HIC and loaded into a Microsoft Structured Query Language (SQL) Server database (version 16.0.1) for ease of handling the large data volumes. Using WhiteRabbit, sampling one million rows in each flat file, we tested the data types, completeness (fraction empty) and consistency (N unique and fraction unique) of each Safe Haven’s laboratory results. We used the Rabbit-In-a-Hat software to read and display WhiteRabbit scan results. Rabbit-In-a-Hat allows a user to connect source data to tables and columns within the target data structure through a graphical user interface. Based on SQL skeleton auto-generated by Rabbit-In-a-Hat, further editing of the mapping was made based on consulting local Safe Havens to establish the final SQL script for structural harmonisation.
Data exploration and harmonisation were conducted using a mixture of SQL and R (version 4.1.3).
Repository sub-directories:
-
SQL code for table structure/ SQL code for harmonising the structure of data, standardising the censored results as well as changing all text codes to read codes
-
R code for data harmonisation/ R scripts to select the 180 codes, standardising the test unit as well as visualisation.
-
Paper Appendix/ Appendix for published paper related to this work
Install SQL server (https://learn.microsoft.com/en-us/sql/sql-server/what-s-new-in-sql-server-2019?view=sql-server-ver16 )
Import sample data into SQL Server
When open SSMS, the connect to server page will automatically load the local SQL server that is running. Click the connect button and connect to the sql server.
Right click the databases, then select New Database from the drop down menu
Set the database name and click OK
In SQL Server Management Studio, connect to an instance of the SQL Server Database Engine. Expand Databases. Right-click a database. Point to Tasks. Select Import Data. And following the Wizard.
Check the tables has been successfully imported by selecting the data table then right-click and select “Select Top 1000 Rows”.
Run the following SQL script “a DaSH2FHIRscot.sql”, “a Glasgow2FHIRscot.sql”, “a HIC2FHIRscot”, “a Lothian2FHIRscot”
• After running all four, you will have four new tables as highlighted
• Optional code (SQL code for table structure/b create index.sql) to increase the efficiency of the sql code
• (SQL code for table structure/c readcode aggregates.sql) conducts test frequency analysis based on the FHIR data structure for each Safe Havens data. It generates four new tables <DaSH_ReadCodeAggregates, Glasgow_ReadCodeAggregates, HIC_ReadCodeAggregates and Lothian_ReadCodeAggregates> which contains the unique code and the frequency counts. The analysis results based on the example data is just for demonstration purposes. The test frequency based on real data has been reported in the paper.
Set up SQL login in MSSQL and give permission for R to connect to the example database
• Add new login by Security ->Logins ->new login
For the example, please use UID = "examplelogin", PWD="password",
Run the R script code following the step 1 to 7
• Make sure to create two empty folders in the R project folder, one called “data” one called “plot”
This is research data for only a subset of health board data and no inference or implication should be made regarding relative quality, accuracy, or performance of individual health boards.
We acknowledge the support of the Grampian Data Safe Haven (DaSH) facility within the Aberdeen Centre for Health Data Science and the associated financial support of the University of Aberdeen, and NHS Research Scotland (through NHS Grampian investment in DaSH). This work was supported by DataLoch (dataloch.org), which is core-funded by the Data-Driven Innovation programme within the Edinburgh and South East Scotland City Region Deal (ddi.ac.uk) and the Chief Scientist Office, Scottish Government. The authors would like to acknowledge the support of the DataLoch service for their involvement in obtaining approvals and provisioning access to data from the Lothian region. The authors would like to acknowledge the work of the West of Scotland Safe Haven team at NHS Greater Glasgow and Clyde in supporting the extractions and linkage to de-identified NHS patient datasets. We acknowledge the support of the Health Informatics Centre, University of Dundee for managing and supplying the anonymised data and NHS Tayside/Fife the original data source. This project was supported by funding from Research Data Scotland (RDS) through the Systems Development Fund.