ScottishLabData - A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses

Reference

If you are using this software, please reference the following paper:

Gao, C., Mumtaz, S., McCall, S., O’Sullivan, K., McGilchrist, M., Morales, D.R., Hall, C., Wilde, K., Mayor, C., Linksted, P. and Harrison, K., 2025. A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses. Journal of Biomedical Informatics, p.104771. https://doi.org/10.1016/j.jbi.2024.104771

Background

In Scotland, laboratory data is captured, stored and reported from local health board systems with significant heterogeneity. For researchers or other users, working on laboratory datasets across regional cohorts requires effort and time.

Description

We completed the required data governance applications and extracted laboratory tests (biochemistry, haematology, immunology, virology, pathology and microbiology) from all registered SHARE participants who were residents from 01/01/2015 to 31/12/2021 (or until death) in the region of each of the four regional Safe Havens.

The data was provided as flat files from the Safe Havens to HIC and loaded into a Microsoft Structured Query Language (SQL) Server database (version 16.0.1) for ease of handling the large data volumes. Using WhiteRabbit, sampling one million rows in each flat file, we tested the data types, completeness (fraction empty) and consistency (N unique and fraction unique) of each Safe Haven’s laboratory results. We used the Rabbit-In-a-Hat software to read and display WhiteRabbit scan results. Rabbit-In-a-Hat allows a user to connect source data to tables and columns within the target data structure through a graphical user interface. Based on SQL skeleton auto-generated by Rabbit-In-a-Hat, further editing of the mapping was made based on consulting local Safe Havens to establish the final SQL script for structural harmonisation.

Data exploration and harmonisation were conducted using a mixture of SQL and R (version 4.1.3).

This repository

Repository sub-directories:

SQL code for table structure/ SQL code for harmonising the structure of data, standardising the censored results as well as changing all text codes to read codes
R code for data harmonisation/ R scripts to select the 180 codes, standardising the test unit as well as visualisation.
Paper Appendix/ Appendix for published paper related to this work

Running code on the example

Step 1

Install SQL server (https://learn.microsoft.com/en-us/sql/sql-server/what-s-new-in-sql-server-2019?view=sql-server-ver16 )

Step 2

Install SSMS https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16&redirectedfrom=MSDN

Step 3

Import sample data into SQL Server

3.1 Connect to the sql through SSMS

When open SSMS, the connect to server page will automatically load the local SQL server that is running. Click the connect button and connect to the sql server.

3.2 Create a new database

Right click the databases, then select New Database from the drop down menu

Set the database name and click OK

3.3 Load the example data into the sql database

In SQL Server Management Studio, connect to an instance of the SQL Server Database Engine. Expand Databases. Right-click a database. Point to Tasks. Select Import Data. And following the Wizard.

Check the tables has been successfully imported by selecting the data table then right-click and select “Select Top 1000 Rows”.

3.4 Repeat step 3.3 for all 7 csv files in \ScottishLabData\ExemplarTestData folder

Step 4

Run the following SQL script “a DaSH2FHIRscot.sql”, “a Glasgow2FHIRscot.sql”, “a HIC2FHIRscot”, “a Lothian2FHIRscot”

• After running all four, you will have four new tables as highlighted

• Optional code (SQL code for table structure/b create index.sql) to increase the efficiency of the sql code

• (SQL code for table structure/c readcode aggregates.sql) conducts test frequency analysis based on the FHIR data structure for each Safe Havens data. It generates four new tables <DaSH_ReadCodeAggregates, Glasgow_ReadCodeAggregates, HIC_ReadCodeAggregates and Lothian_ReadCodeAggregates> which contains the unique code and the frequency counts. The analysis results based on the example data is just for demonstration purposes. The test frequency based on real data has been reported in the paper.

Step 5

Set up SQL login in MSSQL and give permission for R to connect to the example database

• Add new login by Security ->Logins ->new login

For the example, please use UID = "examplelogin", PWD="password",

Step 6

Run the R script code following the step 1 to 7

• Make sure to create two empty folders in the R project folder, one called “data” one called “plot”

Acknowledgments

This is research data for only a subset of health board data and no inference or implication should be made regarding relative quality, accuracy, or performance of individual health boards.

We acknowledge the support of the Grampian Data Safe Haven (DaSH) facility within the Aberdeen Centre for Health Data Science and the associated financial support of the University of Aberdeen, and NHS Research Scotland (through NHS Grampian investment in DaSH). This work was supported by DataLoch (dataloch.org), which is core-funded by the Data-Driven Innovation programme within the Edinburgh and South East Scotland City Region Deal (ddi.ac.uk) and the Chief Scientist Office, Scottish Government. The authors would like to acknowledge the support of the DataLoch service for their involvement in obtaining approvals and provisioning access to data from the Lothian region. The authors would like to acknowledge the work of the West of Scotland Safe Haven team at NHS Greater Glasgow and Clyde in supporting the extractions and linkage to de-identified NHS patient datasets. We acknowledge the support of the Health Informatics Centre, University of Dundee for managing and supplying the anonymised data and NHS Tayside/Fife the original data source. This project was supported by funding from Research Data Scotland (RDS) through the Systems Development Fund.

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
ExemplarTestData		ExemplarTestData
Paper Appendix		Paper Appendix
R code for data harmonisation		R code for data harmonisation
SQL code for table structure		SQL code for table structure
readmefigures		readmefigures
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ScottishLabData - A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses

Reference

Background

Description

This repository

Running code on the example

Step 1

Step 2

Step 3

3.1 Connect to the sql through SSMS

3.2 Create a new database

3.3 Load the example data into the sql database

3.4 Repeat step 3.3 for all 7 csv files in \ScottishLabData\ExemplarTestData folder

Step 4

Step 5

Step 6

Acknowledgments

About

Uh oh!

Releases 3

Packages

Contributors 2

Uh oh!

Languages

License

HicResearch/ScottishLabData

Folders and files

Latest commit

History

Repository files navigation

ScottishLabData - A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses

Reference

Background

Description

This repository

Running code on the example

Step 1

Step 2

Step 3

3.1 Connect to the sql through SSMS

3.2 Create a new database

3.3 Load the example data into the sql database

3.4 Repeat step 3.3 for all 7 csv files in \ScottishLabData\ExemplarTestData folder

Step 4

Step 5

Step 6

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Uh oh!

Languages

Packages