Data Quality Library (dqLib): An R Package for Explainable and Traceable Assessments of Clinical Data Quality

License: GPL v3

1. Description

The Data Quality Library (dqLib) is an R package for explainable and traceable data quality (DQ) assessments. It provides generic methods for calculating DQ metrics and generating reports on detected DQ issues, especially in clinical care and research. dqLib also provides specific functions for reporting on DQ issues that may arise in the context of cardiovascular diseases (CVDs) and rare diseases (RDs). The reports provide enough information to explain the detected DQ issues and to help users trace them back to their sources and underlying causes. dqLib was validated using real-world data and applied to different use cases for both rare and common diseases [1-3]. The latest release enables the detection and visualization of plausibility issues based on predefined logical and mathematical rules. To improve usability, this version allows users to specify DQ rules in spreadsheets. Example visualizations and DQ reports are available in section 4, and further details on the developed functions are given in the package news.

2. Installation

You can install dqLib directly from GitHub by running the following command:

devtools::install_github("https://github.com/KaisTahar/dqLib") 

Alternatively, clone or download the code repository of the desired version, then run the following command from the directory that contains the dqLib folder:

devtools::install_local("./dqLib")

3. DQ Metrics and Reports

dqLib provides multiple metrics and reporting functions for analyzing different aspects of DQ. Users can select the appropriate indicators and generate customized DQ reports. The following generic DQ indicators are already implemented:

| DQ Indicator (Abbreviation) | DQ Indicator (Name) | DQ Dimension |
|---|---|---|
| dqi_co_icr | Item Completeness Rate | Completeness |
| dqi_co_vcr | Value Completeness Rate | Completeness |
| dqi_co_scr | Subject Completeness Rate | Completeness |
| dqi_pl_rpr | Range Plausibility Rate | Plausibility |
| dqi_pl_spr | Semantic Plausibility Rate | Plausibility |
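
To make the completeness indicators more concrete, the following base R sketch computes them on a toy data set. It does not use the dqLib API. The column names, the list of mandatory items, and the formulas (rate = share of non-missing items, values, or subjects, in percent) are assumptions for illustration and may differ from the definitions implemented in dqLib.

```r
# Toy data set: one row per subject; NA marks a missing value.
# Column names and the list of mandatory items are hypothetical.
records <- data.frame(
  subject_id = c("s1", "s2", "s3", "s4"),
  birthdate  = c("1980-01-01", NA, "1975-06-30", "1990-12-12"),
  diagnosis  = c("I21.0", "I50.1", NA, "I10"),
  stringsAsFactors = FALSE
)
mandatory <- c("subject_id", "birthdate", "diagnosis", "admission_date")

# Item completeness (assumed formula): share of mandatory items
# (columns) that are present in the data set.
im      <- length(mandatory)
im_misg <- sum(!mandatory %in% names(records))
item_completeness <- 100 * (im - im_misg) / im

# Value completeness (assumed formula): share of non-missing values
# among the mandatory items that are present.
present <- intersect(mandatory, names(records))
vm      <- nrow(records) * length(present)
vm_misg <- sum(is.na(records[present]))
value_completeness <- 100 * (vm - vm_misg) / vm

# Subject completeness (assumed formula): share of subjects without
# any missing mandatory value.
s     <- nrow(records)
s_inc <- sum(!complete.cases(records[present]))
subject_completeness <- 100 * (s - s_inc) / s

print(c(item = item_completeness,
        value = value_completeness,
        subject = subject_completeness))
```

In this toy example one of four mandatory items is absent, two of twelve mandatory values are missing, and two of four subjects are incomplete.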


In addition to indicators, the reports include resulting parameters and offer adequate information to help users resolve the detected DQ issues. dqLib provides functions to report on the following DQ issues and related parameters:

| Abbreviation | DQ Parameter | Description |
|---|---|---|
| im_misg | missing mandatory data items | number of missing mandatory data items |
| vm_misg | missing mandatory data values | number of missing mandatory data values |
| s_inc | incomplete subjects | number of incomplete subject records |
| vo | outlier values | number of detected outlier values |
| vc | contradictory values | number of detected contradictory data values |
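
As an illustration of how outlier and contradiction counts such as vo and vc could be derived from rules, here is a minimal base R sketch. It does not call dqLib. The rule layout (one row per rule, as it might appear in a spreadsheet), the column names, the thresholds, and the rate formulas are all assumptions made for this example.

```r
# Toy observations; column names and values are made up.
obs <- data.frame(
  subject_id  = c("s1", "s2", "s3", "s4"),
  sex         = c("m", "f", "m", "m"),
  age         = c(54, 130, 47, 61),
  systolic_bp = c(120, 135, 400, 110),
  diagnosis   = c("I21.0", "C61", "I10", "C61"),
  stringsAsFactors = FALSE
)

# Hypothetical range rules, one row per rule, as they might be
# maintained in a spreadsheet and read into a data frame.
range_rules <- data.frame(
  item = c("age", "systolic_bp"),
  min  = c(0, 50),
  max  = c(120, 300),
  stringsAsFactors = FALSE
)

# Flag every value that falls outside [min, max] of its rule.
# The result is a logical matrix: one row per subject, one column per rule.
is_outlier <- mapply(
  function(item, lo, hi) obs[[item]] < lo | obs[[item]] > hi,
  range_rules$item, range_rules$min, range_rules$max
)
vo <- sum(is_outlier)
range_plausibility <- 100 * (length(is_outlier) - vo) / length(is_outlier)

# Hypothetical logical rule: a prostate cancer diagnosis (ICD-10 C61)
# contradicts a female sex entry.
is_contradictory <- obs$sex == "f" & obs$diagnosis == "C61"
vc <- sum(is_contradictory)
semantic_plausibility <- 100 * (nrow(obs) - vc) / nrow(obs)

# Keep the offending subjects so the issues can be traced back.
outlier_subjects       <- obs$subject_id[rowSums(is_outlier) > 0]
contradictory_subjects <- obs$subject_id[is_contradictory]
```

Keeping the affected subject identifiers next to the counts mirrors the idea of traceable reports: a count alone says how many issues exist, while the identifiers show where to look.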


dqLib also provides functions to assess the following specific indicators for RD data:

| DQ Indicator (Abbreviation) | DQ Indicator (Name) | DQ Dimension |
|---|---|---|
| dqi_un_cur | RD Case Unambiguity Rate | Uniqueness |
| dqi_un_cdr | RD Case Dissimilarity Rate | Uniqueness |
| dqi_co_icr | Orphacoding Completeness Rate | Completeness |
| dqi_pl_opr | Orphacoding Plausibility Rate | Plausibility |
| dqi_cc_rvl | Concordance with Reference Values from Literature | Concordance |
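
The uniqueness indicators can be sketched in base R as follows. This is not dqLib code. The placeholder codes, the mapping excerpt, the working definition of an ambiguous case (no Orphacode given although the ICD code maps to more than one Orphacode), and the rate formulas are assumptions for illustration only.

```r
# Toy RD cases with placeholder codes (not real ICD-10-GM or Orphacodes).
cases <- data.frame(
  patient_id = c("p1", "p2", "p2", "p3", "p4"),
  icd_code   = c("ICD_A", "ICD_B", "ICD_B", "ICD_A", "ICD_B"),
  orpha_code = c("ORPHA_1", "ORPHA_3", "ORPHA_3", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical mapping excerpt: ICD_A maps to two Orphacodes, ICD_B to one.
mapping <- data.frame(
  icd_code   = c("ICD_A", "ICD_A", "ICD_B"),
  orpha_code = c("ORPHA_1", "ORPHA_2", "ORPHA_3"),
  stringsAsFactors = FALSE
)
n_targets <- table(mapping$icd_code)

rdCase <- nrow(cases)

# Duplicated cases: rows that repeat an earlier row in every column.
rdCase_dup <- sum(duplicated(cases))

# Ambiguous cases (assumed definition): the Orphacode is missing and the
# ICD code alone does not identify a single rare disease.
is_ambiguous <- is.na(cases$orpha_code) &
  as.vector(n_targets[cases$icd_code]) > 1
rdCase_amb <- sum(is_ambiguous)

# Assumed rate formulas, in percent.
unambiguity_rate   <- 100 * (rdCase - rdCase_amb) / rdCase
dissimilarity_rate <- 100 * (rdCase - rdCase_dup) / rdCase
```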


Moreover, dqLib enables annual assessments of selected DQ parameters. The following RD-specific metrics are already implemented:

| Abbreviation | DQ Parameter | Description |
|---|---|---|
| rdCase | RD cases | number of RD cases |
| orphaCase | Orpha cases | number of available Orpha-coded cases |
| tracerCase | tracer cases | number of tracer cases |
| rdCase_rel | RD cases rel. frequency | relative frequency of RD cases |
| orphaCase_rel | Orpha cases rel. frequency | relative frequency of Orpha cases normalized to 100,000 inpatient cases |
| tracerCase_rel | tracer cases rel. frequency | relative frequency of tracer cases normalized to 100,000 inpatient cases |
| tracerCase_rel_min | minimal tracer cases in reference values | minimum relative frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
| tracerCase_rel_max | maximal tracer cases in reference values | maximum relative frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
| vm_case_misg | missing mandatory data values in case module | number of missing mandatory data values in the case module |
| rdCase_amb | ambiguous RD cases | number of ambiguous RD cases |
| rdCase_dup | duplicated RD cases | number of duplicated RD cases |
| oc_misg | missing Orphacodes | number of missing Orphacodes by tracer diagnoses |
| link_ip | implausible links | number of implausible ICD-10-GM/OC links |
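
The normalization to 100,000 inpatient cases and a comparison against literature bounds could look like the base R sketch below. It is independent of dqLib. All numbers are made up, the inpatient_cases column is a hypothetical name, and the per-year in-range check is only one possible reading of a concordance assessment.

```r
# Made-up yearly counts; 'inpatient_cases' is a hypothetical column name.
yearly <- data.frame(
  year            = c(2020, 2021),
  inpatient_cases = c(40000, 50000),
  tracerCase      = c(100, 90)
)

# Made-up reference bounds (per 100,000 inpatient cases).
tracerCase_rel_min <- 200
tracerCase_rel_max <- 280

# Relative frequency normalized to 100,000 inpatient cases.
yearly$tracerCase_rel <- yearly$tracerCase / yearly$inpatient_cases * 1e5

# Assumed concordance check: does the yearly value fall within the
# range reported in the literature?
yearly$concordant <- yearly$tracerCase_rel >= tracerCase_rel_min &
  yearly$tracerCase_rel <= tracerCase_rel_max

print(yearly)
```

Computing the values per year makes it possible to spot years in which the documented frequency drifts away from the expected range.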

The following references are required to assess the quality of RD documentation: (1) the current version of the Alpha-ID-SE terminology [4], and (2) a reference for tracer diagnoses, such as the list provided in [1].

4. Examples

  • CordDqChecker: A reporting tool for DQ assessments of RD data, implemented using dqLib. Its code repository includes examples of DQ reports generated from synthetic data.
  • CvdDqChecker: A reporting tool for assessing the quality of CVD data, also implemented using dqLib. Its ./Export folder contains example DQ reports and visualizations.

5. Notes

  • To cite dqLib, please use the CITATION file in the folder ./inst.

  • Acknowledgment: This work was partially funded by the German Center for Cardiovascular Research (DZHK), grant number 81X1300117, and by the "Collaboration on Rare Diseases" of the Medical Informatics Initiative (CORD-MI), grant number 01ZZ1911R (FKZ-01ZZ1911R).

6. References

[1] Tahar et al. Rare Diseases in Hospital Information Systems — An Interoperable Methodology for Distributed Data Quality Assessments. DOI: 10.1055/a-2006-1018

[2] Tahar et al. Local Data Quality Assessments on EHR-Based Real-World Data for Rare Diseases. DOI: 10.3233/SHTI230121

[3] Tahar K. CvdDqChecker: A Software Solution for Explainable and Traceable Assessments of Cardiovascular Disease Data Quality. Available from GitHub

[4] BfArM - Alpha-ID-SE. Available from BfArM
