Data Quality Library (dqLib): An R Package for Explainable and Traceable Assessments of Clinical Data Quality
The Data Quality Library (dqLib) is an R package for explainable and traceable data quality (DQ) assessments. This package provides generic methods for calculating DQ metrics and generating reports on detected DQ issues, especially in clinical care and research. dqLib also provides specific functions for reporting on DQ issues that may arise in the context of cardiovascular diseases (CVDs) and rare diseases (RDs). The reports provide enough information to explain the detected DQ issues and help users trace them back to their sources and underlying causes. dqLib was validated using real-world data and applied to different use cases for both rare and common diseases [1-3]. The latest release enables the detection and visualization of plausibility issues based on predefined logical and mathematical rules. To improve usability, this version allows users to specify DQ rules using spreadsheets. Exemplary visualizations and DQ reports are available in section 4, while further details on the developed functions are given in the news.
You can install dqLib directly from GitHub by running the following command:
devtools::install_github("https://github.com/KaisTahar/dqLib")
Alternatively, you can clone or download the code repository of the desired version and then run the following command from the directory that contains the local dqLib folder:
devtools::install_local("./dqLib")
dqLib provides multiple metrics and reporting functions to analyze different aspects of DQ. The implemented functions enable users to select appropriate indicators and generate customized DQ reports. The following generic DQ indicators are already implemented:
Abbreviation | Name | DQ Dimension |
---|---|---|
dqi_co_icr | Item Completeness Rate | Completeness |
dqi_co_vcr | Value Completeness Rate | Completeness |
dqi_co_scr | Subject Completeness Rate | Completeness |
dqi_pl_rpr | Range Plausibility Rate | Plausibility |
dqi_pl_spr | Semantic Plausibility Rate | Plausibility |
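To make the completeness indicators more concrete, the following base R sketch computes the three rates on a toy data set. It does not use the dqLib API. The data frame, the list of mandatory items, and the formulas (share of present items, values, and complete subjects, in percent) are assumptions made for illustration only.

```r
# Toy data set: NA marks a missing value (all names are illustrative).
df <- data.frame(
  patient_id = c("P1", "P2", "P3", "P4"),
  birthdate  = c("1980-01-01", NA, "1975-05-20", "1990-11-02"),
  icd_code   = c("I21.0", "I50.1", NA, NA),
  stringsAsFactors = FALSE
)

# Mandatory items expected by a hypothetical data model.
mandatory <- c("patient_id", "birthdate", "icd_code", "admission_date")

# Item completeness: share of mandatory items that exist as columns.
im_misg <- sum(!mandatory %in% names(df))
item_completeness <- 100 * (length(mandatory) - im_misg) / length(mandatory)

# Value completeness: share of non-missing values in the available mandatory items.
present <- intersect(mandatory, names(df))
vm <- nrow(df) * length(present)
vm_misg <- sum(is.na(df[present]))
value_completeness <- 100 * (vm - vm_misg) / vm

# Subject completeness: share of subjects without any missing mandatory value.
# Assumption: one row per subject.
s_inc <- sum(!complete.cases(df[present]))
subject_completeness <- 100 * (nrow(df) - s_inc) / nrow(df)

print(c(item = item_completeness,
        value = value_completeness,
        subject = subject_completeness))
```

In this sketch the intermediate counts play the role of the parameters im_misg, vm_misg, and s_inc that accompany the indicators in a report.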
In addition to the indicators, the reports include the underlying parameters and provide enough information to help users resolve the detected DQ issues. dqLib provides functions to report on the following DQ issues and related parameters:
Abbreviation | DQ Parameter | Description |
---|---|---|
im_misg | missing mandatory data items | number of missing mandatory data items |
vm_misg | missing mandatory data values | number of missing mandatory data values |
s_inc | incomplete subjects | number of incomplete subject records |
vo | outlier values | number of detected outlier values |
vc | contradictory values | number of detected contradictory data values |
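As an illustration of how outlier values could be counted against range rules, the sketch below applies a small rule table to toy observations in base R. It is not the dqLib implementation. The rule table layout (item, min, max), the item names, the bounds, and the rate formula are assumptions; such a table could, for example, be loaded from a spreadsheet.

```r
# Hypothetical range rules, e.g. as they might be read from a spreadsheet.
rules <- data.frame(
  item = c("age", "heart_rate"),
  min  = c(0, 20),
  max  = c(120, 250),
  stringsAsFactors = FALSE
)

# Toy observations (illustrative values only).
obs <- data.frame(
  age        = c(54, 131, 67, NA),
  heart_rate = c(72, 80, 300, 65)
)

# A value is an outlier if it is present and lies outside [lo, hi].
is_outlier <- function(x, lo, hi) !is.na(x) & (x < lo | x > hi)

# One logical column per rule; rows correspond to the rows of obs.
flags <- mapply(
  function(item, lo, hi) is_outlier(obs[[item]], lo, hi),
  rules$item, rules$min, rules$max
)

# Number of outlier values and an assumed range plausibility rate in percent.
vo <- sum(flags)
checked <- sum(!is.na(obs[rules$item]))
range_plausibility <- 100 * (checked - vo) / checked

# Row and column positions of the flagged values, useful for tracing them back.
print(which(flags, arr.ind = TRUE))
```

Keeping the positions of the flagged values, rather than only the count, is what allows a report to point users to the affected records.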
dqLib also provides functions to assess the following specific indicators for RD data:
Abbreviation | Name | DQ Dimension |
---|---|---|
dqi_un_cur | RD Case Unambiguity Rate | Uniqueness |
dqi_un_cdr | RD Case Dissimilarity Rate | Uniqueness |
dqi_co_icr | Orphacoding Completeness Rate | Completeness |
dqi_pl_opr | Orphacoding Plausibility Rate | Plausibility |
dqi_cc_rvl | Concordance with Reference Values from Literature | Concordance |
Moreover, dqLib enables annual assessments of selected DQ parameters. The following RD-specific metrics are already implemented:
Abbreviation | DQ Parameter | Description |
---|---|---|
rdCase | RD cases | number of RD cases |
orphaCase | Orpha cases | number of available orpha-coded cases |
tracerCase | tracer cases | number of tracer cases |
rdCase_rel | RD cases rel. frequency | relative frequency of RD cases |
orphaCase_rel | Orpha cases rel. frequency | relative frequency of Orpha cases normalized to 100,000 inpatient cases |
tracerCase_rel | tracer cases rel. frequency | relative frequency of tracer cases normalized to 100,000 inpatient cases |
tracerCase_rel_min | minimal tracer cases in reference values | min. rel. frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
tracerCase_rel_max | maximal tracer cases in reference values | max. rel. frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
vm_case_misg | missing mandatory data values in case module | number of missing mandatory data values in the case module |
rdCase_amb | ambiguous RD cases | number of ambiguous RD cases |
rdCase_dup | duplicated RD cases | number of duplicated RD cases |
oc_misg | missing Orphacodes | number of missing Orphacodes by tracer diagnoses |
link_ip | implausible links | number of implausible ICD-10-GM/OC links |
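The normalization used by the relative-frequency metrics can be sketched in a few lines of base R. The counts and the reference range below are placeholder values, not real data or literature values, and the concordance check shown is only one plausible reading of how a normalized frequency could be compared against a reference range.

```r
# Hypothetical yearly counts (placeholders, not real data).
inpatient_cases <- 40000
tracer_cases    <- 12

# Relative frequency normalized to 100,000 inpatient cases.
rel_per_100k <- function(n, total) n * 1e5 / total
tracerCase_rel <- rel_per_100k(tracer_cases, inpatient_cases)

# Assumed reference range (placeholder values standing in for literature values).
tracerCase_rel_min <- 10
tracerCase_rel_max <- 50

# Simple concordance check: does the observed frequency fall within the range?
concordant <- tracerCase_rel >= tracerCase_rel_min &&
  tracerCase_rel <= tracerCase_rel_max

print(tracerCase_rel)
print(concordant)
```

Normalizing to a common denominator makes yearly values comparable across sites with different numbers of inpatient cases.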
The following references are required to assess the quality of RD documentation: (1) the current version of the Alpha-ID-SE terminology [4], and (2) a reference for tracer diagnoses, such as the list provided in [1].
- CordDqChecker: A reporting tool for DQ assessments of RD data, implemented using dqLib. The code repository of CordDqChecker includes some examples of DQ reports generated using synthetic data.
- CvdDqChecker: A reporting tool for assessing the quality of CVD data. This tool was also implemented using dqLib. The ./Export folder contains exemplary DQ reports and visualizations.
- To cite dqLib, please use the CITATION file in the folder ./inst.
- Acknowledgment: This work was partially funded by the German Center for Cardiovascular Research (DZHK), grant number 81X1300117, and the "Collaboration on Rare Diseases" of the Medical Informatics Initiative (CORD-MI) under grant number 01ZZ1911R, FKZ-01ZZ1911R.
[1] Tahar et al. Rare Diseases in Hospital Information Systems — An Interoperable Methodology for Distributed Data Quality Assessments. DOI: 10.1055/a-2006-1018
[2] Tahar et al. Local Data Quality Assessments on EHR-Based Real-World Data for Rare Diseases. DOI: 10.3233/SHTI230121
[3] Tahar K. CvdDqChecker: A Software Solution for Explainable and Traceable Assessments of Cardiovascular Disease Data Quality. Available from GitHub
[4] BfArM - Alpha-ID-SE. Available from BfArM