This GitHub repository contains the code used to process the raw data into the final results for the study Lape et al., "A survey of pathogenic involvement in non-communicable human diseases" (2025). See below for citation information. The repository has since been updated to include an analysis that uses Phecodes instead of ICD10 codes as the outcome variable.
This code is made available along with explanatory flowcharts below to enable the replication of the results reported in the associated manuscript. UK Biobank data and TriNetX data must be obtained from the respective organizations.
Many relationships between pathogens and human disease are well-established. However, only a small fraction involve diseases considered non-communicable (NCDs). In this study, we sought to leverage the vast amount of newly available electronic health record data to identify potentially novel pathogen-NCD associations and find additional evidence supporting known associations. We leverage data from the UK Biobank and TriNetX to perform a systematic survey across 20 pathogens and 426 diseases, primarily NCDs. To this end, we assess the association between disease status and infection history proxies using a logistic regression-based statistical approach. Our approach identifies 206 pathogen-disease pairs that replicate in both cohorts. We replicate many established relationships, including links between Helicobacter pylori and several gastroenterological diseases and connections between Epstein-Barr virus and both multiple sclerosis and lupus. Overall, our approach identifies evidence of association for 15 pathogens and 96 distinct diseases, including a currently controversial link between human cytomegalovirus (CMV) and ulcerative colitis (UC). We validate the CMV-UC connection through two orthogonal analyses, revealing increased CMV gene expression in UC patients and enrichment for UC genetic risk signal near human genes that have altered expression upon CMV infection. Collectively, these results form a foundation for future investigations into mechanistic roles played by pathogens in the processes underlying NCDs. All results are easily accessible on our website, https://tf.cchmc.org/pathogen-disease.
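For readers unfamiliar with the statistical idea, the following is a minimal, standard-library-only sketch of a logistic regression of disease status on a single binary infection proxy. It is not the code from this repository: the function name `fit_logistic`, the toy counts, and the choice of a plain Newton-Raphson fit are all assumptions made for illustration, and any covariate adjustment used in the study is omitted.

```python
import math


def _score_and_info(b0, b1, x, y):
    """Gradient and information matrix of the log-likelihood at (b0, b1)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # predicted P(disease)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_logistic(x, y, max_iter=50, tol=1e-12):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, se_b1), where se_b1 is the Wald standard error of b1.
    Hypothetical helper for illustration only.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_info(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Variance of b1 is the matching diagonal entry of the inverse information matrix
    _, _, h00, h01, h11 = _score_and_info(b0, b1, x, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b0, b1, se_b1


# Made-up toy cohort (NOT study data): x = 1 if seropositive, y = 1 if diseased.
# 30 seropositive cases, 70 seropositive controls,
# 10 seronegative cases, 90 seronegative controls.
x = [1] * 30 + [1] * 70 + [0] * 10 + [0] * 90
y = [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90

b0, b1, se_b1 = fit_logistic(x, y)
odds_ratio = math.exp(b1)
print(f"log OR = {b1:.4f}, OR = {odds_ratio:.3f}, SE = {se_b1:.4f}")
```

With a single binary predictor and no covariates, the fitted coefficient is the log odds ratio of the corresponding 2x2 table, which gives a simple sanity check on the fit.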
All patient identifiers are generic and don't correspond to actual identifiers from either UK Biobank (UKB) or TriNetX (TNX). They are presented to make it easier to follow the code as well as inputs and outputs.
- R 4.x
- Python 3.x
R Libraries | Python Libraries |
---|---|
DT | numpy |
MASS | pandas |
argparse | scipy |
data.table | sklearn |
dplyr | statsmodels |
glue | matplotlib |
logistf | seaborn |
openxlsx | tabulate |
performance | tqdm |
progress | xlrd |
pryr | |
readxl | |
stringr | |
tidyr | |
vroom | |
writexl | |
PheWAS v0.99.6.1 | |
- GNU Parallel v20220122
Tange, O. (2022, January 22). GNU Parallel 20220122 ('20 years'). Zenodo. https://doi.org/10.5281/zenodo.5893336
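As a convenience, the Python dependencies listed above can be checked before running anything. The snippet below is a small sketch, not part of the repository; the helper name `missing_modules` is hypothetical, and the list uses import names (for example, `sklearn` is installed as the `scikit-learn` package).

```python
import importlib.util

# Import names taken from the "Python Libraries" column of the table above.
REQUIRED = [
    "numpy", "pandas", "scipy", "sklearn", "statsmodels",
    "matplotlib", "seaborn", "tabulate", "tqdm", "xlrd",
]


def missing_modules(names):
    """Return the module names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing Python libraries: " + ", ".join(missing))
else:
    print("All listed Python libraries were found.")
```

This only confirms that each module can be located; it does not check versions.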
Code from this repository may be cited as:
Mike Lape and Kevin Ernst. (2025). WeirauchLab/pathogen_ncd. Zenodo. https://doi.org/10.5281/zenodo.8423555
The associated manuscript is Lape, et al., Communications Medicine 2025:
Lape, M., Schnell, D., Parameswaran, S. et al. A survey of pathogenic involvement in non-communicable human diseases. Commun Med 5, 242 (2025). https://doi.org/10.1038/s43856-025-00956-x
Please contact the co-corresponding authors of the manuscript via email with any questions or suggestions.
Name | Institution | Remarks |
---|---|---|
Mike Lape, PhD | University of Cincinnati | primary author |
Kevin Ernst | Cincinnati Children's Hospital | contributor |
Analysis source code is © 2023–2025 Cincinnati Children's Hospital Medical
Center and Mike Lape. Web site source code (the web subdirectory) is ©
2023–2025 Cincinnati Children's Hospital Medical Center, Mike Lape, and Kevin
Ernst.
Released under the terms of the GNU General Public License, Version 3. See
LICENSE.txt.