
WeirauchLab/pathogen_ncd


A survey of pathogenic involvement in non-communicable human diseases


Study Design

Study Overview

Repository Description


This GitHub repository contains the code used to process raw data into the final results for the study Lape et al., "A survey of pathogenic involvement in non-communicable human diseases" (2025). See below for citation information. The repository has been updated to include an analysis that uses Phecodes instead of ICD10 codes as the outcome variable.

The code is provided together with the explanatory flowcharts below so that the results reported in the associated manuscript can be replicated. UK Biobank and TriNetX data must be obtained from the respective organizations.

Manuscript Abstract

Background

Many relationships between pathogens and human disease are well-established. However, only a small fraction involve diseases considered non-communicable (NCDs). In this study, we sought to leverage the vast amount of newly available electronic health record data to identify potentially novel pathogen-NCD associations and find additional evidence supporting known associations.

Methods

We leverage data from The UK Biobank and TriNetX to perform a systematic survey across 20 pathogens and 426 diseases, primarily NCDs. To this end, we assess the association between disease status and infection history proxies using a logistic regression-based statistical approach.

Results

Our approach identifies 206 pathogen-disease pairs that replicate in both cohorts. We replicate many established relationships, including Helicobacter pylori with several gastroenterological diseases and connections between Epstein-Barr virus and both multiple sclerosis and lupus. Overall, our approach identifies evidence of association for 15 pathogens and 96 distinct diseases, including a currently controversial link between human cytomegalovirus (CMV) and ulcerative colitis (UC). We validate the CMV-UC connection through two orthogonal analyses, revealing increased CMV gene expression in UC patients and enrichment for UC genetic risk signal near human genes that have altered expression upon CMV infection.

Conclusions

Collectively, these results form a foundation for future investigations into mechanistic roles played by pathogens in the processes underlying NCDs. All results are easily accessible on our website, https://tf.cchmc.org/pathogen-disease.
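The Methods paragraph of the abstract describes a logistic regression-based test of association between disease status and an infection history proxy. The following is a minimal sketch of that general idea only, not the repository's pipeline: it fits an unadjusted model with one binary exposure using Newton-Raphson in plain Python. The function name `fit_logistic` and all counts are hypothetical and chosen for illustration; the actual analysis presumably uses the R and Python libraries listed below and may include covariates that are not described in this README.

```python
import math

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    x, y are equal-length lists of 0/1 values (hypothetical exposure and
    disease indicators). Returns (b0, b1, standard error of b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the exposure coefficient
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det   # Newton step: H^-1 g
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    # Variance of b1 is the [1,1] entry of the inverse information matrix,
    # evaluated within tol of the converged estimates.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

# Hypothetical 2x2 counts: 30/100 seropositive and 15/100 seronegative have the disease.
exposure = [1] * 100 + [0] * 100
disease = [1] * 30 + [0] * 70 + [1] * 15 + [0] * 85

b0, b1, se = fit_logistic(exposure, disease)
z = b1 / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Wald p-value
print(f"odds ratio = {math.exp(b1):.3f}, Wald p = {p_value:.4f}")
```

With a single binary exposure, the fitted odds ratio equals the cross-product ratio of the 2x2 table, which makes the sketch easy to check by hand.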

General Notes

All patient identifiers are generic and don't correspond to actual identifiers from either UK Biobank (UKB) or TriNetX (TNX). They are presented to make it easier to follow the code as well as inputs and outputs.

Software Versions

Languages utilized

  • R 4.x
  • Python 3.x

Additional Libraries

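| R Libraries | Python Libraries |
| --- | --- |
| DT | numpy |
| MASS | pandas |
| argparse | scipy |
| data.table | sklearn |
| dplyr | statsmodels |
| glue | matplotlib |
| logistf | seaborn |
| openxlsx | tabulate |
| performance | tqdm |
| progress | xlrd |
| pryr | |
| readxl | |
| stringr | |
| tidyr | |
| vroom | |
| writexl | |
| PheWAS v0.99.6.1 | |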

Other 3rd party software

Flowcharts for primary analysis using diagnoses and serology data

Key for Diagrams

[Figure: Color Key] [Figure: Shape Key]

ICD10 Analysis

UK Biobank

Data Prep

UKB ICD Data Prep

Analysis

UKB ICD analysis

Permutations and Empirical P-values

UKB ICD Permutations

UKB ICD Permutations Continued
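The flowcharts for this step are not reproduced in text form, so the exact permutation scheme and test statistic used in the repository are not stated here. As a generic illustration of how an empirical p-value is usually derived from permutations, the sketch below shuffles the outcome labels, recomputes a statistic each time, and applies the common add-one estimator. All function names, the choice of statistic (absolute difference in disease proportion), and the data are assumptions made for this example.

```python
import random

def empirical_p_value(observed, permuted_stats):
    """Add-one estimator: (1 + #{permuted >= observed}) / (1 + n_permutations)."""
    exceed = sum(1 for s in permuted_stats if s >= observed)
    return (1 + exceed) / (1 + len(permuted_stats))

def abs_proportion_difference(exposure, outcome):
    """Hypothetical statistic: |P(outcome | exposed) - P(outcome | unexposed)|."""
    exposed = [o for e, o in zip(exposure, outcome) if e == 1]
    unexposed = [o for e, o in zip(exposure, outcome) if e == 0]
    return abs(sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed))

def permutation_stats(exposure, outcome, statistic, n_perm=1000, seed=0):
    """Recompute the statistic after shuffling outcome labels n_perm times."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    shuffled = list(outcome)       # copy, so the caller's list is untouched
    stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stats.append(statistic(exposure, shuffled))
    return stats

# Hypothetical data: 100 exposed and 100 unexposed individuals.
exposure = [1] * 100 + [0] * 100
outcome = [1] * 30 + [0] * 70 + [1] * 15 + [0] * 85

observed = abs_proportion_difference(exposure, outcome)
null_stats = permutation_stats(exposure, outcome, abs_proportion_difference)
print("empirical p-value:", empirical_p_value(observed, null_stats))
```

The add-one form keeps the estimate strictly above zero, so the smallest reportable p-value is 1 / (1 + n_perm).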

TriNetX

Data Prep

TNX ICD Data Prep

Analysis

TNX ICD Analysis

Results Post-processing

ICD Post-processing

Phecode Analysis

UK Biobank

Data Prep

UKB Phecode Data Prep

Analysis

UKB Phecode analysis

TriNetX

Data Prep

TNX Phecode Data Prep

Analysis

TNX Phecode Analysis

Results Post-processing

Phecode Post-processing

How to Cite

Code from this repository may be cited as:

Mike Lape and Kevin Ernst. (2025). WeirauchLab/pathogen_ncd. Zenodo. https://doi.org/10.5281/zenodo.8423555

The associated manuscript is Lape, et al., Communications Medicine 2025:

Lape, M., Schnell, D., Parameswaran, S. et al. A survey of pathogenic involvement in non-communicable human diseases. Commun Med 5, 242 (2025). https://doi.org/10.1038/s43856-025-00956-x

Feedback

Please contact the co-corresponding authors of the manuscript via email with any questions or suggestions.

Contributors

| Name | Institution | Remarks |
| --- | --- | --- |
| Mike Lape, PhD | University of Cincinnati | primary author |
| Kevin Ernst | Cincinnati Children's Hospital | contributor |

License

Analysis source code is © 2023–2025 Cincinnati Children's Hospital Medical Center and Mike Lape. Web site source code (the web subdirectory) is © 2023–2025 Cincinnati Children's Hospital Medical Center, Mike Lape, and Kevin Ernst.

Released under the terms of the GNU General Public License, Version 3. See LICENSE.txt

