This R script extracts structured phenotype data from unstructured doctor's notes.
It cleans, standardizes, maps diagnoses to ICD-10 codes, applies one-hot encoding,
and exports a ready-to-use phenotype matrix for machine learning & statistical analysis.
- Parses doctor's notes into structured data using regex & NLP
- Handles missing values & normalizes blood pressure, weight, age
- Maps diagnoses to ICD-10 codes for standardization
- One-hot encodes categorical data (diagnosis & meds) for ML
- Saves phenotype_matrix.csv for database integration & research
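The parsing and ICD-10 mapping steps listed above could look roughly like the following. This is a minimal sketch, not the script's actual code: the note wording, the regular expressions, and the two-entry `icd10` lookup table are assumptions made for illustration, and real notes will need more patterns.

```r
library(stringr)

# Hypothetical note format; real notes may be worded differently.
notes <- c(
  S001 = "Pt is a 56 y/o male, weight 81 kg, BP 140/90. Dx: Hypertension. Rx: Lisinopril.",
  S002 = "Pt is a 43 y/o female, BP 118/76. Dx: Asthma. Rx: Albuterol."
)

# str_match() returns a matrix: column 1 is the full match, the following
# columns are the capture groups. A note with no match yields NA.
age    <- as.integer(str_match(notes, "(\\d+)\\s*y/o")[, 2])
weight <- as.numeric(str_match(notes, "weight\\s+(\\d+(?:\\.\\d+)?)\\s*kg")[, 2])
bp     <- str_match(notes, "BP\\s+(\\d+)/(\\d+)")
dx     <- str_match(notes, "Dx:\\s*([^.]+)\\.")[, 2]

# Illustrative diagnosis-to-ICD-10 lookup (an assumed, incomplete subset).
icd10 <- c(Hypertension = "I10", Asthma = "J45")

parsed <- data.frame(
  sample_id = names(notes),
  age       = age,
  weight_kg = weight,
  systolic  = as.integer(bp[, 2]),
  diastolic = as.integer(bp[, 3]),
  diagnosis = dx,
  icd10     = unname(icd10[dx]),
  stringsAsFactors = FALSE
)

print(parsed)
```

The second note contains no weight, so `weight_kg` is `NA` for S002 instead of raising an error. How such gaps are filled (imputation, exclusion, or leaving them as `NA`) is a separate decision that this sketch does not make.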
Install the required packages:
install.packages(c("dplyr", "tidyr", "stringr"))
Usage
Prepare your raw doctor's notes in a structured text file.
Run the script to extract structured data:
source("generate_phenotype_matrix.R")
Upload phenotype_matrix.csv to your lab's database.
Output Example
| sample_id | age | weight_kg | systolic | diastolic | diagnosis_Hypertension | diagnosis_Diabetes | diagnosis_Asthma | diagnosis_Cardiovascular_Disease | med_Lisinopril | med_Metformin | med_Albuterol | med_Atorvastatin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S001 | 56 | 81 | 140 | 90 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
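Indicator columns such as `diagnosis_Hypertension` can be produced by one-hot encoding a categorical column. The sketch below uses `tidyr::pivot_wider()` on a toy table; the input data frame and the helper column `present` are hypothetical, and the script itself may build these columns differently.

```r
library(dplyr)
library(tidyr)

# Toy input with one diagnosis per sample (illustrative values only).
long <- data.frame(
  sample_id = c("S001", "S002"),
  diagnosis = c("Hypertension", "Diabetes"),
  stringsAsFactors = FALSE
)

one_hot <- long %>%
  mutate(present = 1L) %>%           # mark the diagnosis each sample has
  pivot_wider(
    names_from   = diagnosis,        # one new column per distinct diagnosis
    values_from  = present,
    names_prefix = "diagnosis_",     # yields e.g. diagnosis_Hypertension
    values_fill  = 0L                # absent diagnoses become 0
  )

print(one_hot)
```

Only diagnoses that occur in the data get a column here. If the matrix must always contain a fixed set of columns (for example `diagnosis_Asthma` even when no sample has asthma), the missing columns need to be added explicitly.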
Database Integration
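If using a SQL database, load the matrix into it as follows (this example uses SQLite and requires the DBI and RSQLite packages):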
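library(DBI)
# Open the SQLite database file (created if it does not exist yet)
con <- dbConnect(RSQLite::SQLite(), dbname = "lab_database.sqlite")
# Write the matrix to the phenotype_data table, replacing any existing copy
dbWriteTable(con, "phenotype_data", read.csv("phenotype_matrix.csv"), overwrite = TRUE)
dbDisconnect(con)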
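After loading, it can be worth confirming that the table arrived intact. The following sketch does this against an in-memory SQLite database with a one-row toy table, so it does not touch lab_database.sqlite. The toy data and both queries are illustrative assumptions.

```r
library(DBI)

# In-memory database used only for this demonstration.
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Toy stand-in for the contents of phenotype_matrix.csv.
toy <- data.frame(
  sample_id = "S001",
  age = 56L,
  diagnosis_Hypertension = 1L,
  stringsAsFactors = FALSE
)
dbWriteTable(con, "phenotype_data", toy, overwrite = TRUE)

# Row count should match the number of rows written.
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM phenotype_data")$n

# One-hot columns can be filtered directly in SQL.
hypertensive <- dbGetQuery(
  con,
  "SELECT sample_id FROM phenotype_data WHERE diagnosis_Hypertension = 1"
)

dbDisconnect(con)
```

Against the real database, the same queries run after replacing `":memory:"` with the database file name.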