A lightweight R script for text mining and harmonizing medical phenotype data. Cleans, standardizes, and maps diagnoses to ICD-10 codes, with clinical annotations for enhanced data usability.

jcaperella29/clinical-text-mining_R_SCRIPT

clinical-text-mining_R_SCRIPT

🏥 Medical Phenotype Extraction from Doctor's Notes 🩺

📜 Overview

This R script extracts structured phenotype data from unstructured doctor's notes.
It cleans, standardizes, maps diagnoses to ICD-10 codes, applies one-hot encoding,
and exports a ready-to-use phenotype matrix for machine learning & statistical analysis.
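
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the note layout, the helper `extract_first`, and the second sample note are all hypothetical, and the sketch uses base R regex functions rather than stringr so that it runs without extra packages.

```r
# Hypothetical note layout; the real input format is not documented in this README.
notes <- c(
  "Patient S001: 56 y/o, weight 81 kg, BP 140/90. Dx: Hypertension. Rx: Lisinopril.",
  "Patient S002: 43 y/o, weight 70 kg, BP 120/80. Dx: Asthma. Rx: Albuterol."
)

# Hypothetical helper: return the first capture group of `pattern`
# for each element of `x`, or NA when the pattern does not match.
extract_first <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# One row per note, one column per extracted field.
parsed <- data.frame(
  sample_id  = extract_first(notes, "Patient\\s+(S\\d+)"),
  age        = as.integer(extract_first(notes, "(\\d+)\\s*y/o")),
  weight_kg  = as.numeric(extract_first(notes, "weight\\s+(\\d+(?:\\.\\d+)?)\\s*kg")),
  systolic   = as.integer(extract_first(notes, "BP\\s+(\\d+)/\\d+")),
  diastolic  = as.integer(extract_first(notes, "BP\\s+\\d+/(\\d+)")),
  diagnosis  = extract_first(notes, "Dx:\\s*([^.]+)\\."),
  medication = extract_first(notes, "Rx:\\s*([^.]+)\\."),
  stringsAsFactors = FALSE
)

print(parsed)
```

Fields that a note does not mention come back as `NA`, which leaves the decision about how to handle missing values to a later step.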

🔬 Features

✅ Parses doctor’s notes into structured data using regex & NLP
✅ Handles missing values & normalizes blood pressure, weight, age
✅ Maps diagnoses to ICD-10 codes for standardization
✅ One-hot encodes categorical data (diagnosis & meds) for ML
✅ Saves phenotype_matrix.csv for database integration & research
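
As an illustration of the ICD-10 mapping and one-hot encoding features, here is a minimal base R sketch. The lookup table is an assumption made for this example, not the script's own table: it uses category-level codes only, and mapping "Diabetes" to E11 assumes type 2 diabetes. The sample rows S002 and S003 are hypothetical.

```r
# Hypothetical parsed data; only S001 appears in this README's output example.
phenotypes <- data.frame(
  sample_id = c("S001", "S002", "S003"),
  diagnosis = c("Hypertension", "Asthma", "Diabetes"),
  stringsAsFactors = FALSE
)

# Assumed lookup table (category-level ICD-10 codes).
# "Diabetes" -> E11 assumes type 2 diabetes mellitus.
icd10_lookup <- c(Hypertension = "I10", Diabetes = "E11", Asthma = "J45")

# Diagnoses missing from the lookup table map to NA instead of raising an error.
phenotypes$icd10 <- unname(icd10_lookup[phenotypes$diagnosis])

# One-hot encode: one 0/1 integer column per distinct diagnosis.
levels_dx <- sort(unique(phenotypes$diagnosis))
one_hot <- vapply(
  levels_dx,
  function(dx) as.integer(phenotypes$diagnosis == dx),
  integer(nrow(phenotypes))
)
colnames(one_hot) <- paste0("diagnosis_", levels_dx)

phenotype_matrix <- cbind(phenotypes[c("sample_id", "icd10")], one_hot)
print(phenotype_matrix)
```

The same pattern would apply to medications with a `med_` prefix, which matches the column naming in the output example.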


⚙️ Installation & Dependencies

install.packages(c("dplyr", "tidyr", "stringr"))

🚀 Usage

1. Prepare your raw doctor’s notes in a structured text file.
2. Run the script to extract structured data:

source("generate_phenotype_matrix.R")

3. Upload phenotype_matrix.csv to your lab’s database.

📂 Output Example
sample_id	age	weight_kg	systolic	diastolic	diagnosis_Hypertension	diagnosis_Diabetes	diagnosis_Asthma	diagnosis_Cardiovascular_Disease	med_Lisinopril	med_Metformin	med_Albuterol	med_Atorvastatin
S001	56	81	140	90	1	0	0	0	1	0	0	0
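
Before uploading, it can help to confirm that the exported file has the layout shown above. The following is a rough sketch, assuming the columns follow the `diagnosis_` and `med_` prefixes from the example. It writes the example row to a temporary file as a stand-in for the real phenotype_matrix.csv, and the blood pressure check is a hypothetical plausibility rule, not something the script is known to enforce.

```r
# Rebuild the example row and write it to a temporary CSV,
# standing in for the real phenotype_matrix.csv.
example <- data.frame(
  sample_id = "S001", age = 56, weight_kg = 81, systolic = 140, diastolic = 90,
  diagnosis_Hypertension = 1, diagnosis_Diabetes = 0, diagnosis_Asthma = 0,
  diagnosis_Cardiovascular_Disease = 0,
  med_Lisinopril = 1, med_Metformin = 0, med_Albuterol = 0, med_Atorvastatin = 0
)
csv_path <- tempfile(fileext = ".csv")
write.csv(example, csv_path, row.names = FALSE)

pm <- read.csv(csv_path, stringsAsFactors = FALSE)

# One-hot columns are those prefixed diagnosis_ or med_; each may hold only 0 or 1.
flag_cols <- grep("^(diagnosis|med)_", names(pm), value = TRUE)
all_binary <- all(vapply(pm[flag_cols],
                         function(col) all(col %in% c(0, 1)),
                         logical(1)))

# Hypothetical plausibility check: systolic pressure should exceed diastolic.
bp_ok <- all(pm$systolic > pm$diastolic, na.rm = TRUE)

print(c(all_binary = all_binary, bp_ok = bp_ok))
```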
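🏥 Database Integration

To load the matrix into a SQLite database, run the code below. It needs the DBI and RSQLite packages, which are not part of the install command above:

library(DBI)

# Open (or create) the SQLite file, replace the phenotype_data table, then close the connection.
con <- dbConnect(RSQLite::SQLite(), dbname = "lab_database.sqlite")
dbWriteTable(con, "phenotype_data", read.csv("phenotype_matrix.csv"), overwrite = TRUE)
dbDisconnect(con)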
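
Once the table is loaded, it can be queried like any other SQL table. The sketch below is one possible check, assuming the DBI and RSQLite packages are installed. It uses an in-memory database and a cut-down stand-in for the CSV so that it leaves no file behind, and the systolic threshold of 130 is an arbitrary value chosen for illustration.

```r
library(DBI)

# In-memory database for illustration; the README itself writes to lab_database.sqlite.
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Stand-in for read.csv("phenotype_matrix.csv"), using a few columns of the example row.
pm <- data.frame(
  sample_id = "S001", age = 56L, systolic = 140L, diastolic = 90L,
  diagnosis_Hypertension = 1L,
  stringsAsFactors = FALSE
)
dbWriteTable(con, "phenotype_data", pm, overwrite = TRUE)

# Parameterised query: hypertensive samples at or above a systolic threshold.
hits <- dbGetQuery(
  con,
  "SELECT sample_id, systolic FROM phenotype_data
   WHERE diagnosis_Hypertension = 1 AND systolic >= ?",
  params = list(130)
)

dbDisconnect(con)
print(hits)
```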
