Integrative Analysis of cfDNA Methylation and miRNA Expression for Early Lung Cancer Detection Using Machine Learning
This project focuses on the early detection of lung cancer using an integrated analysis of two non-invasive biomarkers: cell-free DNA (cfDNA) methylation profiles and circulating miRNA expression profiles.
A Streamlit app accompanies the project. It uses a Random Forest model trained on cfDNA methylation + miRNA expression features (`gene1`, `gene2`, `gene3`, `miRNA_21`, `miRNA_34a`) to predict lung cancer probability.
We built and evaluated multiple machine learning models to classify lung cancer vs normal samples, highlighting key biomarkers and reaching roughly 72-75% accuracy with ROC-AUC values of 0.81-0.84. The project is intended for clinical research, diagnostic development, and bioinformatics applications.
cfDNA_LungCancer_ML/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── notebooks/
│   ├── 1_Data_Preprocessing.ipynb
│   ├── 2_Data_Integration_and_Labeling.ipynb
│   ├── 3_Model_Training_and_Evaluation.ipynb
│   ├── 4_Model_Evaluation_and_Visualization.ipynb
│   ├── 5_Model_Inference_and_Usage.ipynb
│   └── 6_Research_Report_and_Literature_Review.ipynb
│
├── results/
│   └── plots/
│
├── models/
├── streamlit_app/
│   └── app.py
│
├── requirements.txt
└── README.md
Objectives:
- Integrate cfDNA methylation & miRNA expression data.
- Train classification models for early lung cancer detection.
- Identify top biomarkers using model interpretability.
- Visualize model performance (ROC, confusion matrix).
- Deploy prediction interface using Streamlit.
Tools and technologies:
- Python (NumPy, Pandas, Matplotlib, Seaborn)
- Machine Learning (Scikit-learn, XGBoost)
- Bioinformatics concepts (cfDNA, miRNA biomarker profiling)
- Deployment: Streamlit
- Model Persistence: joblib
- Data Source: TCGA-LUAD datasets
This project uses two primary datasets focused on early-stage lung cancer detection through blood-based biomarkers:
- Type: Cell-free DNA (cfDNA) methylation beta-values
- Source: Public repository (GEO / published supplementary data)
- Shape: 200+ samples × ~10,000 CpG sites
- Preprocessing:
  - Removed low-variance CpGs
  - Filtered NA/missing values
  - Normalized values if necessary
- Target Labels: `Benign` vs `Malignant` (based on clinical metadata)
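The cfDNA preprocessing steps listed above could look roughly like the sketch below. The toy matrix, the CpG identifiers, the function name `preprocess_methylation`, and the variance cutoff of 0.001 are illustrative assumptions, not values taken from this project. The sketch also assumes that missing values are handled by dropping whole CpG columns, which the list above does not specify.

```python
import numpy as np
import pandas as pd


def preprocess_methylation(beta: pd.DataFrame, min_variance: float = 1e-3) -> pd.DataFrame:
    """Drop CpG columns with missing values, then drop low-variance CpGs."""
    # Assumption: a CpG site with any missing beta-value is removed entirely.
    complete = beta.dropna(axis=1, how="any")
    # Keep only CpGs whose variance across samples exceeds the cutoff.
    variances = complete.var(axis=0)
    return complete.loc[:, variances > min_variance]


# Toy beta-value matrix: rows are samples, columns are CpG sites (hypothetical IDs).
beta = pd.DataFrame(
    {
        "cg0001": [0.10, 0.80, 0.15, 0.75],    # varies between samples
        "cg0002": [0.50, 0.50, 0.50, 0.50],    # zero variance
        "cg0003": [0.20, np.nan, 0.30, 0.90],  # contains a missing value
    },
    index=["P1", "P2", "P3", "P4"],
)

filtered = preprocess_methylation(beta)
print(filtered.columns.tolist())
```

On a real matrix of roughly 10,000 CpG sites, the cutoff would normally be chosen from the observed variance distribution rather than fixed in advance.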
- Type: Circulating miRNA expression profiles from patient plasma/serum
- Source: Public domain / research publication
- Shape: 200+ samples × ~300 miRNAs
- Preprocessing:
  - Removed low-expression features
  - Filtered missing values
  - Log-transformation applied
- Target Labels: Aligned to cfDNA samples via patient ID
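For the miRNA side, the low-expression filter and log-transformation might be sketched as follows. The mean-expression threshold of 1.0, the `log2(x + 1)` form of the transform, the function name `preprocess_mirna`, and the toy values are all assumptions made for illustration; the project does not state which threshold or log base it used.

```python
import numpy as np
import pandas as pd


def preprocess_mirna(expr: pd.DataFrame, min_mean: float = 1.0) -> pd.DataFrame:
    """Drop miRNAs with low mean expression, then apply log2(x + 1)."""
    # Keep miRNAs whose mean expression across samples reaches the threshold.
    keep = expr.mean(axis=0) >= min_mean
    # The +1 pseudocount keeps zero counts finite after the log transform.
    return np.log2(expr.loc[:, keep] + 1.0)


# Toy expression matrix: rows are samples, columns are miRNAs.
expr = pd.DataFrame(
    {
        "miRNA_21": [7.0, 15.0, 3.0],
        "miRNA_34a": [1.0, 3.0, 0.0],
        "miRNA_low": [0.0, 0.5, 0.1],  # hypothetical low-expression feature
    },
    index=["P1", "P2", "P3"],
)

logged = preprocess_mirna(expr)
print(logged.round(2))
```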
After aligning both datasets by patient ID, a merged matrix was created containing:
- cfDNA methylation features
- miRNA expression features
- Combined target label (Benign vs Malignant)
This merged dataset was used for feature selection and model training.
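A minimal sketch of the patient-level alignment described above, using `pandas.DataFrame.merge`. The column name `patient_id`, the feature names, and the toy values are assumptions; the real tables may use a different identifier column.

```python
import pandas as pd

# Toy cfDNA methylation table with the clinical label (hypothetical values).
methylation = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P3"],
        "cg0001": [0.10, 0.80, 0.15],
        "label": ["Benign", "Malignant", "Benign"],
    }
)

# Toy miRNA table covering a partly different set of patients.
mirna = pd.DataFrame(
    {
        "patient_id": ["P2", "P3", "P4"],
        "miRNA_21": [4.0, 2.0, 3.5],
    }
)

# An inner join keeps only patients present in both datasets;
# validate="one_to_one" raises an error if a patient ID is duplicated.
merged = methylation.merge(mirna, on="patient_id", how="inner", validate="one_to_one")

# Move the label to the last column so the features sit together.
merged = merged[["patient_id", "cg0001", "miRNA_21", "label"]]
print(merged)
```

An inner join silently drops patients missing from either table, so it is worth comparing `len(merged)` with the two input sizes before training.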
Models trained:
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (SVM)
Evaluation metrics:
- Accuracy
- Precision, Recall, F1-Score
- Confusion Matrix
- ROC-AUC Curve
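The three classifiers and the metrics listed above can be computed with scikit-learn along the lines of the sketch below. It runs on synthetic data from `make_classification`, so the numbers it prints are unrelated to the project's results. The hyperparameters, the 75/25 split, and the use of `StandardScaler` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the merged cfDNA + miRNA feature matrix.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # probability=True enables predict_proba, which the ROC-AUC computation uses here.
    "SVM (RBF Kernel)": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", probability=True, random_state=0)
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.2f}, AUC={results[name]['roc_auc']:.2f}")
```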
| Model | Accuracy | AUC Score |
|---|---|---|
| Logistic Regression | ~72% | 0.81 |
| Random Forest | ~75% | 0.84 |
| SVM (RBF Kernel) | ~73% | 0.82 |
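The Random Forest model was used to rank the top 20 most important integrated features. The three models performed within about three accuracy points of each other, with Random Forest scoring highest on both accuracy and AUC.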
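One common way to obtain such a ranking is the impurity-based `feature_importances_` attribute of a fitted `RandomForestClassifier`, sketched below on synthetic data. The feature names and model settings are hypothetical, and the project may have used a different importance measure.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 30 methylation-like and 10 miRNA-like features (hypothetical names).
X_raw, y = make_classification(n_samples=200, n_features=40, n_informative=6, random_state=0)
feature_names = [f"cg{i:04d}" for i in range(30)] + [f"miRNA_{i}" for i in range(10)]
X = pd.DataFrame(X_raw, columns=feature_names)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
top20 = importances.sort_values(ascending=False).head(20)
print(top20.round(3))
```

Impurity-based importances are computed on the training data and can favor high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a useful cross-check.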
This section outlines the complete workflow for the integrative analysis of cfDNA methylation and miRNA expression, from data collection to deployment. The same pipeline is shown as a flowchart in Workflow_Diagram.png.
- Data Collection
  - cfDNA Methylation Data and miRNA Expression Data were gathered from curated sources and stored in structured CSV formats.
- Data Preprocessing
  - Cleaned missing values, normalized features, and merged datasets using a common identifier.
  - Final processed file: `merged_labeled_light.csv`
- Feature Engineering
  - Selected most informative features from both cfDNA and miRNA matrices.
  - Removed non-informative or redundant columns.
- Label Assignment
  - Assigned binary labels:
    - `0`: Healthy/Control
    - `1`: Cancer/Affected
- Model Training and Evaluation
  - Built multiple ML models (Random Forest, XGBoost, etc.).
  - Performed hyperparameter tuning and cross-validation.
  - Selected best model based on accuracy, F1-score, and ROC-AUC.
- Streamlit App Development
  - Developed an interactive web app using Streamlit.
  - Includes:
    - Visualizations
    - Model metrics
    - Live prediction area for user input
  - Integrated with GitHub and deployed via Streamlit Cloud.
- Documentation & Deployment
  - Full codebase documented and version-controlled on GitHub.
  - Results visualized and interpreted.
  - Repository structured for reuse and reproducibility.
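The hyperparameter tuning and cross-validation step in the workflow above could be sketched with `GridSearchCV` and stratified folds. The parameter grid, the five folds, and the choice of ROC-AUC as the selection metric are assumptions, and the data are synthetic. A Random Forest is used here only to keep the example free of extra dependencies; an XGBoost classifier could be tuned the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the merged, labeled feature matrix.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Small illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated ROC-AUC: {search.best_score_:.3f}")
```

Because the best score is picked from the same folds used for tuning, it tends to be optimistic; a held-out test set or nested cross-validation gives a fairer estimate.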
How to run:
- Install dependencies: `pip install -r requirements.txt`
- Run the Jupyter notebooks in order:
  - `1_Data_Preprocessing.ipynb`
  - `2_Data_Integration_and_Labeling.ipynb`
  - `3_Model_Training_and_Evaluation.ipynb`
  - `4_Model_Evaluation_and_Visualization.ipynb`
  - `5_Model_Inference_and_Usage.ipynb`
- Launch the Streamlit app: `cd streamlit_app`, then `streamlit run app.py`
- Sample input for prediction: the app accepts scaled cfDNA and miRNA expression values and classifies the input as:
  - Normal
  - Lung Cancer
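The prediction step described above might work roughly as in the following sketch: a Random Forest is saved with joblib, reloaded, and applied to one scaled input row. The model here is trained on random data purely so the example runs on its own. The file name `rf_model.joblib`, the 0.5 decision threshold, and the sample values are assumptions; the feature names and the `0`/`1` label coding follow this README.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["gene1", "gene2", "gene3", "miRNA_21", "miRNA_34a"]
LABELS = {0: "Normal", 1: "Lung Cancer"}

# Stand-in training data so the sketch is self-contained (not real measurements).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(100, len(FEATURES))), columns=FEATURES)
y_train = (X_train["gene1"] + X_train["miRNA_21"] > 0).astype(int)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "rf_model.joblib"  # hypothetical file name
    trained = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    joblib.dump(trained, model_path)   # persist the fitted model
    model = joblib.load(model_path)    # reload it, as an app would at startup

# One scaled input row, in the same column order used for training.
sample = pd.DataFrame([[1.2, -0.3, 0.5, 0.9, -1.1]], columns=FEATURES)

# Column 1 of predict_proba is the probability of class 1 (cancer).
probability = model.predict_proba(sample)[0, 1]
prediction = LABELS[int(probability >= 0.5)]
print(f"Lung cancer probability: {probability:.2f} -> {prediction}")
```

Inputs must be scaled with the same transformation used at training time; storing the scaler and model together in one scikit-learn `Pipeline` is a common way to guarantee that.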
References:
- TCGA-LUAD: The Cancer Genome Atlas – Lung Adenocarcinoma
- GEO Datasets for cfDNA methylation and miRNA
- Latest studies on non-invasive biomarkers in cancer detection
Sanjai C.
MS Bioinformatics & Immunobiology
Amrita School of Biotechnology, Amrita Vishwa Vidyapeetham
University of Arizona
Email: sanjaichippukutty@gmail.com
@misc{chippukutty2025cfDNAmiRNA,
author = {Sanjai Chippukutty},
title = {Integrative Analysis of cfDNA Methylation and miRNA Expression for Early Cancer Detection Using Machine Learning},
year = {2025},
url = {https://github.com/Sanjai-Chippukutty/cfDNA-Lung-Cancer-ML},
note = {GitHub repository}
}
# cfDNA-Lung-Cancer-ML
Integrative analysis of cfDNA methylation and miRNA expression for early lung cancer detection using machine learning.