This project compares the performance of Random Forest (RF) and Logistic Regression (LR) on the Pima Indians Diabetes Dataset, predicting the presence of diabetes in women aged 21+ of Pima Indian heritage. The goal is to evaluate model accuracy, computational efficiency, and robustness to class imbalance. Results are benchmarked against the study by Chang et al. (2023).
- Which model (RF or LR) performs better in terms of accuracy, AUC, and computational speed?
- How do feature importance and correlation impact predictions?
- Does hyperparameter tuning significantly improve performance?
- Source: Kaggle
- Samples: 768 women (8 predictors, 1 binary target)
- Features: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age
- Preprocessing (a scikit-learn sketch follows this list):
  - Replaced missing values (encoded as 0) with the mean/median.
  - Normalized predictors using min-max scaling.
  - Split the data 60:20:20 (train/validation/test).
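A minimal preprocessing sketch, assuming the Kaggle CSV column names; the file path `diabetes.csv` and the random seeds are placeholders, not part of the original project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# "diabetes.csv" is a placeholder path; column names follow the Kaggle dataset.
df = pd.read_csv("diabetes.csv")

# Zeros in these columns encode missing values (they are physiologically impossible).
for col in ["Glucose", "BloodPressure"]:                 # mean imputation
    df[col] = df[col].replace(0, df.loc[df[col] != 0, col].mean())
for col in ["SkinThickness", "Insulin", "BMI"]:          # median imputation
    df[col] = df[col].replace(0, df.loc[df[col] != 0, col].median())

X, y = df.drop(columns="Outcome"), df["Outcome"]

# 60:20:20 split: hold out 40%, then halve it into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Min-max scaling fitted on the training split only, to avoid leakage.
scaler = MinMaxScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```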
- Data Cleaning & EDA:
- Handled missing values (mean for Glucose/BP, median for Skin Thickness/Insulin/BMI).
- Analyzed correlations (e.g., Glucose and BMI strongly linked to diabetes).
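One way to reproduce the correlation analysis, assuming the imputed DataFrame `df` from the preprocessing sketch above:

```python
# Pearson correlation of each predictor with the binary target.
corr_with_target = (
    df.corr()["Outcome"]
      .drop("Outcome")
      .sort_values(ascending=False)
)
print(corr_with_target)  # Glucose and BMI should rank near the top
```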
- Model Training (see the tuning sketch below):
  - Random Forest: hyperparameters tuned via 10-fold CV (trees = 150, max splits = 40, leaf size = 30).
  - Logistic Regression: grid search over the regularization strength lambda (best = 0.001).
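A rough scikit-learn analogue of this tuning, assuming the splits from the preprocessing sketch. The parameter mapping is an assumption (trees → `n_estimators`, leaf size → `min_samples_leaf`, and `max_leaf_nodes` as a loose stand-in for "max splits"), and scikit-learn parameterizes LR regularization as C = 1/lambda, so lambda = 0.001 corresponds to C = 1000:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Random Forest: 10-fold CV over a small grid around the reported optimum.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={
        "n_estimators": [100, 150, 200],     # "trees"
        "max_leaf_nodes": [20, 40, 80],      # rough stand-in for "max splits"
        "min_samples_leaf": [10, 30, 50],    # "leaf size"
    },
    cv=10, scoring="roc_auc", n_jobs=-1,
).fit(X_train, y_train)

# Logistic Regression: grid over lambda, converted to C = 1/lambda.
lr_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [1 / lam for lam in (1e-4, 1e-3, 1e-2, 1e-1, 1.0)]},
    cv=10, scoring="roc_auc", n_jobs=-1,
).fit(X_train, y_train)
```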
- Evaluation Metrics (a helper sketch follows):
  - Accuracy, AUC, Precision, Recall, F1-Score.
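A small evaluation helper covering the five metrics, assuming the fitted searches and held-out test split from the earlier sketches:

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

def evaluate(model, X, y):
    """Report the five metrics used in the comparison."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]  # positive-class probability for AUC
    return {
        "accuracy": accuracy_score(y, pred),
        "auc": roc_auc_score(y, proba),
        "precision": precision_score(y, pred),
        "recall": recall_score(y, pred),
        "f1": f1_score(y, pred),
    }

print(evaluate(rf_search.best_estimator_, X_test, y_test))
print(evaluate(lr_search.best_estimator_, X_test, y_test))
```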
| Metric | Random Forest | Logistic Regression |
|---|---|---|
| Accuracy | 78% | 77% |
| AUC | 0.82 | 0.75 |
| Precision | 0.74 | 0.68 |
| Recall | 0.65 | 0.72 |
| F1-Score | 0.69 | 0.70 |
| Training Speed | Slow | Fast |
- RF Strengths: Higher AUC (0.82), better precision.
- LR Strengths: Faster training, better recall.
- Both models struggled with class imbalance (more "no diabetes" predictions).
- RF overfits slightly on training data but generalizes well on test data.
- Contrary to the initial hypothesis, RF produced lower true-positive and false-positive rates than expected.
Figure: top, the original dataset with missing values (encoded as 0); bottom, the same data after replacing missing values with the mean/median.
- Correlation analysis: Glucose, BMI, and Age show strong positive correlations with diabetes.
- Feature importance (see the sketch below): for RF, Glucose and BMI are the top predictors; for LR, Blood Pressure and Diabetes Pedigree Function dominate.
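One way to inspect both rankings, assuming `X` and the fitted searches from the earlier sketches; treating |coefficient| as an LR importance proxy is an assumption that only makes sense here because the inputs are min-max scaled:

```python
import pandas as pd

# RF: impurity-based importances from the fitted forest.
rf_importance = pd.Series(
    rf_search.best_estimator_.feature_importances_, index=X.columns)

# LR: absolute coefficients as a rough importance proxy (inputs are scaled).
lr_importance = pd.Series(
    abs(lr_search.best_estimator_.coef_[0]), index=X.columns)

print(rf_importance.sort_values(ascending=False))
print(lr_importance.sort_values(ascending=False))
```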
- Class distribution: imbalanced, ~65% "No Diabetes" vs. ~35% "Diabetes".
- ROC curves (a plotting sketch follows): RF (AUC = 0.82) separates the classes better than LR (AUC = 0.75).
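A sketch for overlaying both ROC curves, assuming the fitted searches and test split from earlier; `RocCurveDisplay.from_estimator` requires scikit-learn >= 1.0:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
RocCurveDisplay.from_estimator(
    rf_search.best_estimator_, X_test, y_test, name="Random Forest", ax=ax)
RocCurveDisplay.from_estimator(
    lr_search.best_estimator_, X_test, y_test, name="Logistic Regression", ax=ax)
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance level
plt.show()
```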
- Feature Selection: Critical for LR (e.g., excluding negatively correlated features improved accuracy).
- Hyperparameter Tuning: RF requires careful tuning (OOB error may outperform CV for small datasets).
- Class Imbalance: addressing the imbalance (e.g., with SMOTE) could improve recall; a sketch follows this list.
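A minimal SMOTE sketch, assuming the training split from the preprocessing sketch; `imbalanced-learn` is an extra dependency not used in the original project:

```python
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from sklearn.ensemble import RandomForestClassifier

# Oversample the minority class in the training split only; validation and
# test splits keep the natural ~65/35 class ratio so metrics stay honest.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf_balanced = RandomForestClassifier(n_estimators=150, random_state=42)
rf_balanced.fit(X_resampled, y_resampled)
```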
- Experiment with feature engineering and advanced techniques (XGBoost, SVM).
- Compare OOB error vs. cross-validation for hyperparameter tuning (see the sketch after this list).
- Expand dataset size and apply anomaly detection for outliers.
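A sketch of the OOB-vs-CV comparison proposed above, assuming the training split from the preprocessing sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=150, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

# OOB accuracy is computed from the bootstrap samples each tree never saw,
# so it comes from a single fit; 10-fold CV refits the forest ten times.
print("OOB accuracy:", rf.oob_score_)
print("10-fold CV accuracy:", cross_val_score(rf, X_train, y_train, cv=10).mean())
```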
- Chang, V., Bailey, J., Xu, Q. A., et al. (2023). Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Computing and Applications, 35, 16157–16173. https://doi.org/10.1007/s00521-022-07049-z
- Full references in docs/presentation.pptx