This repository demonstrates an end-to-end machine learning pipeline for binary classification on the Pima Indians Diabetes dataset. It trains both a tuned neural network and an XGBoost classifier, and covers data preprocessing, exploratory data analysis, outlier handling, and visualization.
Diabetes is a chronic condition with significant public health impact. The goal is to accurately classify whether a patient has diabetes based on medical measurements.
- Source: UCI Machine Learning Repository
- Observations: 768 patients
- Features:
  - Pregnancies
  - Glucose
  - BloodPressure
  - SkinThickness
  - Insulin
  - BMI
  - DiabetesPedigreeFunction
  - Age
- Target: Outcome (1 for diabetic, 0 for non-diabetic)
✔️ Exploratory data analysis with distribution plots and correlation matrix
✔️ Outlier detection using the IQR method
✔️ Missing/impossible values imputed with medians
✔️ Data standardization
✔️ Deep Learning (Neural Network with keras-tuner for optimization)
✔️ Tree-Based Model (XGBoost) for benchmarking
✔️ ROC/AUC comparisons and confusion matrix visualizations
✔️ Fully modular and reproducible codebase
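
The preprocessing steps listed above (median imputation of impossible values, IQR outlier detection, standardization) can be sketched on a single column as follows. This is a minimal illustration, not the repository's actual code: the function names, the toy values, the order of the steps, the conventional 1.5 × IQR fence, and the choice to clip flagged values are all assumptions.

```python
import numpy as np

# Toy Glucose-like column. The values are made up for illustration;
# 0.0 stands in for a physiologically impossible reading.
glucose = np.array([85.0, 0.0, 120.0, 99.0, 140.0, 110.0, 300.0, 95.0])


def impute_zeros_with_median(x):
    """Replace zero entries with the median of the non-zero entries."""
    x = np.asarray(x, dtype=float).copy()
    is_zero = x == 0
    x[is_zero] = np.median(x[~is_zero])
    return x


def iqr_bounds(x, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def standardize(x):
    """Rescale to zero mean and unit (population) standard deviation."""
    return (x - x.mean()) / x.std()


# Assumed order: impute first so that zeros do not distort the quartiles.
imputed = impute_zeros_with_median(glucose)

# Detection: flag values that fall outside the IQR fences.
lower, upper = iqr_bounds(imputed)
outlier_mask = (imputed < lower) | (imputed > upper)

# Handling (an assumption): clip flagged values to the fences.
clipped = np.clip(imputed, lower, upper)

scaled = standardize(clipped)
```

In this toy column only the reading of 300 falls outside the fences. On a real train/test split, the medians, fences, mean, and standard deviation would normally be computed on the training data only and then reused on the test data.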
| Model | Library | Notes |
|---|---|---|
| Neural Network | TensorFlow/Keras | Tuned using keras-tuner (RandomSearch) |
| XGBoost | XGBoost | Strong tree-based benchmark |
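
To make the ROC/AUC comparison between the two models concrete, here is a small sketch of how AUC can be computed from predicted scores. It uses the rank interpretation of AUC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one). The helper name `roc_auc` and the label and score arrays are hypothetical and are not results from this project; in practice `sklearn.metrics.roc_auc_score` serves the same purpose.

```python
import numpy as np


def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Pairwise differences between every positive and every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
nn_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
xgb_scores = [0.20, 0.30, 0.60, 0.90, 0.10, 0.50]

print(f"Neural network AUC: {roc_auc(y_true, nn_scores):.3f}")
print(f"XGBoost AUC:        {roc_auc(y_true, xgb_scores):.3f}")
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a dataset of 768 rows but not for large ones.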
git clone https://github.com/yourusername/pima-diabetes-prediction.git
cd pima-diabetes-prediction
pip install -r requirements.txt