This project demonstrates a beginner-friendly implementation of Logistic Regression to classify breast cancer tumors as benign or malignant, using the Breast Cancer Wisconsin dataset.
- Preprocess and clean the dataset (remove NaNs, encode labels)
- Perform Exploratory Data Analysis (EDA) with visual insights
- Train and evaluate a Logistic Regression classifier
- Tune decision threshold and explain sigmoid output
- Save all visuals and results for GitHub display
- Present the work in a clean, job-ready format
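The training and threshold-tuning goals above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: it assumes scikit-learn's bundled copy of the dataset in place of the Kaggle CSV, and the 80/20 split, `random_state=42`, the scaling step, and the 0.3 threshold are all hypothetical choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn copy stands in for the Kaggle CSV.
data = load_breast_cancer()
X = data.data
# The bundled copy encodes 0 = malignant, 1 = benign; flip it so that
# 1 = malignant, matching the encoding used in this README.
y = 1 - data.target

# Hypothetical split settings (not taken from the notebooks).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features, then fit the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# decision_function returns the linear score z = w.x + b for each sample;
# the sigmoid squashes z into a probability of the positive class (malignant).
z = model.decision_function(X_test)
proba = 1.0 / (1.0 + np.exp(-z))


def predict_with_threshold(probabilities, threshold=0.5):
    """Label a sample malignant (1) when its probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)


default_pred = predict_with_threshold(proba, 0.5)
# Hypothetical lower threshold: flags more tumors as malignant,
# trading some precision for higher recall.
cautious_pred = predict_with_threshold(proba, 0.3)
```

Lowering the threshold can only add positive predictions, never remove them, which is why it tends to raise recall at the cost of precision.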
breast-cancer-logistic-regression/
│
├── data/                 # Raw & cleaned dataset
├── images/               # Visuals (histograms, heatmaps, confusion matrix, ROC)
├── notebooks/
│   ├── 01_data_cleaning.ipynb
│   ├── 02_eda.ipynb
│   └── 03_logistic_model.ipynb
├── requirements.txt      # Required libraries
├── LICENSE               # MIT License
└── README.md             # This file
- Source: Kaggle – Breast Cancer Wisconsin Dataset
- Shape: 569 samples, 32 columns (an ID, the diagnosis label, and 30 numeric features)
- Target: diagnosis (0 = Benign, 1 = Malignant)
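A minimal sketch of the cleaning and label-encoding step, run on a tiny hand-made frame rather than the real file. The column names, the `"M"`/`"B"` codes, the empty `Unnamed: 32` column, and the `clean` helper are assumptions for illustration; the actual steps live in `01_data_cleaning.ipynb` and may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw CSV (values are illustrative only).
raw = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, np.nan, 20.57],
    "texture_mean": [10.38, 14.36, 15.71, 17.77],
    "Unnamed: 32": [np.nan, np.nan, np.nan, np.nan],
})


def clean(df):
    """Drop empty columns, the ID, and incomplete rows, then encode the label."""
    df = df.dropna(axis=1, how="all")   # remove columns that are entirely NaN
    df = df.drop(columns=["id"])        # the identifier carries no signal
    df = df.dropna(axis=0)              # remove rows with any remaining NaN
    # Encode the target as 0 = Benign, 1 = Malignant.
    df = df.assign(diagnosis=df["diagnosis"].map({"B": 0, "M": 1}))
    return df.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Dropping all-NaN columns before dropping rows matters here: done in the other order, a fully empty column would cause every row to be discarded.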
Task | Status |
---|---|
Data Cleaning & Preprocessing | ✅ |
Exploratory Data Analysis (EDA) | ✅ |
Logistic Regression Training | ✅ |
Evaluation (Accuracy, F1, AUC, etc) | ✅ |
Confusion Matrix + ROC Curve | ✅ |
Threshold tuning | ✅ |
All visuals saved in images/ | ✅ |
Metric | Score |
---|---|
Accuracy | 93.86% |
Precision | 0.97 |
Recall | 0.86 |
F1 Score | 0.91 |
ROC-AUC Score | 0.986 |
Confusion matrix and ROC curve visualizations are saved in the images/ folder.
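To show how the scores in the table relate to a confusion matrix, the sketch below computes them from raw counts. The counts are hypothetical: they are one combination that reproduces the rounded scores above on a 114-sample test set, and the project's actual confusion matrix may differ. ROC-AUC is omitted because it depends on the predicted probabilities, not on a single confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted malignant, how many truly are
    recall = tp / (tp + fn)      # of truly malignant, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts chosen to be consistent with the reported scores.
metrics = classification_metrics(tp=37, fp=1, fn=6, tn=70)

print(f"Accuracy:  {metrics['accuracy']:.2%}")   # 93.86%
print(f"Precision: {metrics['precision']:.2f}")  # 0.97
print(f"Recall:    {metrics['recall']:.2f}")     # 0.86
print(f"F1 Score:  {metrics['f1']:.2f}")         # 0.91
```

With these counts, the gap between precision and recall comes from the six missed malignant cases against a single false alarm, which is the trade-off that threshold tuning adjusts.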
- Jupyter Notebook – Interactive coding and documentation
- VS Code – Code editor used for development
- Git – Version control
- GitHub – Project hosting and portfolio building
- Kaggle – Dataset source
- Markdown – For clean documentation
- Python 3.10+ – Language used
- Python 3
- Pandas, NumPy
- Matplotlib, Seaborn
- scikit-learn
# Clone this repository
git clone https://github.com/your-username/breast-cancer-logistic-regression.git
cd breast-cancer-logistic-regression
# Install required libraries
pip install -r requirements.txt
# Run notebooks
jupyter lab