
Customer churn prediction with Python using synthetic datasets. Includes data generation, feature engineering, and training with Logistic Regression, Random Forest, and Gradient Boosting. Improved pipeline applies hyperparameter tuning and threshold optimization to boost recall. Outputs metrics, reports, and charts.


AmirhosseinHonardoust/Customer-Churn-Prediction


Customer Churn Prediction

Predict customer churn on a synthetic dataset using Python. The pipeline includes data generation, feature engineering, model training (Logistic Regression, Random Forest, Gradient Boosting), hyperparameter search, class weighting, selection by PR-AUC, and decision-threshold tuning to balance precision and recall. Outputs metrics, reports, and visualizations.


Features

  • Synthetic customer dataset with realistic behavior signals
  • Models: Logistic Regression, Random Forest, Gradient Boosting
  • Hyperparameter optimization (RandomizedSearchCV) & class weighting
  • Model selection by PR-AUC (Average Precision)
  • Decision-threshold tuning that favors recall (F2 score), subject to a minimum-precision floor
  • Metrics: Accuracy, Precision, Recall, F1, ROC-AUC, PR-AUC
  • Visuals: ROC, Precision-Recall, Confusion Matrix, Feature Importance
  • Saved artifacts: best model (joblib) & metrics
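
The threshold-tuning step listed above can be sketched as follows. This is a minimal illustration, not the code in `src/train_models.py`: the function name `tune_threshold`, the default floor of 0.5, and the fallback used when no threshold meets the floor are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def tune_threshold(y_true, y_score, beta=2.0, min_precision=0.5):
    """Return (threshold, f_beta) maximizing F-beta subject to a precision floor.

    Hypothetical helper: if no threshold reaches `min_precision`, the floor is
    ignored and the unconstrained best threshold is returned instead.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The last (precision=1, recall=0) point has no matching threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Where denom is 0 the numerator is 0 too, so the score is 0.
    fbeta = (1 + b2) * precision * recall / np.maximum(denom, 1e-12)
    allowed = precision >= min_precision
    if not allowed.any():
        allowed = np.ones_like(allowed, dtype=bool)
    best = int(np.argmax(np.where(allowed, fbeta, -1.0)))
    return float(thresholds[best]), float(fbeta[best])


# Tiny hand-checkable example (toy scores, not project data).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
threshold, f2 = tune_threshold(y_true, y_score, beta=2.0, min_precision=0.5)
strict_threshold, strict_f2 = tune_threshold(y_true, y_score, min_precision=0.9)
```

With beta = 2, recall counts four times as much as precision in the score, so the chosen threshold tends to sit lower than an F1-optimal one. Raising the floor pushes it back up, as the second call shows.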

Project Structure

customer-churn-prediction/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ data/
│  └─ generate_customers.py
├─ src/
│  ├─ train_models.py
│  └─ utils.py
└─ outputs/
   └─ figures & reports

Setup

python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt

Generate Synthetic Data

python data/generate_customers.py --n 10000 --seed 42 --out data/customers.csv

Train & Evaluate

python src/train_models.py --input data/customers.csv --outdir outputs --test-size 0.2 --val-size 0.2 --seed 42
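
As a rough picture of what a run like this might do internally, the sketch below performs a stratified train/validation/test split and then a small randomized search scored by average precision (PR-AUC). It is an illustration under assumptions: the README does not say whether `--val-size` is a fraction of the full dataset or of the remainder (the former is assumed here), and `three_way_split`, the stand-in data from `make_classification`, and the search space for `C` are hypothetical.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def three_way_split(X, y, test_size=0.2, val_size=0.2, seed=42):
    """Stratified split; both sizes are assumed to be fractions of the full data."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Rescale so the validation set is val_size of the original data.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, stratify=y_rest, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Stand-in data with roughly 20% positives (not the project's dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)

# Class weighting plus a randomized search scored by average precision.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
search = RandomizedSearchCV(
    pipeline,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=5,
    scoring="average_precision",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

# Validation PR-AUC: the kind of score used to compare candidate models.
val_ap = average_precision_score(y_val, search.predict_proba(X_val)[:, 1])
```

Keeping the test split untouched until the end means the validation set can be used both for comparing models and for threshold tuning without leaking into the reported test metrics.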

Outputs

  • outputs/metrics.json – model choice, tuned threshold, test metrics
  • outputs/classification_report.txt
  • outputs/roc_curve.png
  • outputs/pr_curve.png
  • outputs/confusion_matrix.png
  • outputs/feature_importance.png
  • outputs/best_model.joblib
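
The saved artifacts could be reused for scoring along the lines below. This is a sketch under assumptions: the layout of `metrics.json` is not documented here, so the key name `"threshold"` is hypothetical, and the demo writes stand-in artifacts to a temporary directory rather than reading the real `outputs/` files.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression


def load_and_predict(model_path, metrics_path, X, threshold_key="threshold"):
    """Load a saved classifier and apply the stored decision threshold to X."""
    model = joblib.load(model_path)
    metrics = json.loads(Path(metrics_path).read_text())
    threshold = float(metrics[threshold_key])  # key name is an assumption
    proba = model.predict_proba(X)[:, 1]
    return (proba >= threshold).astype(int), proba


# Round-trip demo with stand-in artifacts (not the project's real outputs).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0.8).astype(int)
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.joblib"
    metrics_path = Path(tmp) / "metrics.json"
    joblib.dump(LogisticRegression().fit(X_demo, y_demo), model_path)
    metrics_path.write_text(json.dumps({"threshold": 0.3}))
    labels, proba = load_and_predict(model_path, metrics_path, X_demo)
```

Applying the stored threshold to `predict_proba` output, rather than calling `predict`, matters because `predict` uses a fixed 0.5 cut-off and would discard the tuning. If the saved model is a full pipeline, the input would need the same columns it was trained on.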

Final Results (Logistic Regression)

Key Metrics

Metric             Value
Accuracy           83.8%
ROC-AUC            0.823
PR-AUC (AP)        0.562
Recall (Churn)     0.50
Precision (Churn)  0.52

➡️ At the tuned threshold, the model now catches about 50% of churners, and about 52% of the customers it flags actually churn. This is a deliberate trade-off between recall and false positives.
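
For readers who want to relate these metrics to a confusion matrix, the sketch below computes them from raw counts. The counts are hypothetical and chosen only for easy arithmetic; they are not the project's confusion matrix, and `metrics_from_counts` is an illustrative helper.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for the positive (churn) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,  # share of flagged customers who really churn
        "recall": recall,        # share of real churners who get flagged
        "f1": f1,
    }


# Hypothetical counts: 100 churners, 400 non-churners.
example = metrics_from_counts(tp=50, fp=50, fn=50, tn=350)
```

Accuracy is dominated by the majority class when churners are a minority, which is why it can look high while recall and precision stay near one half.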


Confusion Matrix

[Figure: confusion_matrix]

ROC Curve

[Figure: roc_curve]

Precision-Recall Curve

[Figure: pr_curve]

Feature Importance

[Figure: feature_importance]

Data Schema

Column                    Description
customer_id               unique customer ID
age                       customer age
region                    {North, South, East, West}
tenure_months             months since signup
is_premium                premium plan (0/1)
monthly_spend             average monthly spend
avg_txn_value             average transaction value
txns_last_30d             transactions in last 30 days
days_since_last_purchase  recency (days)
customer_service_calls    support calls in last 90 days
discounts_used_90d        discounts used in last 90 days
complaints_90d            complaint count
churn                     target label (0/1)
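
A dataset with this schema could be generated roughly as follows. This is a sketch, not the logic of `data/generate_customers.py`: every distribution, range, and coefficient below is an assumption chosen for illustration, and only the column names come from the schema above.

```python
import numpy as np
import pandas as pd


def generate_customers(n=1000, seed=42):
    """Generate a synthetic customer table matching the schema (illustrative only)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "customer_id": np.arange(1, n + 1),
        "age": rng.integers(18, 80, size=n),
        "region": rng.choice(["North", "South", "East", "West"], size=n),
        "tenure_months": rng.integers(1, 73, size=n),
        "is_premium": rng.binomial(1, 0.3, size=n),
        "monthly_spend": rng.gamma(shape=2.0, scale=40.0, size=n).round(2),
        "avg_txn_value": rng.gamma(shape=2.0, scale=15.0, size=n).round(2),
        "txns_last_30d": rng.poisson(4, size=n),
        "days_since_last_purchase": rng.integers(0, 120, size=n),
        "customer_service_calls": rng.poisson(1.0, size=n),
        "discounts_used_90d": rng.poisson(1.5, size=n),
        "complaints_90d": rng.poisson(0.3, size=n),
    })
    # Assumed churn rule: risk rises with recency, support calls, and complaints,
    # and falls with tenure and premium status.
    logit = (
        -2.0
        + 0.02 * df["days_since_last_purchase"]
        + 0.30 * df["customer_service_calls"]
        + 0.60 * df["complaints_90d"]
        - 0.02 * df["tenure_months"]
        - 0.50 * df["is_premium"]
    ).to_numpy()
    churn_prob = 1.0 / (1.0 + np.exp(-logit))
    df["churn"] = rng.binomial(1, churn_prob)
    return df


customers = generate_customers(n=1000, seed=42)
```

Drawing the label from a probability rather than a hard rule keeps the classes overlapping, so no model can reach perfect separation and the precision/recall trade-off stays meaningful.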
