Predict customer churn on a synthetic dataset using Python. The pipeline covers data generation, feature engineering, model training (Logistic Regression, Random Forest, Gradient Boosting), hyperparameter search, class weighting, model selection by PR-AUC, and decision-threshold tuning to balance precision and recall. It writes metrics, reports, and visualizations to outputs/.
- Synthetic customer dataset with realistic behavior signals
- Models: Logistic Regression, Random Forest, Gradient Boosting
- Hyperparameter optimization (RandomizedSearchCV) & class weighting
- Model selection by PR-AUC (Average Precision)
- Decision-threshold tuning focused on F2 (recall-weighted), subject to a precision floor
- Metrics: Accuracy, Precision, Recall, F1, ROC-AUC, PR-AUC
- Visuals: ROC, Precision-Recall, Confusion Matrix, Feature Importance
- Saved artifacts: best model (joblib) & metrics
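Selecting a model by PR-AUC can be illustrated with a small standard-library sketch. It is not the code in src/train_models.py: the function names, the toy labels, and the toy scores are all hypothetical. The hand-rolled average_precision below assumes there are no tied scores; scikit-learn's average_precision_score computes the same quantity and handles ties.

```python
from typing import Dict, Sequence, Tuple


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    Assumes binary labels (1 = churn) and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank examples from highest to lowest predicted churn score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Each positive raises recall by 1/n_pos; weight it by precision here.
            ap += (tp / rank) / n_pos
    return ap


def select_by_pr_auc(
    y_val: Sequence[int], model_scores: Dict[str, Sequence[float]]
) -> Tuple[str, Dict[str, float]]:
    """Return the model name with the highest validation average precision."""
    ap_by_model = {
        name: average_precision(y_val, scores)
        for name, scores in model_scores.items()
    }
    best_name = max(ap_by_model, key=ap_by_model.get)
    return best_name, ap_by_model


# Hypothetical validation labels and predicted churn probabilities.
y_val = [1, 0, 1, 0, 0, 1, 0, 0]
candidate_scores = {
    "logreg": [0.9, 0.6, 0.4, 0.5, 0.2, 0.7, 0.3, 0.1],
    "random_forest": [0.8, 0.3, 0.7, 0.4, 0.2, 0.9, 0.1, 0.05],
}

best_name, ap_by_model = select_by_pr_auc(y_val, candidate_scores)
print(best_name, ap_by_model)
```

PR-AUC is a reasonable selection metric when churners are the minority class, because it depends only on how well the positives are ranked and is not inflated by the many true negatives.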
customer-churn-prediction/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ data/
│  └─ generate_customers.py
├─ src/
│  ├─ train_models.py
│  └─ utils.py
└─ outputs/
   └─ figures & reports
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
python data/generate_customers.py --n 10000 --seed 42 --out data/customers.csv
python src/train_models.py --input data/customers.csv --outdir outputs --test-size 0.2 --val-size 0.2 --seed 42

Outputs
- outputs/metrics.json: model choice, tuned threshold, test metrics
- outputs/classification_report.txt
- outputs/roc_curve.png
- outputs/pr_curve.png
- outputs/confusion_matrix.png
- outputs/feature_importance.png
- outputs/best_model.joblib
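After a run, the metrics file can be inspected with a few lines of Python. The sketch below makes no assumption about the key names inside outputs/metrics.json beyond the top level being a JSON object; the keys written to the temporary demo file are hypothetical stand-ins, and the helper names are illustrative.

```python
import json
import tempfile
from pathlib import Path


def flatten_metrics(metrics, prefix=""):
    """Yield (dotted_key, value) pairs from arbitrarily nested JSON metrics."""
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten_metrics(value, prefix=f"{name}.")
        else:
            yield name, value


def load_metrics(path):
    """Read a metrics JSON file and return a flat {dotted_key: value} dict."""
    with open(path, encoding="utf-8") as handle:
        return dict(flatten_metrics(json.load(handle)))


# Demo on a temporary file; these key names are hypothetical, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "metrics.json"
    demo_path.write_text(
        json.dumps({"model": "example", "test": {"roc_auc": 0.823, "pr_auc": 0.562}}),
        encoding="utf-8",
    )
    flat = load_metrics(demo_path)

for key, value in flat.items():
    print(f"{key}: {value}")
```

Pointing load_metrics at outputs/metrics.json prints whatever the training script actually stored, including the chosen model and the tuned threshold.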
| Metric | Value | 
|---|---|
| Accuracy | 83.8% | 
| ROC-AUC | 0.823 | 
| PR-AUC (AP) | 0.562 | 
| Recall (Churn) | 0.50 | 
| Precision (Churn) | 0.52 | 
➡️ The model catches about 50% of churners (recall 0.50) at a precision of about 0.52, so roughly half of the customers it flags actually churn.
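The precision/recall trade-off above comes from where the decision threshold is set. A minimal sketch of F2-focused threshold tuning with a precision floor follows. It is an illustration rather than the project's implementation: the function names, the floor values, and the toy data are assumptions.

```python
from typing import Optional, Sequence, Tuple


def fbeta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta=2 weights recall more heavily than precision."""
    denom = beta**2 * precision + recall
    return 0.0 if denom == 0 else (1 + beta**2) * precision * recall / denom


def tune_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    beta: float = 2.0,
    min_precision: float = 0.5,  # hypothetical floor, not the project's value
) -> Optional[Tuple[float, float]]:
    """Return (threshold, f_beta) maximizing F-beta among thresholds whose
    precision is at least min_precision, or None if no threshold qualifies."""
    n_pos = sum(y_true)
    best = None
    # Every distinct score is a candidate threshold (predict churn if score >= t).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 0)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / n_pos
        if precision < min_precision:
            continue  # enforce the precision floor
        score = fbeta(precision, recall, beta)
        if best is None or score > best[1]:
            best = (t, score)
    return best


# Hypothetical validation labels and predicted churn probabilities.
y_val = [0, 0, 1, 0, 1, 1, 0, 1]
p_val = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.65, 0.9]

loose = tune_threshold(y_val, p_val, min_precision=0.5)
strict = tune_threshold(y_val, p_val, min_precision=0.7)
print(loose, strict)
```

Raising the floor pushes the chosen threshold up: on this toy data the threshold moves from 0.35 to 0.55, trading recall for fewer false positives.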
 
 
 
 
| column | description | 
|---|---|
| customer_id | unique customer ID | 
| age | customer age | 
| region | {North, South, East, West} | 
| tenure_months | months since signup | 
| is_premium | premium plan (0/1) | 
| monthly_spend | average monthly spend | 
| avg_txn_value | average transaction value | 
| txns_last_30d | transactions in last 30 days | 
| days_since_last_purchase | recency (days) | 
| customer_service_calls | support calls in last 90 days | 
| discounts_used_90d | discounts used in last 90 days | 
| complaints_90d | complaint count | 
| churn | target label (0/1) |
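The schema above can be checked before training with a short standard-library sketch. The column names come from the table; everything else is an assumption for illustration, including the helper names, the sample rows, and the idea that churn is stored as the text 0 or 1.

```python
import csv
import io

# Column names taken from the schema table above.
EXPECTED_COLUMNS = [
    "customer_id", "age", "region", "tenure_months", "is_premium",
    "monthly_spend", "avg_txn_value", "txns_last_30d",
    "days_since_last_purchase", "customer_service_calls",
    "discounts_used_90d", "complaints_90d", "churn",
]
VALID_REGIONS = {"North", "South", "East", "West"}


def load_customers(handle):
    """Read customer rows from an open CSV file and validate the schema."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        if row["region"] not in VALID_REGIONS:
            raise ValueError(f"line {line_no}: unexpected region {row['region']!r}")
        if row["churn"] not in {"0", "1"}:  # assumes churn is written as 0/1
            raise ValueError(f"line {line_no}: churn must be 0 or 1")
    return rows


def churn_rate(rows):
    """Fraction of customers labeled as churned."""
    return sum(int(row["churn"]) for row in rows) / len(rows)


# Two made-up rows standing in for data/customers.csv.
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "C0001,34,North,12,1,59.90,19.97,3,8,0,1,0,0\n"
    "C0002,51,West,3,0,22.50,22.50,1,41,4,0,2,1\n"
)
customers = load_customers(sample_csv)
print(len(customers), churn_rate(customers))
```

For the real file, pass open("data/customers.csv", newline="", encoding="utf-8") to load_customers. The resulting churn rate indicates how imbalanced the classes are, which is what motivates class weighting and PR-AUC.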