A comprehensive machine learning project for predicting customer churn in telecommunications using exploratory data analysis and multiple classification algorithms.
This project analyzes customer churn patterns in a telecommunications dataset and builds predictive models to identify customers likely to leave the service. The analysis includes extensive exploratory data analysis (EDA), feature engineering, and comparison of multiple machine learning algorithms.
```
churn_prediction/
├── churn.csv          # Customer data with 21 features and churn labels
├── EDA.ipynb          # Exploratory Data Analysis notebook
├── experiment.ipynb   # Machine learning experiments notebook
└── README.md          # Project documentation
```
The dataset contains 7,043 customer records with the following features:
- `customerID`: Unique customer identifier
- `gender`: Customer gender (Male/Female)
- `SeniorCitizen`: Whether the customer is 65+ years old (0/1)
- `Partner`: Whether the customer has a partner (Yes/No)
- `Dependents`: Whether the customer has dependents (Yes/No)
- `tenure`: Number of months the customer has stayed
- `PhoneService`: Whether the customer has phone service
- `MultipleLines`: Whether the customer has multiple lines
- `InternetService`: Type of internet service (DSL/Fiber optic/No)
- `OnlineSecurity`: Whether the customer has online security
- `OnlineBackup`: Whether the customer has online backup
- `DeviceProtection`: Whether the customer has device protection
- `TechSupport`: Whether the customer has tech support
- `StreamingTV`: Whether the customer has streaming TV
- `StreamingMovies`: Whether the customer has streaming movies
- `Contract`: Contract term (Month-to-month/One year/Two year)
- `PaperlessBilling`: Whether the customer has paperless billing
- `PaymentMethod`: Payment method used
- `MonthlyCharges`: Monthly charges amount
- `TotalCharges`: Total charges amount
- `Churn`: Whether the customer churned (Yes/No)
- Missing Values: 11 missing values in the `TotalCharges` column
- Data Type Issues: `TotalCharges` stored as object instead of numeric
- Feature Corrections: `SeniorCitizen` converted from binary (0/1) to categorical
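The cleaning steps above might look like the following sketch. A tiny toy frame stands in for `churn.csv`, and treating the missing `TotalCharges` entries as blank strings is an assumption about how they appear in the raw file:

```python
import pandas as pd

# Toy stand-in for churn.csv; column names match the dataset schema.
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5"],  # object dtype, blank = missing
    "SeniorCitizen": [0, 1, 0],
})

# TotalCharges: coerce non-numeric entries to NaN, then impute with the median.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# SeniorCitizen: recode the 0/1 flag as a categorical label.
df["SeniorCitizen"] = df["SeniorCitizen"].map({0: "No", 1: "Yes"})
```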
- Imbalanced Dataset: Approximately 73% retention vs. 27% churn
- High-Risk Segments:
- Customers without partners or dependents (higher churn)
- Fiber optic internet users (>50% churn rate)
  - Month-to-month contract customers (88.6% of churned customers)
- Electronic check payment users (57.3% of churned customers)
- Customers without online services (>50% churn rate)
- Low-Risk Segments:
- Long-term contract customers (1-year, 2-year)
- Customers with multiple online services
- DSL internet users
Experiment 1: Baseline Models

Approach: Basic data encoding with standard preprocessing

Results:
- Logistic Regression: 0.8614 ROC-AUC
- Random Forest: 0.8461 ROC-AUC
- XGBoost: 0.8358 ROC-AUC
- SVM: 0.8256 ROC-AUC
- Decision Tree: 0.6841 ROC-AUC
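A comparison along these lines can be sketched with cross-validated ROC-AUC. Synthetic data stands in for the encoded churn features here, and only three of the five models are shown to keep the sketch short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded churn features (~73/27 class split).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.73], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Mean 5-fold cross-validated ROC-AUC per model.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```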
Experiment 2: Feature Engineering

New Features Added:
- `tenure_bin`: Tenure categorized into meaningful periods
- `NumServicesUsed`: Count of additional services used
- `HasInternet`: Boolean internet service indicator
- `IsHighRiskPayment`: High-risk payment/contract combination
Results: Slight improvements for Logistic Regression and XGBoost
- Logistic Regression: 0.8620 ROC-AUC (+0.0006)
- XGBoost: 0.8388 ROC-AUC (+0.0030)
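The derived features could be constructed roughly as follows. The toy rows and the exact tenure bin edges are illustrative assumptions; the column names follow the dataset schema:

```python
import pandas as pd

# Toy rows with the raw columns the derived features depend on.
df = pd.DataFrame({
    "tenure": [1, 30, 70],
    "OnlineSecurity": ["Yes", "No", "Yes"],
    "OnlineBackup": ["No", "No", "Yes"],
    "DeviceProtection": ["No", "Yes", "Yes"],
    "TechSupport": ["No", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Credit card (automatic)"],
})

# tenure_bin: coarse tenure periods (bin edges are an assumption).
df["tenure_bin"] = pd.cut(df["tenure"], bins=[0, 12, 24, 48, 72],
                          labels=["0-1y", "1-2y", "2-4y", "4-6y"])

# NumServicesUsed: count of additional online services.
service_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport"]
df["NumServicesUsed"] = (df[service_cols] == "Yes").sum(axis=1)

# HasInternet: boolean internet-service indicator.
df["HasInternet"] = df["InternetService"] != "No"

# IsHighRiskPayment: month-to-month contract paid by electronic check.
df["IsHighRiskPayment"] = ((df["Contract"] == "Month-to-month") &
                           (df["PaymentMethod"] == "Electronic check"))
```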
Experiment 3: SMOTE Oversampling

Approach: Applied SMOTE oversampling to address class imbalance
Results: Mixed improvements
- SVM: 0.8505 ROC-AUC (+0.0249)
- Decision Tree: 0.7323 ROC-AUC (+0.0482)
- XGBoost: 0.8447 ROC-AUC (+0.0089)
Logistic Regression consistently performed best across all experiments:
- Best ROC-AUC: 0.8620 (Experiment 2 with feature engineering)
- Strengths: Stable performance, interpretable results
- Model Choice: Recommended for production due to reliability and interpretability
```
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
imbalanced-learn
```
- Data Type Conversion: Convert `TotalCharges` to numeric, `SeniorCitizen` to categorical
- Missing Value Handling: Median imputation for numerical features, most-frequent imputation for categorical features
- Feature Scaling: StandardScaler for numerical features
- Encoding: OneHotEncoder for categorical features
- Feature Engineering: Create derived features for better prediction
- Train-Test Split: 80-20 split with stratification
- Cross-Validation: Used for model selection and hyperparameter tuning
- Evaluation Metrics: ROC-AUC as primary metric due to class imbalance
- Model Comparison: Systematic comparison across multiple algorithms
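The preprocessing and evaluation steps above can be combined into a single scikit-learn pipeline. The sketch below uses a small synthetic stand-in for the churn frame and a reduced column set; real code would load `churn.csv` and list all numerical and categorical columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn frame (hypothetical values).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
})
y = (rng.random(n) < 0.27).astype(int)  # ~27% churn

num_cols = ["tenure", "MonthlyCharges"]
cat_cols = ["Contract"]

# Median imputation + scaling for numerics; most-frequent + one-hot for categoricals.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])

# 80-20 stratified split, ROC-AUC as the primary metric.
X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2,
                                          stratify=y, random_state=42)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Keeping imputation, scaling, and encoding inside the pipeline ensures they are fit on the training fold only, which also makes the cross-validation scores honest.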
- Retention Strategy: Focus on month-to-month contract customers
- Service Bundling: Promote online security and backup services
- Payment Method: Encourage automatic payment methods over electronic checks
- Fiber Optic Issues: Investigate and address fiber optic service quality
- Customer Segmentation: Develop targeted campaigns for high-risk segments
The strongest churn indicators identified in the analysis:

- Contract type (Month-to-month highest risk)
- Internet service type (Fiber optic highest risk)
- Payment method (Electronic check highest risk)
- Lack of additional services
- Customer relationship status (No partner/dependents)
- Run EDA: Open `EDA.ipynb` to explore data patterns and insights
- Model Training: Use `experiment.ipynb` to train and compare models
- Prediction: Apply the best model (Logistic Regression with feature engineering) for new predictions
Potential future improvements:

- Hyperparameter Tuning: Grid search for optimal parameters
- Advanced Feature Engineering: Create interaction features, polynomial features
- Ensemble Methods: Combine multiple models for better performance
- Time Series Analysis: Analyze churn patterns over time
- Customer Lifetime Value: Incorporate CLV into churn prediction
Feel free to contribute by:
- Adding new feature engineering techniques
- Implementing advanced models
- Improving data visualization
- Adding more comprehensive evaluation metrics
Author: Data Science Project
Last Updated: July 2025