CuongWao123/customer_churn_prediction

Customer Churn Prediction

A comprehensive machine learning project for predicting customer churn in telecommunications using exploratory data analysis and multiple classification algorithms.

📊 Project Overview

This project analyzes customer churn patterns in a telecommunications dataset and builds predictive models to identify customers likely to leave the service. The analysis includes extensive exploratory data analysis (EDA), feature engineering, and comparison of multiple machine learning algorithms.

🗂️ Project Structure

churn_prediction/
├── churn.csv              # Customer data: 21 columns, including the Churn label
├── EDA.ipynb             # Exploratory Data Analysis notebook
├── experiment.ipynb      # Machine learning experiments notebook
└── README.md            # Project documentation

📈 Dataset Description

The dataset contains 7,043 customer records with the following features:

Customer Demographics

  • customerID: Unique customer identifier
  • gender: Customer gender (Male/Female)
  • SeniorCitizen: Whether customer is 65+ years old (0/1)
  • Partner: Whether customer has a partner (Yes/No)
  • Dependents: Whether customer has dependents (Yes/No)

Service Information

  • tenure: Number of months customer has stayed
  • PhoneService: Whether customer has phone service
  • MultipleLines: Whether customer has multiple lines
  • InternetService: Type of internet service (DSL/Fiber optic/No)
  • OnlineSecurity: Whether customer has online security
  • OnlineBackup: Whether customer has online backup
  • DeviceProtection: Whether customer has device protection
  • TechSupport: Whether customer has tech support
  • StreamingTV: Whether customer has streaming TV
  • StreamingMovies: Whether customer has streaming movies

Account Information

  • Contract: Contract term (Month-to-month/One year/Two year)
  • PaperlessBilling: Whether customer has paperless billing
  • PaymentMethod: Payment method used
  • MonthlyCharges: Monthly charges amount
  • TotalCharges: Total charges amount

Target Variable

  • Churn: Whether customer churned (Yes/No)

🔍 Key Findings from EDA

Data Quality Issues

  • Missing Values: 11 missing values in TotalCharges column
  • Data Type Issues: TotalCharges stored as object instead of numeric
  • Feature Corrections: SeniorCitizen converted from binary to categorical
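
The cleanup steps above can be sketched as follows. This is a minimal example on a toy frame standing in for churn.csv; the blank string mimics the rows behind the 11 missing TotalCharges values (in the real data these are brand-new customers with tenure 0):

```python
import pandas as pd

# Toy frame mimicking the churn data: TotalCharges is read as strings,
# and some rows contain only whitespace instead of a number.
df = pd.DataFrame({
    "tenure": [1, 34, 0],
    "TotalCharges": ["29.85", "1889.5", " "],
    "SeniorCitizen": [0, 1, 0],
})

# Coerce TotalCharges to numeric; unparseable entries (the blanks) become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Treat SeniorCitizen as a categorical label rather than a number.
df["SeniorCitizen"] = df["SeniorCitizen"].map({0: "No", 1: "Yes"})

print(df["TotalCharges"].isna().sum())  # rows that will need imputation
```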

Churn Patterns

  1. Imbalanced Dataset: Approximately 73% retention vs 27% churn

  2. High-Risk Segments:

    • Customers without partners or dependents (higher churn)
    • Fiber optic internet users (>50% churn rate)
    • Month-to-month contract customers (88.6% of churned customers)
    • Electronic check payment users (57.3% of churned customers)
    • Customers without online services (>50% churn rate)
  3. Low-Risk Segments:

    • Long-term contract customers (1-year, 2-year)
    • Customers with multiple online services
    • DSL internet users

🚀 Machine Learning Experiments

Experiment 1: Baseline Models

Approach: Basic data encoding with standard preprocessing

  • Logistic Regression: 0.8614 ROC-AUC
  • Random Forest: 0.8461 ROC-AUC
  • XGBoost: 0.8358 ROC-AUC
  • SVM: 0.8256 ROC-AUC
  • Decision Tree: 0.6841 ROC-AUC
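
A baseline comparison like the one above can be reproduced with a short scikit-learn loop. The snippet below is a sketch using synthetic data in place of the encoded churn features (with the same ~73/27 class split); the real notebook runs on the preprocessed churn.csv and includes the full set of five models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded churn features, ~73% / 27% classes.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.73, 0.27], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC-AUC needs class probabilities, not hard labels.
    results[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: {results[name]:.4f} ROC-AUC")
```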

Experiment 2: Feature Engineering

New Features Added:

  • tenure_bin: Categorized tenure into meaningful periods
  • NumServicesUsed: Count of additional services
  • HasInternet: Boolean internet service indicator
  • IsHighRiskPayment: High-risk payment/contract combination

Results: Slight improvements for Logistic Regression and XGBoost

  • Logistic Regression: 0.8620 ROC-AUC (+0.0006)
  • XGBoost: 0.8388 ROC-AUC (+0.0030)
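
The four derived features can be built with plain pandas operations. A sketch on a toy frame follows; the tenure cut points and the exact service columns counted are assumptions for illustration, not necessarily the notebook's choices:

```python
import pandas as pd

df = pd.DataFrame({
    "tenure": [2, 15, 40, 70],
    "OnlineSecurity": ["Yes", "No", "Yes", "No internet service"],
    "OnlineBackup": ["No", "Yes", "Yes", "No internet service"],
    "TechSupport": ["No", "No", "Yes", "No internet service"],
    "InternetService": ["DSL", "Fiber optic", "DSL", "No"],
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Credit card (automatic)", "Electronic check"],
})

# tenure_bin: bucket tenure into lifecycle stages (cut points assumed).
df["tenure_bin"] = pd.cut(df["tenure"], bins=[0, 12, 24, 48, 72],
                          labels=["0-12", "13-24", "25-48", "49-72"])

# NumServicesUsed: how many add-on services the customer subscribes to.
service_cols = ["OnlineSecurity", "OnlineBackup", "TechSupport"]
df["NumServicesUsed"] = (df[service_cols] == "Yes").sum(axis=1)

# HasInternet: boolean indicator for any internet service.
df["HasInternet"] = df["InternetService"] != "No"

# IsHighRiskPayment: month-to-month contract paid by electronic check,
# the combination flagged as highest risk in the EDA.
df["IsHighRiskPayment"] = ((df["Contract"] == "Month-to-month")
                           & (df["PaymentMethod"] == "Electronic check"))
```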

Experiment 3: SMOTE for Class Imbalance

Approach: Applied SMOTE oversampling to address class imbalance

Results: Mixed, with the largest gains for the weaker baselines

  • SVM: 0.8505 ROC-AUC (+0.0249)
  • Decision Tree: 0.7323 ROC-AUC (+0.0482)
  • XGBoost: 0.8447 ROC-AUC (+0.0089)

🏆 Best Model Performance

Logistic Regression consistently performed best across all experiments:

  • Best ROC-AUC: 0.8620 (Experiment 2 with feature engineering)
  • Strengths: Stable performance, interpretable results
  • Model Choice: Recommended for production due to reliability and interpretability

🛠️ Technical Implementation

Prerequisites

pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
imbalanced-learn

Data Preprocessing Pipeline

  1. Data Type Conversion: Convert TotalCharges to numeric, SeniorCitizen to categorical
  2. Missing Value Handling: Median imputation for numerical, most frequent for categorical
  3. Feature Scaling: StandardScaler for numerical features
  4. Encoding: OneHotEncoder for categorical features
  5. Feature Engineering: Create derived features for better prediction
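
Steps 1–4 map naturally onto a scikit-learn ColumnTransformer. The sketch below uses an illustrative subset of columns (the real notebook handles all feature columns) and a tiny frame with a missing value to exercise the imputers:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column subsets; the full dataset has more of each kind.
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = ["Contract", "InternetService"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),        # step 2: median imputation
    ("scale", StandardScaler()),                         # step 3: feature scaling
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")), # step 2: mode imputation
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # step 4: one-hot encoding
])
preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])

# Tiny frame with one missing TotalCharges value.
sample = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": [29.85, np.nan, 108.15],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "InternetService": ["DSL", "DSL", "Fiber optic"],
})
X = preprocess.fit_transform(sample)
print(X.shape)  # 3 scaled numeric columns + one-hot categorical columns
```

Wrapping preprocessing in a pipeline keeps the imputation and scaling statistics fitted on training data only, avoiding leakage into the test set.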

Model Training Process

  1. Train-Test Split: 80-20 split with stratification
  2. Cross-Validation: Used for model selection and hyperparameter tuning
  3. Evaluation Metrics: ROC-AUC as primary metric due to class imbalance
  4. Model Comparison: Systematic comparison across multiple algorithms
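
The split-and-validate procedure above can be sketched as follows, again on synthetic data with the churn class ratio; stratification keeps the ~73/27 split in both halves, and cross-validated ROC-AUC on the training portion drives model selection:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=1000, weights=[0.73, 0.27],
                           random_state=0)

# Step 1: 80-20 stratified split preserves the class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Steps 2-3: cross-validated ROC-AUC on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_tr, y_tr,
                         cv=cv, scoring="roc_auc")
print(f"mean CV ROC-AUC: {scores.mean():.4f}")
```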

📊 Business Insights

Actionable Recommendations

  1. Retention Strategy: Focus on month-to-month contract customers
  2. Service Bundling: Promote online security and backup services
  3. Payment Method: Encourage automatic payment methods over electronic checks
  4. Fiber Optic Issues: Investigate and address fiber optic service quality
  5. Customer Segmentation: Develop targeted campaigns for high-risk segments

Risk Factors (In Order of Importance)

  1. Contract type (Month-to-month highest risk)
  2. Internet service type (Fiber optic highest risk)
  3. Payment method (Electronic check highest risk)
  4. Lack of additional services
  5. Customer relationship status (No partner/dependents)

🚀 Usage

  1. Run EDA: Open EDA.ipynb to explore data patterns and insights
  2. Model Training: Use experiment.ipynb to train and compare models
  3. Prediction: Apply the best model (Logistic Regression with feature engineering) for new predictions

📝 Future Improvements

  • Hyperparameter Tuning: Grid search for optimal parameters
  • Advanced Feature Engineering: Create interaction features, polynomial features
  • Ensemble Methods: Combine multiple models for better performance
  • Time Series Analysis: Analyze churn patterns over time
  • Customer Lifetime Value: Incorporate CLV into churn prediction

🤝 Contributing

Feel free to contribute by:

  • Adding new feature engineering techniques
  • Implementing advanced models
  • Improving data visualization
  • Adding more comprehensive evaluation metrics

Author: Data Science Project
Last Updated: July 2025
