A comprehensive machine learning project for predicting customer churn in telecommunications using exploratory data analysis and multiple classification algorithms.
This project analyzes customer churn patterns in a telecommunications dataset and builds predictive models to identify customers likely to leave the service. The analysis includes extensive exploratory data analysis (EDA), feature engineering, and comparison of multiple machine learning algorithms.
```
churn_prediction/
├── churn.csv          # Customer data with 21 features and churn labels
├── EDA.ipynb          # Exploratory Data Analysis notebook
├── experiment.ipynb   # Machine learning experiments notebook
└── README.md          # Project documentation
```
The dataset contains 7,043 customer records with the following features:
- `customerID`: Unique customer identifier
- `gender`: Customer gender (Male/Female)
- `SeniorCitizen`: Whether the customer is 65+ years old (0/1)
- `Partner`: Whether the customer has a partner (Yes/No)
- `Dependents`: Whether the customer has dependents (Yes/No)
- `tenure`: Number of months the customer has stayed
- `PhoneService`: Whether the customer has phone service
- `MultipleLines`: Whether the customer has multiple lines
- `InternetService`: Type of internet service (DSL/Fiber optic/No)
- `OnlineSecurity`: Whether the customer has online security
- `OnlineBackup`: Whether the customer has online backup
- `DeviceProtection`: Whether the customer has device protection
- `TechSupport`: Whether the customer has tech support
- `StreamingTV`: Whether the customer has streaming TV
- `StreamingMovies`: Whether the customer has streaming movies
- `Contract`: Contract term (Month-to-month/One year/Two year)
- `PaperlessBilling`: Whether the customer has paperless billing
- `PaymentMethod`: Payment method used
- `MonthlyCharges`: Monthly charges amount
- `TotalCharges`: Total charges amount
- `Churn`: Whether the customer churned (Yes/No)
- Missing Values: 11 missing values in the `TotalCharges` column
- Data Type Issues: `TotalCharges` stored as object instead of numeric
- Feature Corrections: `SeniorCitizen` converted from binary (0/1) to categorical
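The cleaning steps above might look like the following sketch. A tiny toy frame stands in for `churn.csv`, and treating the missing `TotalCharges` entries as blank strings is an assumption about how they appear in the raw file:

```python
import pandas as pd

# Toy stand-in for churn.csv; column names match the dataset schema.
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5"],  # object dtype, blank = missing
    "SeniorCitizen": [0, 1, 0],
})

# TotalCharges: coerce non-numeric entries to NaN, then impute with the median.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# SeniorCitizen: recode the 0/1 flag as a categorical label.
df["SeniorCitizen"] = df["SeniorCitizen"].map({0: "No", 1: "Yes"})
```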
- Imbalanced Dataset: Approximately 73% retention vs. 27% churn
- High-Risk Segments:
- Customers without partners or dependents (higher churn)
- Fiber optic internet users (>50% churn rate)
  - Month-to-month contract customers (88.6% of churned customers)
- Electronic check payment users (57.3% of churned customers)
- Customers without online services (>50% churn rate)
- Low-Risk Segments:
- Long-term contract customers (1-year, 2-year)
- Customers with multiple online services
- DSL internet users
Experiment 1: Baseline Models

Approach: Basic data encoding with standard preprocessing

Results:
- Logistic Regression: 0.8614 ROC-AUC
- Random Forest: 0.8461 ROC-AUC
- XGBoost: 0.8358 ROC-AUC
- SVM: 0.8256 ROC-AUC
- Decision Tree: 0.6841 ROC-AUC
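A comparison along these lines can be sketched with cross-validated ROC-AUC. Synthetic data stands in for the encoded churn features here, and only three of the five models are shown to keep the sketch short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded churn features (~73/27 class split).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.73], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Mean 5-fold cross-validated ROC-AUC per model.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```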
Experiment 2: Feature Engineering

New Features Added:
- `tenure_bin`: Tenure categorized into meaningful periods
- `NumServicesUsed`: Count of additional services used
- `HasInternet`: Boolean internet service indicator
- `IsHighRiskPayment`: High-risk payment/contract combination
Results: Slight improvements for Logistic Regression and XGBoost
- Logistic Regression: 0.8620 ROC-AUC (+0.0006)
- XGBoost: 0.8388 ROC-AUC (+0.0030)
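The derived features could be constructed roughly as follows. The toy rows and the exact tenure bin edges are illustrative assumptions; the column names follow the dataset schema:

```python
import pandas as pd

# Toy rows with the raw columns the derived features depend on.
df = pd.DataFrame({
    "tenure": [1, 30, 70],
    "OnlineSecurity": ["Yes", "No", "Yes"],
    "OnlineBackup": ["No", "No", "Yes"],
    "DeviceProtection": ["No", "Yes", "Yes"],
    "TechSupport": ["No", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Credit card (automatic)"],
})

# tenure_bin: coarse tenure periods (bin edges are an assumption).
df["tenure_bin"] = pd.cut(df["tenure"], bins=[0, 12, 24, 48, 72],
                          labels=["0-1y", "1-2y", "2-4y", "4-6y"])

# NumServicesUsed: count of additional online services.
service_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport"]
df["NumServicesUsed"] = (df[service_cols] == "Yes").sum(axis=1)

# HasInternet: boolean internet-service indicator.
df["HasInternet"] = df["InternetService"] != "No"

# IsHighRiskPayment: month-to-month contract paid by electronic check.
df["IsHighRiskPayment"] = ((df["Contract"] == "Month-to-month") &
                           (df["PaymentMethod"] == "Electronic check"))
```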
Experiment 3: SMOTE Oversampling

Approach: Applied SMOTE oversampling to address class imbalance
Results: Mixed improvements
- SVM: 0.8505 ROC-AUC (+0.0249)
- Decision Tree: 0.7323 ROC-AUC (+0.0482)
- XGBoost: 0.8447 ROC-AUC (+0.0089)
Logistic Regression consistently performed best across all experiments:
- Best ROC-AUC: 0.8620 (Experiment 2 with feature engineering)
- Strengths: Stable performance, interpretable results
- Model Choice: Recommended for production due to reliability and interpretability
```
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
imbalanced-learn
```
- Data Type Conversion: Convert `TotalCharges` to numeric, `SeniorCitizen` to categorical
- Missing Value Handling: Median imputation for numerical features, most-frequent imputation for categorical features
- Feature Scaling: StandardScaler for numerical features
- Encoding: OneHotEncoder for categorical features
- Feature Engineering: Create derived features for better prediction
- Train-Test Split: 80-20 split with stratification
- Cross-Validation: Used for model selection and hyperparameter tuning
- Evaluation Metrics: ROC-AUC as primary metric due to class imbalance
- Model Comparison: Systematic comparison across multiple algorithms
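The preprocessing and evaluation steps above can be combined into a single scikit-learn pipeline. The sketch below uses a small synthetic stand-in for the churn frame and a reduced column set; real code would load `churn.csv` and list all numerical and categorical columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn frame (hypothetical values).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
})
y = (rng.random(n) < 0.27).astype(int)  # ~27% churn

num_cols = ["tenure", "MonthlyCharges"]
cat_cols = ["Contract"]

# Median imputation + scaling for numerics; most-frequent + one-hot for categoricals.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])

# 80-20 stratified split, ROC-AUC as the primary metric.
X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2,
                                          stratify=y, random_state=42)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Keeping imputation, scaling, and encoding inside the pipeline ensures they are fit on the training fold only, which also makes the cross-validation scores honest.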
- Retention Strategy: Focus on month-to-month contract customers
- Service Bundling: Promote online security and backup services
- Payment Method: Encourage automatic payment methods over electronic checks
- Fiber Optic Issues: Investigate and address fiber optic service quality
- Customer Segmentation: Develop targeted campaigns for high-risk segments
The strongest churn indicators identified in the analysis:

- Contract type (Month-to-month highest risk)
- Internet service type (Fiber optic highest risk)
- Payment method (Electronic check highest risk)
- Lack of additional services
- Customer relationship status (No partner/dependents)
- Run EDA: Open `EDA.ipynb` to explore data patterns and insights
- Model Training: Use `experiment.ipynb` to train and compare models
- Prediction: Apply the best model (Logistic Regression with feature engineering) for new predictions
Potential future improvements:

- Hyperparameter Tuning: Grid search for optimal parameters
- Advanced Feature Engineering: Create interaction features, polynomial features
- Ensemble Methods: Combine multiple models for better performance
- Time Series Analysis: Analyze churn patterns over time
- Customer Lifetime Value: Incorporate CLV into churn prediction
Feel free to contribute by:
- Adding new feature engineering techniques
- Implementing advanced models
- Improving data visualization
- Adding more comprehensive evaluation metrics
Author: Data Science Project
Last Updated: July 2025