This project focuses on analyzing customer data to predict churn in a telecommunications company. The primary objective is to identify key factors that contribute to customer churn and to build a predictive model that can accurately identify customers who are likely to leave. By understanding the drivers of churn, businesses can implement targeted retention strategies to reduce customer attrition.
- Project Overview
- Objectives
- Data Source
- Methodology
- Libraries and Tools Used
- Key Findings and Results
- Conclusion
- Future Work
- How to Run
- Contributors
- To understand the patterns and characteristics of customers who churn versus those who do not.
- To identify the most significant features influencing customer churn.
- To build and evaluate various machine learning models for churn prediction.
- To provide actionable insights for customer retention strategies.
- To analyze the distribution of churn across different customer segments.
- Dataset: Download
- Description: The dataset was sourced from a telecom provider and contains 7043 customer records with 21 features. It includes customer demographics, account information, services subscribed, and churn status.
- Features: `customerID`, `gender`, `SeniorCitizen`, `Partner`, `Dependents`, `tenure`, `PhoneService`, `MultipleLines`, `InternetService`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, `StreamingMovies`, `Contract`, `PaperlessBilling`, `PaymentMethod`, `MonthlyCharges`, `TotalCharges`, `Churn`
- Target Variable: `Churn` (Yes/No)
The project follows a standard data science workflow:
- Loaded the dataset using Pandas from a CSV file.
- Performed initial checks:
  - Data Shape: (7043, 21)
  - Data Types: Mixed numeric and object types.
  - Basic Statistics: Examined numerical features using `df.describe()`.
  - Missing Values Check: Found 11 missing values in `TotalCharges`.
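The loading and inspection steps above can be sketched as follows. This is a minimal illustration using a tiny synthetic frame in place of the real CSV (load the actual dataset with `pd.read_csv()` on your copy of the file); the key point is that blank strings in `TotalCharges` hide as `object` dtype until coerced:

```python
import pandas as pd

# Tiny stand-in for the Telco dataframe (the real one is 7043 x 21).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # blank string hides a missing value
    "Churn": ["No", "No", "Yes", "No"],
})

print(df.shape)        # rows x columns
print(df.dtypes)       # TotalCharges shows up as object, not float
print(df.describe())   # basic statistics for the numeric columns only

# Coercing TotalCharges to numeric exposes the hidden missing values
# (11 of them in the full dataset).
missing = pd.to_numeric(df["TotalCharges"], errors="coerce").isna().sum()
print(missing)
```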
- Missing Values: Handled 11 missing values in `TotalCharges` by converting the column to numeric and dropping the resulting NA rows.
- Data Type Conversion: Converted `TotalCharges` from object to numeric type.
- Categorical Variable Encoding: Prepared categorical features for analysis (status: not yet encoded in the current notebook).
- Feature Scaling: Implemented.
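A sketch of these cleaning steps on a small synthetic frame (the real notebook applies them to the full Telco dataframe; the exact scaler is an assumption, `StandardScaler` being the common choice):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],
})

# Convert TotalCharges from object to numeric; blank strings become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Drop rows with missing TotalCharges (11 rows in the full dataset).
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)

# Scale the numerical features to zero mean and unit variance.
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df)
```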
- Churn Distribution: Visualized the target variable, revealing a distribution of 73.4% No Churn and 26.6% Churn.
- Univariate Analysis: Examined distributions of numerical features and counts for categorical features.
- Bivariate Analysis: Created countplots for all categorical features against churn status to observe relationships.
- Correlation Analysis: Implemented.
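The churn-distribution check above reduces to a normalized value count. A minimal sketch with synthetic labels (the real data shows 73.4% No / 26.6% Yes):

```python
import pandas as pd

# Synthetic Churn labels with a roughly similar class ratio.
churn = pd.Series(["No"] * 73 + ["Yes"] * 27, name="Churn")

# Proportion of each class.
dist = churn.value_counts(normalize=True)
print(dist)

# For the visual versions described above, one line each suffices:
#   dist.plot(kind="bar")                      # matplotlib bar chart
#   sns.countplot(data=df, x="Contract", hue="Churn")  # seaborn bivariate view
```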
- Feature Selection: All features were initially considered for analysis.
- New Features: Derived a tenure group feature by binning the numerical `tenure` column.
- Data Splitting: Used an 80/20 train-test split with cross-validation.
- Model Training: Trained Logistic Regression, Random Forest, XGBoost, AdaBoost, Gradient Boosting, and a simple Neural Network.
- Handling Class Imbalance: Applied SMOTE (Synthetic Minority Over-sampling Technique).
- Hyperparameter Tuning: Used Optuna to tune model hyperparameters.
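The split-and-train steps above can be sketched with scikit-learn on synthetic data (the bin edges, model settings, and `make_classification` stand-in are assumptions; the notebook trains on the preprocessed Telco frame). SMOTE and Optuna require extra packages, so they are shown as comments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a churn-like class imbalance (~73/27).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=42
)

# 80/20 train-test split, stratified to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train two of the listed models and compare AUC-ROC on the held-out set.
scores = {}
for name, model in {
    "logreg": LogisticRegression(max_iter=1000),
    "gboost": GradientBoostingClassifier(random_state=42),
}.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)

print(scores)

# Class imbalance is handled with SMOTE (imbalanced-learn), e.g.:
#   from imblearn.over_sampling import SMOTE
#   X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# Hyperparameters are tuned with Optuna, e.g.:
#   study = optuna.create_study(direction="maximize"); study.optimize(objective, n_trials=50)
```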
- Python >= 3.10
- Pandas: For data manipulation and analysis.
- NumPy: For numerical operations.
- Matplotlib: For basic visualization.
- Seaborn: For advanced statistical visualizations.
- Scikit-learn: Used for classification algorithms (Logistic Regression, Random Forest, Gradient Boosting, AdaBoost, Decision Trees), data preprocessing, model selection, and performance evaluation.
- XGBoost: For optimized gradient boosting models.
- LIME: For local interpretable model explanations.
- SHAP: For model interpretability and feature importance.
- Optuna: For hyperparameter tuning.
- TensorFlow/Keras: For building and training neural networks.
- Jupyter Notebook: As the development environment.
This project identified key churn drivers and built predictive models to flag high-risk customers. Gradient Boosting outperformed all other models on AUC-ROC and was selected as the final model. SHAP and LIME provided valuable insights into feature importance, supporting strategic business decisions.
- Targeted Retention: Offer personalized discounts or upgrades to high-risk customers.
- Proactive Support: Use alerts to flag churn risks and assign dedicated support.
- Customer Engagement: Send usage tips and educational content to boost satisfaction.
- Risk-Based Segmentation: Tailor campaigns by churn risk (high, medium, low).
- Real-Time Scoring: Deploy the model to enable live churn predictions and actions.
- Model Updates: Retrain with new data regularly for improved accuracy.