This project builds a machine learning model to predict customer churn for a video streaming service. Subscription services often struggle to retain customers, and predicting churn makes targeted interventions possible. This repository contains a complete data science workflow for the problem, using a dataset provided for the challenge.
The project is organized into the following phases:
- Data Loading: Load the train and test datasets.
- Exploratory Data Analysis (EDA): Analyze the data to understand distributions, correlations, and basic statistics.
- Data Cleaning: Handle missing values and prepare the data for modeling.
- Feature Selection: Select important features based on correlation and feature importance from a RandomForest model.
- Model Building: Build a RandomForestClassifier to predict churn.
- Model Evaluation: Evaluate the model using cross-validation.
- Prediction: Make predictions on the test dataset.
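The phases above can be sketched end to end as follows. This is a minimal illustration, not the repository's actual code: it uses a small synthetic frame in place of train.csv and test.csv, and the column names `feature_a`, `feature_b`, and `noise`, the median imputation, and the top-2 importance cutoff are all assumptions made for the example. Only the `Churn` target name and the use of a RandomForest come from this README.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_frame(n, with_target=True):
    """Build a synthetic stand-in for train.csv / test.csv (hypothetical columns)."""
    df = pd.DataFrame({
        "feature_a": rng.normal(size=n),
        "feature_b": rng.normal(size=n),
        "noise": rng.normal(size=n),
    })
    if with_target:
        # Churn depends on feature_a and feature_b only; "noise" is irrelevant.
        signal = 1.5 * df["feature_a"] - 1.0 * df["feature_b"]
        df["Churn"] = (signal + rng.normal(size=n) > 0).astype(int)
    # Blank out 5% of one column to mimic missing values.
    missing_rows = df.sample(frac=0.05, random_state=0).index
    df.loc[missing_rows, "feature_a"] = np.nan
    return df

# Data Loading (synthetic here; the real project reads the CSV files).
train = make_frame(1000)
test = make_frame(300, with_target=False)

# Data Cleaning: fill missing values with medians learned from the training set.
feature_cols = [c for c in train.columns if c != "Churn"]
medians = train[feature_cols].median()
X_train = train[feature_cols].fillna(medians)
X_test = test[feature_cols].fillna(medians)
y_train = train["Churn"]

# Feature Selection: rank features by RandomForest importance, keep the top 2.
ranker = RandomForestClassifier(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)
importances = pd.Series(ranker.feature_importances_, index=feature_cols)
selected = importances.nlargest(2).index.tolist()

# Model Building and Prediction: refit on the selected features and
# output the predicted churn probability for each test row.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train[selected], y_train)
churn_proba = model.predict_proba(X_test[selected])[:, 1]
```

Learning the imputation medians from the training set and reusing them on the test set keeps information about the test data out of the fitted model.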
- train.csv: Training dataset with 243,787 subscriptions and the target variable Churn.
- test.csv: Test dataset with 104,480 subscriptions for which predictions are to be made.
- data_descriptions.csv: Description of the dataset features.
- Python 3.6+
- Libraries: pandas, numpy, seaborn, matplotlib, scikit-learn, shap
- Clone the repository:
git clone https://github.com/your-username/churn-management-model.git
cd churn-management-model
- Install the required libraries:
pip install pandas numpy seaborn matplotlib scikit-learn shap
The model's performance is evaluated using ROC AUC. Mean CV ROC AUC: 0.7
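A mean cross-validated ROC AUC like the one reported above could be computed along these lines. This is a sketch under stated assumptions: the README does not say how many folds were used or how they were split, so the 5-fold stratified split here is hypothetical, and `make_classification` stands in for the cleaned training features and Churn labels. The score it prints will not match the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the cleaned training features and Churn labels.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Assumed setup: 5 stratified folds, so each fold keeps the class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="roc_auc" scores the predicted probabilities, not hard class labels.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
mean_auc = scores.mean()
print(f"Mean CV ROC AUC: {mean_auc:.3f}")
```

An ROC AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking of churners above non-churners, which gives a scale for reading the reported value.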