The Company
Consider a well-established company operating in the retail food sector. It currently has several hundred thousand registered customers and serves almost one million consumers a year. It sells products from 5 major categories: wines, rare meat products, exotic fruits, specially prepared fish and sweet products. These can further be divided into gold and regular products. Customers can order and acquire products through 3 sales channels: physical stores, catalogs and the company's website. Globally, the company had solid revenues and a healthy bottom line in the past 3 years, but the profit growth prospects for the next 3 years are not promising. For this reason, several strategic initiatives are being considered to reverse this situation. One is to improve the performance of marketing activities, with a special focus on marketing campaigns.
The original text is here: PDF
This project aims to improve my skills in machine learning, classification and clustering. It is based on the iFood Data Analyst selection process, available in this repository.
The objective is to practice structuring a project around a real-world case.
The analysis highlighted the importance of exploratory data analysis and preprocessing.
Detailed objectives:
- Build a robust exploratory analysis.
- Segment customers from the provided database.
- Build a classification model to predict whether a customer will purchase the product offered in the campaign.
- Present a Data Science project structure, using notebooks, scripts, reports and a repository on GitHub.
- Present good programming practices in Python, such as the use of functions and script files to facilitate code reuse.
- Show good practices for using SciKit-Learn, such as the use of pipelines and hyperparameter optimization.
├── .gitignore         <- Files and directories to be ignored by Git
├── environment.yml    <- The requirements file to reproduce the analysis environment
├── LICENSE            <- Project license
├── README.md          <- Main README for developers using this project.
│
├── data               <- Data files for the project.
│
├── model              <- Trained and serialized models, model predictions or model summaries
│
├── notebooks          <- Jupyter notebooks. The naming convention is a number (for sorting)
│   │
│   └── src            <- Source code for use in this project.
│       │
│       ├── __init__.py    <- Makes src a Python package
│       ├── config.py      <- Basic project settings
│       └── graphics.py    <- Scripts for creating exploratory and results-oriented visualizations
│
├── references         <- Data dictionaries, manuals and all other explanatory materials.
│
└── images             <- Graphs and figures generated for use in reports
A detailed description of the dataset used is available here.
Using a pipeline that combines preprocessing, PCA and K-Means, the customer base was segmented into 3 clusters:
Viewing Cluster Profiles:
Cluster analysis:
Cluster 0:
- High income
- High spending
- Very likely to have no children
- More likely to accept campaigns
- No customers with basic education
- No age profile that stands out
Cluster 1:
- Low income
- Low spending
- Likely to have children
- Low propensity to accept campaigns
- Only cluster with a significant percentage of people with basic education
- Younger people
Cluster 2:
- Intermediate income
- Intermediate spending
- Likely to have children
- May accept campaigns
- Older people
Subsequently, three classification models were trained to predict whether a customer will purchase the product offered in the campaign. The models used were:
- Logistic Regression
- Decision Tree
- KNN
A DummyClassifier was used as a baseline. The models were compared based on 6 metrics:
Based on this comparison, the Logistic Regression model was chosen to undergo hyperparameter optimization.
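The comparison and tuning workflow described above might look like the sketch below. It is illustrative only: the data is synthetic, the four metrics shown are an assumed subset (the project's 6 metrics are not reproduced here), and the hyperparameter grid is a hypothetical example rather than the grid actually used.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the campaign-response data.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "roc_auc"]  # assumed subset of metrics

classifiers = {
    "DummyClassifier": DummyClassifier(strategy="stratified", random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Cross-validate every candidate inside the same preprocessing pipeline.
results = {}
for name, classifier in classifiers.items():
    model = Pipeline([("scaler", StandardScaler()), ("classifier", classifier)])
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

for name, metrics in results.items():
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})

# Tune the selected model; the grid values below are hypothetical.
search = GridSearchCV(
    Pipeline(
        [
            ("scaler", StandardScaler()),
            ("classifier", LogisticRegression(max_iter=1000)),
        ]
    ),
    param_grid={
        "classifier__C": [0.01, 0.1, 1, 10],
        "classifier__class_weight": [None, "balanced"],
    },
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Placing the scaler inside the pipeline ensures it is refit on each training fold, so no information from the validation fold leaks into preprocessing.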
The project was developed using Python 3.13.2. To reproduce the project, create a virtual environment with Conda, or a similar tool, with Python 3.13.2 and install the libraries below:
Versions of the packages:
Package | Version
-------------------- | ----------
Imbalanced-Learn | 0.13.0
Matplotlib | 3.10.1
NumPy | 2.2.4
Pandas | 2.2.3
Scikit-Learn | 1.6.1
Seaborn | 0.13.2
Python version: 3.13.2
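After creating the environment, one way to confirm that the installed packages match the versions listed above is a small check based on the standard library's `importlib.metadata`. This is an optional sketch: the helper name `find_version_mismatches` is hypothetical, and the keys are assumed to be the PyPI distribution names of the packages in the table.

```python
from importlib import metadata

# Versions from the table above, keyed by (assumed) PyPI distribution name.
EXPECTED_VERSIONS = {
    "imbalanced-learn": "0.13.0",
    "matplotlib": "3.10.1",
    "numpy": "2.2.4",
    "pandas": "2.2.3",
    "scikit-learn": "1.6.1",
    "seaborn": "0.13.2",
}


def find_version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
        print(f"{package}: expected {wanted}, found {installed or 'not installed'}")
```

An empty result means the environment matches the listed versions exactly.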