Case iFood - Data Analyst Case

The Company

Consider a well-established company operating in the retail food sector. Presently they have around several hundred thousand registered customers and serve almost one million consumers a year. They sell products from 5 major categories: wines, rare meat products, exotic fruits, specially prepared fish and sweet products. These can further be divided into gold and regular products. The customers can order and acquire products through 3 sales channels: physical stores, catalogs and company’s website. Globally, the company had solid revenues and a healthy bottom line in the past 3 years, but the profit growth perspectives for the next 3 years are not promising... For this reason, several strategic initiatives are being considered to invert this situation. One is to improve the performance of marketing activities, with a special focus on marketing campaigns.

The original text is here: PDF



A project aimed at improving my skills in machine learning, classification and clustering, based on the iFood Data Analyst selection process available in this repository.

Objectives

The objective of this project is to practice structuring a Data Science project around a case drawn from real life.

During the analysis it became clear how important exploratory data analysis and preprocessing are.

Detailed objectives:

  • Build a robust exploratory analysis.
  • Segment customers from the provided database.
  • Build a classification model to predict whether a customer will purchase the product offered in the campaign.
  • Present a Data Science project structure, using notebooks, scripts, reports and a repository on GitHub.
  • Present good programming practices in Python, such as the use of functions and script files to facilitate code reuse.
  • Show good practices for using Scikit-Learn, such as the use of pipelines and hyperparameter optimization.

Figure: correlation matrix of the dataset variables


Repository structure


├── .gitignore          <- Files and directories ignored by Git
├── environment.yml     <- Requirements file to reproduce the analysis environment
├── LICENSE             <- License file
├── README.md           <- The main README for developers using this project
│
├── data                <- Data files for the project
│
├── model               <- Trained and serialized models, model predictions and model summaries
│
├── notebooks           <- Jupyter notebooks; the naming convention is a number prefix (for sorting)
│   │
│   └── src             <- Source code for use in this project
│       │
│       ├── __init__.py <- Makes src a Python package
│       ├── config.py   <- Basic project settings
│       └── graphics.py <- Scripts for creating exploratory and results-oriented visualizations
│
├── references          <- Data dictionaries, manuals and all other explanatory materials
│
└── images              <- Graphs and figures generated for use in reports

Details of the dataset used and summary of the results

A detailed description of the dataset used is available here.

Using a pipeline with preprocessing, PCA and K-Means, the customer base was segmented into 3 clusters:

Figure: customer base segmented into 3 clusters
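
Below is a minimal sketch of such a segmentation pipeline with Scikit-Learn. The file path, feature selection and PCA settings are illustrative assumptions and may differ from what the notebooks actually use; only the 3 clusters come from the result above.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical file name; the actual file in data/ may differ
df = pd.read_csv("data/customers.csv")
features = df.select_dtypes("number").dropna()

segmentation = Pipeline([
    ("scaler", StandardScaler()),                        # preprocessing
    ("pca", PCA(n_components=2)),                        # illustrative number of components
    ("kmeans", KMeans(n_clusters=3, random_state=42)),   # 3 clusters, as reported above
])

# Pipeline.fit_predict applies the transforms and then KMeans.fit_predict
features["cluster"] = segmentation.fit_predict(features)
print(features["cluster"].value_counts())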

Viewing Cluster Profiles:

Figure: bar plot of the cluster profiles

Cluster analysis:

Cluster 0:

  • High income
  • High spending
  • Very likely to have no children
  • More likely to accept campaigns
  • No customers with only a basic level of education
  • No age profile that stands out

Cluster 1:

  • Low income
  • Low spending
  • Likely to have children
  • Low propensity to accept campaigns
  • Only cluster with a significant percentage of people with basic education
  • Younger people

Cluster 2:

  • Intermediate income
  • Intermediate spending
  • Likely to have children
  • May accept campaigns
  • Older people

Subsequently, three classification models were trained to predict whether a customer will purchase the product offered in the campaign. The models used were:

  • Logistic Regression
  • Decision Tree
  • KNN

A DummyClassifier was used as a baseline. The models were compared based on 6 metrics:

Figure: comparison of the models across the 6 metrics
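
The sketch below shows one way to set up such a comparison with cross-validation. The file path, the target column name ("Response"), the model settings and the metric list are assumptions for illustration; the notebooks define the exact choices.

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical path and assumed target column name ("Response")
df = pd.read_csv("data/customers.csv").select_dtypes("number").dropna()
X = df.drop(columns="Response")
y = df["Response"]

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
}

# Illustrative set of 6 metrics; the ones in the report may differ
scoring = ["accuracy", "balanced_accuracy", "precision", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring})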

Based on this comparison, the Logistic Regression model was chosen to undergo hyperparameter optimization.
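
A minimal sketch of this step is shown below, reusing X and y from the comparison sketch above; the parameter grid and the scoring metric are illustrative assumptions, not necessarily those used in the notebooks.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical search space; the grid actually used may differ
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__solver": ["lbfgs", "liblinear"],
}

search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=5, n_jobs=-1)
search.fit(X, y)  # X, y as in the model-comparison sketch above
print(search.best_params_)
print(search.best_score_)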

How to reproduce the project

The project was developed using Python 3.13.2. To reproduce it, create a virtual environment with Python 3.13.2 using Conda or a similar tool (the provided environment.yml can be used, e.g. conda env create -f environment.yml) and install the libraries below:

Versions of the packages:


 Package             |  Version  
-------------------- | ----------
Imbalanced-Learn     |     0.13.0
Matplotlib           |     3.10.1
NumPy                |      2.2.4
Pandas               |      2.2.3
Scikit-Learn         |      1.6.1
Seaborn              |     0.13.2

Python version: 3.13.2
