This project applies a complete data science workflow to segment customers based on household demographics, purchasing patterns, and response to marketing campaigns. The analysis is built around the Dunnhumby “The Complete Journey” dataset and uses a Decision Tree Regressor (CART) to model behavior and uncover actionable insights.
We used the publicly available Dunnhumby “The Complete Journey” dataset, which includes the following tables:
- transactions: Purchase history of households
- product: Product-level details and categories
- coupon: Coupon information
- hh_demographic: Household characteristics
- campaign_table: Campaign assignments
- causal_data: Display and promotion data
- campaign_desc: Campaign descriptions
- coupon_redempt: Coupon redemption records
Data Ingestion & Cleaning
- Read large CSVs in chunks to handle memory efficiently
- Cleaned nulls and inconsistencies across all tables
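The chunked read described above might look like the following minimal sketch. It is not the notebook's actual code: the helper `load_in_chunks`, the in-memory sample, the column names, and the rule of dropping rows with a missing `SALES_VALUE` are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large transactions CSV.
# A real run would pass a file path instead of an in-memory buffer.
SAMPLE_CSV = io.StringIO(
    "household_key,BASKET_ID,SALES_VALUE\n"
    "1,100,2.50\n"
    "1,100,\n"
    "2,101,4.00\n"
    "2,102,1.25\n"
    "3,103,3.00\n"
)


def load_in_chunks(source, chunksize=2):
    """Read a CSV in fixed-size chunks, clean each chunk, then concatenate."""
    cleaned = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Assumed cleaning rule: drop rows with a missing sales value.
        chunk = chunk.dropna(subset=["SALES_VALUE"])
        cleaned.append(chunk)
    return pd.concat(cleaned, ignore_index=True)


transactions = load_in_chunks(SAMPLE_CSV)
print(transactions.shape)
```

Cleaning inside the loop means only one raw chunk is held in memory at a time, at the cost of a final concatenation of the cleaned chunks.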
Data Integration
- Joined tables to construct household-level profiles
- Created an entity-relationship diagram (ERD) to document how the tables connect
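As an illustration of building household-level profiles, the sketch below aggregates transactions to one row per household and joins the result to demographics. The tiny tables, the column names (`household_key`, `BASKET_ID`, `SALES_VALUE`, `INCOME_DESC`), and the choice of a left join are assumptions, not the project's confirmed join logic.

```python
import pandas as pd

# Tiny hypothetical tables; column names are assumed for illustration.
transactions = pd.DataFrame(
    {
        "household_key": [1, 1, 2, 3],
        "BASKET_ID": [100, 101, 102, 103],
        "SALES_VALUE": [2.5, 4.0, 1.25, 3.0],
    }
)
hh_demographic = pd.DataFrame(
    {"household_key": [1, 2], "INCOME_DESC": ["35-49K", "50-74K"]}
)

# Aggregate transactions to one row per household.
spend = transactions.groupby("household_key", as_index=False).agg(
    total_spend=("SALES_VALUE", "sum"),
    n_baskets=("BASKET_ID", "nunique"),
)

# A left join keeps households that have no demographic record (household 3 here).
# validate="one_to_one" raises if either side has duplicate household keys.
profiles = spend.merge(
    hh_demographic, on="household_key", how="left", validate="one_to_one"
)
print(profiles)
```

Aggregating before the join keeps the merge one-to-one, which avoids accidentally duplicating demographic rows across every transaction.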
Feature Engineering
- Metrics:
  - Average Transaction Value (ATV)
  - Average Basket Size (ABS)
  - Average Price Point (APP)
  - Visit Frequency & Latency
  - Coupon Redemption Rate
  - Response to promotions
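The README does not define these metrics, so the sketch below uses common retail definitions as assumptions: ATV as sales per basket, ABS as units per basket, APP as sales per unit, visit frequency as the number of distinct visit days, and latency as the mean gap in days between consecutive visits. The sample data and column names are hypothetical, and only a subset of the listed metrics is covered.

```python
import pandas as pd

# Hypothetical transaction lines; column names are assumed for illustration.
tx = pd.DataFrame(
    {
        "household_key": [1, 1, 1, 2, 2],
        "BASKET_ID": [10, 10, 11, 20, 21],
        "DAY": [1, 1, 5, 2, 4],
        "QUANTITY": [2, 1, 3, 1, 1],
        "SALES_VALUE": [4.0, 2.0, 6.0, 5.0, 3.0],
    }
)

# Per-household totals that the ratio metrics are built from.
totals = tx.groupby("household_key").agg(
    sales=("SALES_VALUE", "sum"),
    units=("QUANTITY", "sum"),
    baskets=("BASKET_ID", "nunique"),
    visit_days=("DAY", "nunique"),
)

features = pd.DataFrame(
    {
        "atv": totals["sales"] / totals["baskets"],  # assumed: sales per basket
        "abs": totals["units"] / totals["baskets"],  # assumed: units per basket
        "app": totals["sales"] / totals["units"],    # assumed: sales per unit
        "visit_frequency": totals["visit_days"],     # assumed: distinct visit days
    }
)

# Assumed latency: mean gap in days between consecutive distinct visit days.
visit_days = (
    tx[["household_key", "DAY"]]
    .drop_duplicates()
    .sort_values(["household_key", "DAY"])
)
features["latency"] = visit_days.groupby("household_key")["DAY"].apply(
    lambda days: days.diff().mean()
)
print(features)
```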
Exploratory Data Analysis
- Visualized key metrics using seaborn and matplotlib
- Identified behavioral patterns across customer segments
Modeling
- Normalized the feature set for training
- Trained a CART (Decision Tree Regressor) model
- Interpreted tree outputs to define segmentation logic
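One way the modeling steps could fit together is sketched below with scikit-learn. The README does not state the regression target, the hyperparameters, or how tree outputs map to segments, so the synthetic data, the spend-like target, `max_depth=3`, `min_samples_leaf=10`, and the "one leaf equals one segment" rule are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic household features (hypothetical; the real ones come from the dataset).
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame(
    {
        "atv": rng.uniform(5, 60, n),
        "visit_frequency": rng.integers(1, 50, n),
        "coupon_redemption_rate": rng.uniform(0, 1, n),
    }
)
# Assumed target: a noisy spend-like value. The project's real target is not stated.
target = features["atv"] * features["visit_frequency"] + rng.normal(0, 10, n)

# Normalize features to zero mean and unit variance.
X = StandardScaler().fit_transform(features)

# Shallow tree so the resulting rules stay readable.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10, random_state=0)
tree.fit(X, target)

# Assumed segmentation rule: each leaf of the fitted tree is one segment.
features["segment"] = tree.apply(X)

# Print the split rules and the average feature values per segment.
print(export_text(tree, feature_names=list(features.columns[:-1])))
print(features.groupby("segment").mean().round(2))
```

Tree splits depend only on the ordering of each feature's values, so scaling does not change which households end up in which leaf; the printed thresholds are simply expressed in standardized units. Setting `min_samples_leaf` guards against segments that contain only a handful of households.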
Output
- CSV exports for intermediate and final cleaned datasets
- Segment profiles with distinct behavioral traits
- Python 3.x
- pandas, numpy, sqlalchemy
- matplotlib, seaborn
- scikit-learn
- tqdm, gdown
- Clone the repo
- Ensure all dependencies are installed (requirements.txt)
- Open and run the notebook: DSML_Customer_Segmentation_main_20250502.ipynb
- Inspect exported CSVs and visualizations
- Gratus Richard Anthuvan Rosario
- Fahad M Mujawar
- Sachin Joseph Fernando
This project was developed as part of an academic assessment for the Data Science with Machine Learning module (COMP4030).
For academic and educational use only.