This project involves analyzing and segmenting customers of a UK-based online giftware store. The analysis focuses on feature engineering, K-Means clustering, and visualization techniques to derive insights into customer behavior.
- Python
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
The dataset for this analysis is Online Retail II, available from the UCI Machine Learning Repository.
- Data Exploration: Initial exploration of the dataset, including checking for missing values and data types.
- Data Cleaning: Cleaning the dataset by removing invalid entries and handling missing values.
- Feature Engineering: Creating new features that help in understanding customer behavior.
- K-Means Clustering: Implementing K-Means clustering to segment customers based on their purchasing patterns.
- Visualization: Creating visual representations of the clustered data for better understanding.
- Loaded the dataset and performed preliminary checks.
- Noted the presence of null values in the `Customer ID` column and negative values in the `Quantity` and `Price` columns.
- Observed that some invoices and stock codes did not conform to expected formats.
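
The exploration checks above can be sketched roughly as follows. This is a minimal illustration on toy rows, not the project's actual notebook code. The `Invoice` and `StockCode` column names are assumed from the Online Retail II layout, and the stock-code pattern is an assumption, since the expected format is not spelled out here.

```python
import pandas as pd

# Toy rows standing in for the real data (illustrative values only).
# Column names are assumed to follow the Online Retail II layout.
df = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "A563185"],
    "StockCode": ["85048", "22087", "POST", "B"],
    "Quantity": [12, -6, 3, 1],
    "Price": [6.95, 2.95, 18.0, -10.0],
    "Customer ID": [13085.0, 16321.0, None, None],
})

# Count missing customer IDs and negative quantities/prices.
null_customers = int(df["Customer ID"].isna().sum())
negative_quantities = int((df["Quantity"] < 0).sum())
negative_prices = int((df["Price"] < 0).sum())

# Invoices that are not exactly six digits (for example, letter-prefixed ones).
odd_invoices = df[~df["Invoice"].astype(str).str.match(r"^\d{6}$")]

# Stock codes that are not five digits optionally followed by letters.
# This pattern is an assumed "expected format", used here for illustration.
odd_stock_codes = df[~df["StockCode"].astype(str).str.match(r"^\d{5}[A-Za-z]*$")]

print(null_customers, negative_quantities, negative_prices)
print(odd_invoices["Invoice"].tolist())
print(odd_stock_codes["StockCode"].tolist())
```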
- Removed rows with null `Customer ID` values and negative quantities.
- Dropped transactions with zero or negative prices.
- Filtered the dataset to retain only valid invoices (6-digit numbers).
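
One way the cleaning rules above might be expressed in Pandas is shown below. The helper name `clean_transactions` is hypothetical, the `Invoice` column name is assumed from the Online Retail II layout, and the sample rows are toy values.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules listed above (hypothetical helper)."""
    # Drop rows without a customer identifier.
    cleaned = df.dropna(subset=["Customer ID"])
    # Drop negative quantities.
    cleaned = cleaned[cleaned["Quantity"] >= 0]
    # Drop zero or negative prices.
    cleaned = cleaned[cleaned["Price"] > 0]
    # Keep only invoices that are exactly six digits.
    is_valid_invoice = cleaned["Invoice"].astype(str).str.match(r"^\d{6}$")
    return cleaned[is_valid_invoice].reset_index(drop=True)


# Toy rows for illustration only.
raw = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "489436"],
    "Quantity": [12, -6, 3, 4],
    "Price": [6.95, 2.95, 18.0, 0.0],
    "Customer ID": [13085.0, 16321.0, None, 13078.0],
})

cleaned = clean_transactions(raw)
print(cleaned)
```

Only the first toy row passes all four filters: the second is a letter-prefixed invoice with a negative quantity, the third has no customer ID, and the fourth has a zero price.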
- Calculated `SalesLineTotal` as the product of `Quantity` and `Price`.
- Aggregated data by `Customer ID` to compute:
  - Monetary Value: Total spending per customer.
  - Frequency: Number of unique invoices per customer.
  - Last Invoice Date: The most recent transaction date per customer.
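
The feature engineering steps above could look roughly like the sketch below. The `Invoice` and `InvoiceDate` column names are assumed from the Online Retail II layout, and the output column names are illustrative. The recency definition (days between the latest invoice date in the data and each customer's last invoice) is also an assumption, since it is only described as derived from the last invoice date.

```python
import pandas as pd

# Toy transactions for illustration only.
df = pd.DataFrame({
    "Invoice": ["489434", "489434", "489500", "489600"],
    "Quantity": [2, 1, 5, 10],
    "Price": [3.0, 4.0, 2.0, 1.5],
    "Customer ID": [1.0, 1.0, 1.0, 2.0],
    "InvoiceDate": pd.to_datetime(
        ["2010-01-05", "2010-01-05", "2010-02-01", "2010-01-20"]
    ),
})

# Line total: quantity times unit price.
df["SalesLineTotal"] = df["Quantity"] * df["Price"]

# One row per customer with the three aggregates.
customers = df.groupby("Customer ID", as_index=False).agg(
    MonetaryValue=("SalesLineTotal", "sum"),
    Frequency=("Invoice", "nunique"),
    LastInvoiceDate=("InvoiceDate", "max"),
)

# Assumed recency definition: days since the customer's last invoice,
# measured from the most recent invoice date in the data.
reference_date = customers["LastInvoiceDate"].max()
customers["Recency"] = (reference_date - customers["LastInvoiceDate"]).dt.days

print(customers)
```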
- Applied K-Means clustering using the features `Monetary Value`, `Frequency`, and `Recency` (derived from `Last Invoice Date`).
- Evaluated the clustering results using silhouette scores to determine the optimal number of clusters.
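
A minimal sketch of the clustering and silhouette evaluation described above follows. The features are synthetic. Standardizing them with `StandardScaler`, the candidate range of 2 to 6 clusters, and the fixed `random_state` are assumptions made for this example rather than details taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic customers drawn around three well-separated centers
# (illustrative values, not results from the real dataset).
rng = np.random.default_rng(0)
centers = np.array([
    [100.0, 2.0, 200.0],
    [1000.0, 10.0, 30.0],
    [5000.0, 40.0, 5.0],
])
points = np.vstack(
    [c + rng.normal(scale=[10.0, 0.5, 3.0], size=(30, 3)) for c in centers]
)
features = pd.DataFrame(points, columns=["MonetaryValue", "Frequency", "Recency"])

# Assumption: put the features on a common scale so that monetary value
# does not dominate the Euclidean distances used by K-Means.
scaled = StandardScaler().fit_transform(features)

# Fit K-Means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

# Pick the k with the highest silhouette score and assign final labels.
best_k = max(scores, key=scores.get)
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=42)
features["Cluster"] = final_model.fit_predict(scaled)

print(scores)
print("best k:", best_k)
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. On real data it is usually worth reading it alongside other evidence, such as an inertia (elbow) plot, rather than relying on it alone.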
- Created visualizations to depict clusters and customer segments.
- Used Seaborn and Matplotlib to produce the plots.
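
As a rough illustration of the kind of plot described above, the sketch below draws a Seaborn scatter plot of two features colored by cluster. The data frame, its column names, and the choice of axes are hypothetical, so the project's actual figures may differ.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy clustered customers for illustration only.
clustered = pd.DataFrame({
    "MonetaryValue": [120.0, 90.0, 1100.0, 950.0, 5200.0, 4800.0],
    "Frequency": [2, 1, 9, 11, 38, 42],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One point per customer, colored by assigned cluster.
sns.scatterplot(
    data=clustered,
    x="MonetaryValue",
    y="Frequency",
    hue="Cluster",
    palette="tab10",
    ax=ax,
)
ax.set_xlabel("Monetary Value")
ax.set_ylabel("Frequency (unique invoices)")
ax.set_title("Customer segments (toy data)")
fig.tight_layout()
```

Calling `plt.show()` or `fig.savefig(...)` afterwards displays or saves the figure.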
- The dataset had significant missing values in the `Customer ID` column, which required careful handling.
- Negative quantities and prices indicated potential data entry errors or cancellations, which were filtered out during data cleaning.
- The analysis revealed distinct customer segments that could be targeted with tailored marketing strategies.
The project successfully segmented customers of the giftware store, providing valuable insights into purchasing behavior. These insights can guide marketing efforts and improve customer relationship management.
- Further refining the clustering algorithm to enhance segmentation accuracy.
- Incorporating additional features like customer demographics for deeper analysis.
- Developing a predictive model to forecast future customer behavior.
Thanks to the UCI Machine Learning Repository for providing the dataset used in this analysis.