This repository contains the end-to-end process of ingesting, cleaning, transforming, analyzing, and predicting sales trends for a retail store. The analysis and transformations were performed on Databricks clusters with PySpark and SQL, and MLflow was used to track the machine learning experiments for predictive modeling.
- Overview
- Dataset
- Technologies Used
- Project Structure
- Data Lake Architecture
- Data Preparation
- Data Analysis
- Visualizations & Dashboard
- Machine Learning Model & MLflow Tracking
- How to Reproduce
- Future Improvements
- License
This project demonstrates how to:
- Store and clean raw sales data in a data lake (Databricks FileStore).
- Analyze and visualize trends using PySpark DataFrames and SQL queries.
- Predict future sales trends using Machine Learning with MLflow tracking.
- Deploy the trained model to predict sales for upcoming months.
The dataset contains the following features:
- Order ID
- Product
- Quantity Ordered
- Price Each
- Order Date
- Purchase Address
- Derived Features: City, State, ReportYear, ReportMonth
You can find more details on the columns and data in the Retail-Sales-Data-Preparation.ipynb notebook and the SQL scripts.
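For orientation, a minimal PySpark schema matching these columns could look like the sketch below; the field names follow the list above, and reading everything as strings is an assumption (the preparation notebook casts types afterwards).

```python
from pyspark.sql.types import StructType, StructField, StringType

# All raw CSV columns are treated as strings here (an assumption);
# numeric and timestamp casts happen later during data preparation.
raw_schema = StructType([
    StructField("Order ID", StringType(), True),
    StructField("Product", StringType(), True),
    StructField("Quantity Ordered", StringType(), True),
    StructField("Price Each", StringType(), True),
    StructField("Order Date", StringType(), True),
    StructField("Purchase Address", StringType(), True),
])
```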
- Databricks for cluster management, notebooks, SQL analytics, experiment tracking, and deployment
- PySpark for data manipulation and transformations
- SQL (Databricks SQL) for analytical queries
- Databricks File System (DBFS) as the data lake storage location
- Parquet / Delta Lake for optimized storage and querying
- MLflow for model tracking and logging
- Scikit-learn for the prediction model
- Retail-Sales-Data-Preparation.ipynb
  Contains the PySpark code for reading the raw CSV data, cleaning columns, handling missing values, and applying transformations.
- Retail-Sales-Queries.sql
  A collection of SQL queries that explore the sales data, run summary statistics, and handle intermediate analysis steps.
- Retail-Sales-Analytics.sql
  The final set of SQL queries that derive key business metrics, such as monthly sales, product-level aggregations, and top-selling categories.
- Retail-Sales-Predictive-Analytics.ipynb
  Uses PySpark to predict monthly sales by city with a linear regression model. The experiments were tracked using MLflow, and the trained model was saved for future predictions.
- Ingest
  - The raw CSV data is uploaded to DBFS at a path such as `dbfs:/FileStore/salesdata/input/`. This location represents the raw zone of our data lake.
- Transform
  - Using PySpark in a Databricks notebook, the data is loaded from the raw zone.
  - Columns are cast (e.g. Quantity Ordered from `String` to `Int`), corrupted rows are cleaned, new columns are added, and other anomalies are fixed (a PySpark sketch of this flow appears after this list).
- Publish
  - The refined data is saved as Parquet / Delta Lake files in `dbfs:/FileStore/salesdata/published/`.
  - This acts as the curated zone of the data lake, optimized for analytics.
- Analyze
  - The published data is queried via Databricks SQL or Spark DataFrames, enabling fast aggregations and computations.
- Prediction
  - Using PySpark ML, a pipeline transforms features into a format suitable for ML model training.
  - A linear regression model is defined within a workflow that predicts retail sales, with experiments tracked through evaluation metrics and logging.
- Deployment
  - From the MLflow tracking server, the trained model is retrieved to make real-time predictions.
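Below is a minimal PySpark sketch of this raw-to-curated flow. It assumes a Databricks notebook where `spark` is already defined; the timestamp format and exact column logic are assumptions, not the notebook's verbatim code.

```python
from pyspark.sql.functions import col, to_timestamp, year, month

# Ingest: read raw CSVs from the raw zone of the data lake.
raw_df = (spark.read
          .option("header", True)
          .csv("dbfs:/FileStore/salesdata/input/"))

# Transform: drop header rows that slipped into the data, cast types,
# and derive the reporting columns (the timestamp format is an assumption).
clean_df = (raw_df
            .filter(col("Order ID") != "Order ID")
            .withColumn("Quantity Ordered", col("Quantity Ordered").cast("int"))
            .withColumn("Price Each", col("Price Each").cast("double"))
            .withColumn("Order Date", to_timestamp(col("Order Date"), "MM/dd/yy HH:mm"))
            .withColumn("ReportYear", year(col("Order Date")))
            .withColumn("ReportMonth", month(col("Order Date"))))

# Publish: write the curated zone, partitioned for efficient querying.
(clean_df.write
 .mode("overwrite")
 .partitionBy("ReportYear", "ReportMonth")
 .parquet("dbfs:/FileStore/salesdata/published/"))
```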
The Retail-Sales-Data-Preparation.ipynb notebook includes:
- Reading the Raw CSV Files
- Cleaning & Transformations
  - Removing header rows that slipped into the data
  - Converting columns to proper data types
  - Splitting out `Purchase Address` into `City`, `State`, etc. (see the sketch below)
- Writing to the Curated Zone
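For example, the address split might be expressed as in the sketch below, assuming `clean_df` holds the typed rows; the address shape ("street, city, STATE ZIP") is an assumption about the raw data, not quoted from the repo.

```python
from pyspark.sql.functions import col, split, trim

# A "Purchase Address" value is assumed to look like "<street>, <city>, <STATE ZIP>",
# so splitting on ", " yields street, city, and state/zip segments.
address_parts = split(col("Purchase Address"), ", ")

located_df = (clean_df
              .withColumn("City", trim(address_parts.getItem(1)))
              .withColumn("State", split(address_parts.getItem(2), " ").getItem(0)))
```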
Analysis was performed via PySpark DataFrame APIs and SQL:
- Exploratory Queries
  - Finding total revenue by month (illustrated in the query sketch below)
  - Pinpointing peak ordering times
  - Identifying popular product combinations in sales orders
- Advanced Analytics
  - Partitioning data by Year/Month for efficient queries
  - Summarizing results in aggregated tables
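As an example of the exploratory queries, the sketch below computes total revenue by month; it assumes the published data has been registered as a table named `sales` (a hypothetical name) with the derived `ReportYear`/`ReportMonth` columns.

```python
# `sales` is a hypothetical table/view name for the published, curated data.
monthly_revenue = spark.sql("""
    SELECT ReportYear,
           ReportMonth,
           ROUND(SUM(`Quantity Ordered` * `Price Each`), 2) AS total_revenue
    FROM sales
    GROUP BY ReportYear, ReportMonth
    ORDER BY ReportYear, ReportMonth
""")
monthly_revenue.show()
```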
The final step was creating a dashboard in Databricks to visualize:
- Monthly Revenue Trends
- Peak Ordering Times
- Geographic Breakdown of Orders
- Popular Product Combinations
- Preprocessed the data with feature engineering techniques
- Trained a Linear Regression model to predict monthly sales by city
- Logged experiments in MLflow with evaluation metrics
- Saved model and loaded it for future sales predictions
- Deployed an MLflow pipeline to track and compare multiple models (see the training sketch below)
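A minimal sketch of what such a tracked training run could look like with PySpark ML and MLflow is shown below; the input DataFrame `monthly_sales_df` and the `MonthlySales` label column are assumptions, and the notebook's actual features and parameters may differ.

```python
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Assumed input: monthly_sales_df with one row per (City, ReportMonth) and a numeric MonthlySales label.
train_df, test_df = monthly_sales_df.randomSplit([0.8, 0.2], seed=42)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="City", outputCol="CityIndex", handleInvalid="keep"),
    VectorAssembler(inputCols=["CityIndex", "ReportMonth"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="MonthlySales"),
])

with mlflow.start_run(run_name="monthly-sales-linear-regression"):
    model = pipeline.fit(train_df)
    predictions = model.transform(test_df)
    rmse = RegressionEvaluator(labelCol="MonthlySales", predictionCol="prediction",
                               metricName="rmse").evaluate(predictions)
    mlflow.log_metric("rmse", rmse)
    mlflow.spark.log_model(model, "model")
```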
- Clone Repository
git clone https://github.com/Cesar312/retail-sales-data-lake-predictive-analytics.git
- Create Databricks Clusters
  - Standard cluster for the EDA and SQL queries
  - ML cluster for model training, experiment tracking, and deployment of the prediction model
- Upload Data to Databricks
  - Upload raw CSV files to `dbfs:/FileStore/salesdata/input/`
- Run the PySpark Notebook
  - Open `Retail-Sales-Data-Preparation.ipynb`
  - Execute cells to clean and publish data
- Train the Model
  - Open `Retail-Sales-Predictive-Analytics.ipynb`
  - Execute cells to train and track experiments
  - View the evaluation metrics in the MLflow UI
- Predict Future Sales
  - Load the trained model
  - Make predictions for new months (see the sketch below)
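A minimal sketch of that last step; the run ID is a placeholder to copy from the MLflow UI, and `new_months_df` is an assumed DataFrame carrying the same feature columns used in training.

```python
import mlflow.spark

# Placeholder URI; copy the actual run ID (or a registered model path) from the MLflow UI.
loaded_model = mlflow.spark.load_model("runs:/<run_id>/model")

# new_months_df is assumed to contain the same feature columns used during training (e.g. City, ReportMonth).
future_predictions = loaded_model.transform(new_months_df)
future_predictions.select("City", "ReportMonth", "prediction").show()
```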
- Automate model retraining on new data updates
- Test advanced models such as XGBoost or time-series forecasting approaches
- Create a backend API microservice to serve predictions by deploying the trained ML model retrieved from the Databricks MLflow registry (a possible sketch follows below)
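For that last improvement, one possible shape of such a microservice is sketched below using Flask and `mlflow.pyfunc`; the registered model name `retail-sales-monthly`, the request format, and the serving setup are all assumptions (Databricks Community Edition may require exporting the model artifacts first).

```python
import mlflow.pyfunc
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical registry URI; the model name and stage would come from the Databricks MLflow Model Registry.
model = mlflow.pyfunc.load_model("models:/retail-sales-monthly/Production")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON list of records with the same feature columns used during training.
    records = pd.DataFrame(request.get_json())
    predictions = model.predict(records)
    return jsonify([float(p) for p in predictions])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```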
This project is licensed under the MIT License -- see the LICENSE.txt file for details.
This project was developed and executed using Databricks Community Edition. The code and queries are shared for educational purposes. Databricks itself is a proprietary platform, and this repository does not include any proprietary Databricks software.