🚚 Transportation & Logistics Data Platform using Medallion Architecture
📘 Abstract
This project focuses on processing and analyzing transportation and logistics data using PySpark, MySQL, and Power BI, all orchestrated within the Microsoft Azure ecosystem. The objective is to build a scalable, automated data platform using the Medallion Architecture, from raw data ingestion to final dashboard reporting.
Data engineers or customers simply add new data files to the raw container; the entire pipeline is triggered automatically, and updated insights are reflected on the Power BI dashboard.
📑 Table of Contents
- Technologies Used
- Medallion Architecture Overview
- Platform Setup
- Azure Resource Group
- Azure Blob Storage (Raw)
- Azure Data Lake Gen2 (Bronze, Silver, Gold)
- Azure MySQL Database
- Azure Databricks
- Azure Data Factory
- Bronze Layer Implementation
- Silver Layer Implementation
- Gold Layer Implementation
- Power BI Dashboard
- Conclusion
🛠️ Technologies Used
- Microsoft Azure (Blob Storage, Data Lake Gen2, ADF, Databricks, MySQL)
- PySpark (Data Processing & Transformation)
- MySQL Workbench (SQL queries & schema creation)
- Azure Data Factory (Pipeline Automation)
- Power BI (Data Visualization & Reporting)
🧱 Medallion Architecture Overview
- Bronze Layer: Stores raw data (Parquet format) with audit logs.
- Silver Layer: Cleaned, structured, and enriched data.
- Gold Layer: Aggregated data ready for reporting and visualization.
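As an illustration, the three layers map naturally to a path convention in the Data Lake. A minimal sketch, assuming the container names above; the storage account name (`transportlake`) is hypothetical:

```python
# Minimal sketch of the Medallion layer layout in ADLS Gen2.
# The storage account name ("transportlake") is hypothetical; the
# bronze/silver/gold container names follow the architecture above.

def layer_path(layer: str, dataset: str, account: str = "transportlake") -> str:
    """Build an abfss:// path for a dataset in a given Medallion layer."""
    if layer not in ("bronze", "silver", "gold"):
        raise ValueError(f"unknown layer: {layer}")
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{dataset}"

print(layer_path("bronze", "delivery_data"))
# abfss://bronze@transportlake.dfs.core.windows.net/delivery_data
```

Keeping the layer name as the container makes every notebook's read/write target predictable from the layer it belongs to.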
⚙️ Platform Setup
- Azure Resource Group: contains all services (Storage Accounts, Data Factory, MySQL, and Databricks).
- Azure Blob Storage (Raw): container `raw/` containing the 4 source CSV files.
- Azure Data Lake Gen2: containers `bronze/`, `silver/`, and `gold/`.
- Azure MySQL Database: created a Flexible Server and connected it to MySQL Workbench.
- Azure Databricks: workspace `BizWorkspace`; notebook used for gold-layer aggregation.
- Azure Data Factory: all orchestration and pipeline automation.
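The automatic kick-off on new raw files is typically wired up with an ADF storage-event trigger. A hedged sketch of such a trigger definition; the trigger name, pipeline name, and blob path are placeholders, and the storage-account `scope` resource ID is omitted:

```json
{
  "name": "trg_raw_blob_created",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw/blobs/",
      "ignoreEmptyBlobs": true,
      "events": ["Microsoft.Storage.BlobCreated"]
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "pl_medallion_e2e",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```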
🥉 Bronze Layer Implementation
- Source: CSV files from Blob Storage
- Process: Copy activity adding audit columns (`ingestion_date`, `source_file`)
- Destination: Parquet format in the Bronze container
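In plain Python, the audit enrichment amounts to stamping each ingested record with its load date and originating file; the ADF Copy activity adds the same two columns at scale. A sketch with illustrative field values:

```python
from datetime import date

def add_audit_columns(record: dict, source_file: str) -> dict:
    """Mirror the Bronze-layer audit columns: ingestion_date and source_file."""
    return {
        **record,
        "ingestion_date": date.today().isoformat(),  # load date stamp
        "source_file": source_file,                  # originating CSV
    }

row = add_audit_columns({"delivery_id": 101}, "deliveries.csv")
print(sorted(row))  # the original field plus the two audit columns
```

These columns make every Bronze row traceable back to the exact file and day it arrived, which is what makes incremental debugging and reprocessing possible downstream.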
🥈 Silver Layer Implementation
- Filter nulls
- Select relevant features
- Join all 4 data sources
- Compute new columns: `fuel_consumed`, `processed_date`, `route_name`
- Type casting
- Output:
  - Parquet in the Silver container
  - MySQL table in `silver_db`
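The derived columns can be sketched in plain Python. The formulas here are assumptions for illustration only: `fuel_consumed` as distance divided by fuel efficiency, and `route_name` as an origin-destination label; the actual notebook's logic and column names may differ.

```python
from datetime import date

def enrich(row: dict) -> dict:
    """Hedged sketch of the Silver-layer computed columns (formulas assumed)."""
    out = dict(row)
    # Assumption: litres consumed = distance / vehicle efficiency (km per litre)
    out["fuel_consumed"] = round(row["distance_km"] / row["fuel_efficiency_kmpl"], 2)
    # Assumption: route label joins origin and destination
    out["route_name"] = f"{row['origin']} - {row['destination']}"
    out["processed_date"] = date.today().isoformat()
    return out

silver_row = enrich({"distance_km": 120.0, "fuel_efficiency_kmpl": 8.0,
                     "origin": "Pune", "destination": "Mumbai"})
print(silver_row["fuel_consumed"], silver_row["route_name"])
# 15.0 Pune - Mumbai
```

In the actual pipeline these would be PySpark `withColumn` expressions applied after the four-way join, so they run once over the full joined dataset.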
🥇 Gold Layer Implementation
- Read the `silver_db.delivery_data_silver` table
- Perform aggregations:
  - Route Optimization
  - Fleet Performance
  - Driver Performance
- Join and prepare the final Gold table
- Write to:
  - MySQL table `gold_db.transportation_gold`
  - `gold/` container as Parquet, with a logging file
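One of the Gold-layer aggregations can be sketched as a plain-Python group-and-average; the PySpark equivalent would be a `groupBy().agg(avg(...))`. Column names (`route_name`, `delivery_time_hours`) are assumptions for illustration:

```python
from collections import defaultdict

def avg_delivery_time_by_route(rows):
    """Average delivery time per route (one route-optimization metric)."""
    totals = defaultdict(lambda: [0.0, 0])  # route -> [sum_hours, count]
    for r in rows:
        acc = totals[r["route_name"]]
        acc[0] += r["delivery_time_hours"]
        acc[1] += 1
    return {route: s / n for route, (s, n) in totals.items()}

data = [
    {"route_name": "Pune - Mumbai", "delivery_time_hours": 3.0},
    {"route_name": "Pune - Mumbai", "delivery_time_hours": 5.0},
    {"route_name": "Delhi - Agra",  "delivery_time_hours": 4.0},
]
print(avg_delivery_time_by_route(data))
# {'Pune - Mumbai': 4.0, 'Delhi - Agra': 4.0}
```

The fleet- and driver-performance aggregations follow the same group-then-summarize shape, just keyed on vehicle and driver instead of route.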
📊 Power BI Dashboard
Connected to `gold_db.transportation_gold` via the MySQL connector from the Azure server.
- KPIs:
  - Total Deliveries
  - Average Delivery Time
  - Fuel Efficiency
  - Driver Performance
  - Delivery Status
- Slicers:
  - Delivery Status
  - Route Name
  - Driver Name
- Charts:
  - Route Optimization (Line)
  - Fleet Performance (Bar)
  - Driver Performance (Scatter)
  - Delivery Status Distribution (Pie)
  - Time-based Trends (Line)
    - Filtered by Deliveries Completed
    - Filtered by Route & Driver Name
✅ Conclusion
This end-to-end data engineering project demonstrates the power of automated pipelines using the Medallion Architecture. From raw CSVs to a dynamic business dashboard, the system is designed to scale and adapt with minimal user input.
Once raw data is uploaded, everything from processing to visualization is triggered automatically, showcasing the synergy of Azure + PySpark + Power BI.