This repository showcases a complete data engineering workflow on Azure for the NYC Green Taxi dataset:
- Bronze Layer: Ingest raw Parquet files from the NYC TLC API into the bronze container
- Silver Layer: Clean, dedupe, and enrich with PySpark in Databricks
- Gold Layer: Persist transformed data as Delta tables, ready for BI tools
Source
- NYC TLC Green Taxi trip records (Parquet) for Jan–Dec 2024
- Data extracted via Azure Data Factory from HTTP API into ADLS Gen2
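The monthly files follow a predictable URL pattern on the TLC's CDN. A minimal sketch of how the twelve 2024 file URLs can be constructed (the host and file-name convention below reflect the public NYC TLC trip-record page; verify against the current site before relying on them):

```python
# Build the twelve monthly Green Taxi Parquet URLs for 2024.
# Host and file-name pattern are assumptions based on the public
# NYC TLC trip-record data page; verify before use.
BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"

urls = [f"{BASE}/green_tripdata_2024-{month:02d}.parquet" for month in range(1, 13)]
# e.g. "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-01.parquet"
```

An ADF ForEach over a list like this, passed in as a pipeline parameter, is one common way to drive the monthly Copy Data activities.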
Orchestration
- Parameterized ADF pipelines: Copy Data, ForEach, If Condition
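The pipeline definitions themselves live in ADF, but runs can also be triggered programmatically. A hedged sketch using the `azure-mgmt-datafactory` SDK (the subscription, resource group, factory, pipeline name, and parameters are all placeholders, not taken from this repo):

```python
# Trigger a parameterized ADF pipeline run from Python.
# All names and IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="rg-nyc-taxi",     # placeholder
    factory_name="adf-nyc-taxi",           # placeholder
    pipeline_name="pl_ingest_green_taxi",  # placeholder
    parameters={"year": "2024"},           # consumed by the ForEach / If Condition
)
print(run.run_id)
```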
Storage
- Raw data stored in the `bronze/` container as Parquet files
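A sketch of how the containers are typically addressed from Spark via `abfss://` URIs (the storage account name and folder layout are illustrative assumptions):

```python
# ADLS Gen2 paths: abfss://<container>@<account>.dfs.core.windows.net/<path>
# The account name and folder layout below are placeholders.
ACCOUNT = "nyctaxistorage"  # placeholder storage account

bronze_path = f"abfss://bronze@{ACCOUNT}.dfs.core.windows.net/green_taxi/2024/"
silver_path = f"abfss://silver@{ACCOUNT}.dfs.core.windows.net/green_taxi/"
gold_path   = f"abfss://gold@{ACCOUNT}.dfs.core.windows.net/green_taxi/"
```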
Compute
- Azure Databricks PySpark notebooks
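Before the notebooks can read ADLS Gen2 paths, the cluster needs credentials for the storage account. A minimal sketch using service-principal OAuth (these are standard Hadoop ABFS settings; `spark` and `dbutils` are the Databricks notebook globals, and every ID, name, and secret scope below is a placeholder):

```python
# Authenticate Databricks to ADLS Gen2 with a service principal (OAuth).
# All IDs, names, and secret scopes are placeholders.
account = "nyctaxistorage.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<app-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}",
               dbutils.secrets.get(scope="kv-scope", key="sp-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```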
Transformations
- Schema enforcement and type casting
- Split and normalize multi‑valued fields (e.g., payment types)
- Remove duplicate records; filter out bad mileage/fare entries
- Impute or drop nulls based on domain rules (a condensed PySpark sketch of these steps follows)
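A condensed PySpark sketch of the steps above (column names follow the TLC Green Taxi schema; the filter thresholds and imputation defaults are illustrative assumptions, and the payment-type labels come from the TLC data dictionary):

```python
from pyspark.sql import functions as F

# Read raw bronze Parquet (bronze_path from the Storage sketch above).
df = spark.read.parquet(bronze_path)

# Schema enforcement / type casting: cast key columns explicitly.
df = (df
      .withColumn("lpep_pickup_datetime",  F.col("lpep_pickup_datetime").cast("timestamp"))
      .withColumn("lpep_dropoff_datetime", F.col("lpep_dropoff_datetime").cast("timestamp"))
      .withColumn("trip_distance", F.col("trip_distance").cast("double"))
      .withColumn("fare_amount",   F.col("fare_amount").cast("double"))
      .withColumn("payment_type",  F.col("payment_type").cast("int")))

# Deduplicate and drop bad mileage/fare entries (thresholds are assumptions).
df = (df.dropDuplicates()
        .filter((F.col("trip_distance") > 0) & (F.col("fare_amount") >= 0)))

# Null handling: drop rows missing essential fields, impute the rest.
df = (df.dropna(subset=["lpep_pickup_datetime", "lpep_dropoff_datetime"])
        .fillna({"passenger_count": 1}))

# Normalize payment_type codes into readable labels (TLC data dictionary).
payment_map = F.create_map(
    F.lit(1), F.lit("Credit card"), F.lit(2), F.lit("Cash"),
    F.lit(3), F.lit("No charge"),   F.lit(4), F.lit("Dispute"))
df = df.withColumn("payment_type_desc", payment_map[F.col("payment_type")])
```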
Output
- Cleaned, enriched tables stored in the `silver/` container
- Curated Delta tables stored in the `gold/` container, ready for BI tools
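Continuing from the transformation sketch, a hedged example of the silver and gold writes (paths from the Storage sketch above; the daily-aggregate gold table is an illustrative assumption, not the repo's actual model):

```python
from pyspark.sql import functions as F

# Silver: cleaned, enriched trips as Delta.
df.write.format("delta").mode("overwrite").save(silver_path)

# Gold: an example BI-ready aggregate (daily trip counts and fares).
daily = (df.groupBy(F.to_date("lpep_pickup_datetime").alias("trip_date"))
           .agg(F.count("*").alias("trips"),
                F.sum("fare_amount").alias("total_fares")))
daily.write.format("delta").mode("overwrite").save(gold_path)
```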
| Component | Purpose |
|---|---|
| Azure Data Factory (ADF) | Data ingestion & orchestration |
| Azure Data Lake Storage Gen2 | Scalable storage for Delta tables |
| Azure Databricks | Spark ETL, Delta Live Tables, Workflows |
| Delta Lake | ACID-compliant, performant data format |
| PySpark | Transformation logic |