An end‑to‑end data engineering pipeline for NYC Green Taxi trip records, built on Microsoft Azure. This project ingests Jan–Dec 2024 Parquet files from the NYC Taxi API into a Bronze Delta Lake layer, cleans and enriches the data in a Silver layer with PySpark on Azure Databricks, then saves the transformed data to the Gold layer in Delta format.

jotstolu/Azure-Data-Engineering-End-to-End-Project---NYC-taxi-dataset

Azure-Data-Engineering-End-to-End-Project - NYC-taxi-dataset

This repository showcases a complete data engineering workflow on Azure for the NYC Green Taxi dataset:

  • Bronze Layer: Ingest raw Parquet files from the NYC TLC API into the bronze container
  • Silver Layer: Clean, deduplicate, and enrich with PySpark in Databricks
  • Gold Layer: Save the transformed data in Delta format

PROJECT ARCHITECTURE

[Image: project architecture diagram]

Phase 1: Bronze (Raw Ingestion)
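
As a rough illustration of the ingestion step, the sketch below builds the list of twelve monthly Parquet files for 2024 that an orchestration loop (for example an ADF ForEach activity) could iterate over. The base URL, the `green_tripdata_YYYY-MM.parquet` naming, and the `bronze/green_taxi/...` folder layout are assumptions made for the example; this README does not state them.

```python
from datetime import date

# Assumption: base URL of the public TLC trip-data host (not given in this README).
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def green_taxi_files(year: int = 2024) -> list[dict[str, str]]:
    """Return one entry per month with the source URL and a bronze target path."""
    entries = []
    for month in range(1, 13):
        stamp = date(year, month, 1).strftime("%Y-%m")
        # Assumption: monthly files follow the green_tripdata_YYYY-MM.parquet pattern.
        file_name = f"green_tripdata_{stamp}.parquet"
        entries.append(
            {
                "file_name": file_name,
                "source_url": f"{BASE_URL}/{file_name}",
                # Hypothetical folder layout inside the bronze container.
                "bronze_path": f"bronze/green_taxi/{year}/{file_name}",
            }
        )
    return entries


if __name__ == "__main__":
    for entry in green_taxi_files():
        print(entry["source_url"], "->", entry["bronze_path"])
```

A list like this can be passed to the pipeline as a parameter so that each copy activity only needs the file name.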

Phase 2: Silver (Cleansing & Enrichment)

  • Compute
    • Azure Databricks PySpark notebooks
  • Transformations
    • Schema enforcement and type casting
    • Split and normalize multi‑valued fields (e.g., payment types)
    • Remove duplicate records; filter out bad mileage/fare entries
    • Impute or drop nulls based on domain rules
  • Output
    • Cleaned, enriched tables in the silver/ container
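
The transformations listed above could look roughly like the following PySpark sketch. It is not the notebook code from this repository: the column names are assumed to follow the public TLC green-taxi schema, and the thresholds and the default passenger count are illustrative domain rules. Splitting multi-valued fields is left out because the README does not describe their format.

```python
# Assumption: column names follow the public TLC green-taxi schema;
# the README does not list them.
SILVER_TYPES = {
    "VendorID": "int",
    "lpep_pickup_datetime": "timestamp",
    "lpep_dropoff_datetime": "timestamp",
    "passenger_count": "int",
    "trip_distance": "double",
    "fare_amount": "double",
    "payment_type": "int",
}

# Rows missing either timestamp are dropped rather than imputed (illustrative rule).
REQUIRED = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]


def to_silver(bronze_df):
    """Cast columns, drop duplicates, filter bad rows, and handle nulls."""
    # Imported lazily so this module also loads where PySpark is not installed.
    from pyspark.sql import functions as F

    # Schema enforcement: keep only the known columns, cast to the target types.
    typed = bronze_df.select(
        *[F.col(name).cast(dtype).alias(name) for name, dtype in SILVER_TYPES.items()]
    )
    return (
        typed.dropDuplicates()
        # Illustrative thresholds: keep trips with positive distance and fare.
        .filter((F.col("trip_distance") > 0) & (F.col("fare_amount") > 0))
        .dropna(subset=REQUIRED)
        # Hypothetical imputation rule: assume one passenger when the count is missing.
        .fillna({"passenger_count": 1})
    )
```

In a Databricks notebook this would be called as `silver_df = to_silver(spark.read.parquet(bronze_path))`, where `bronze_path` is whatever location the bronze files were landed in.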

Phase 3: Gold (Quality & Aggregation)

  • Transformed data is stored as Delta tables in the gold/ container, ready for BI tools
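
A minimal sketch of the Gold write, assuming the Silver output is already available as a Spark DataFrame and that the cluster has access to the storage account. The storage account name, the table name `green_taxi_trips`, and the overwrite mode are assumptions for the example, not details taken from this repository.

```python
def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an ADLS Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def write_gold(silver_df, account: str, table: str = "green_taxi_trips") -> str:
    """Write the DataFrame to the gold container in Delta format and return its location."""
    # Hypothetical table name; the container name "gold" comes from the README.
    target = abfss_uri("gold", account, table)
    # Overwrite is an assumption; an incremental pipeline might append or merge instead.
    silver_df.write.format("delta").mode("overwrite").save(target)
    return target
```

BI tools or downstream notebooks can then read the table back with `spark.read.format("delta").load(target)`.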

Technology Stack

| Component | Purpose |
| --- | --- |
| Azure Data Factory (ADF) | Data ingestion & orchestration |
| Azure Data Lake Storage Gen2 | Scalable storage for Delta tables |
| Azure Databricks | Spark ETL, Delta Live Tables, Workflows |
| Delta Lake | ACID‑compliant, performant data format |
| PySpark | Transformation logic |
