Skip to content

Sivaprasad-V/Tokyo-Olympics-Azure-Data-Engineering-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

16 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Tokyo-Olympics-Azure-Data-Engineering-Project


🌟 Project Highlights

This project implements a comprehensive ETL pipeline leveraging Microsoft Azure services to process and analyze data from the Tokyo 2021 Olympics. The goal is to explore Azure tools while building a robust framework for data ingestion, transformation, storage, and visualization. This learning-oriented project lays the foundation for scalable data engineering workflows.

image


πŸ“‚ Dataset Overview

The dataset includes five CSV files related to the Tokyo 2021 Olympics:

  • Athletes: Information on participants.
  • Coaches: Details of coaching staff.
  • EntriesGender: Gender-based participant statistics.
  • Medals: Medal distribution by event and country.
  • Teams: Information on participating teams.

πŸ› οΈ Azure Services Utilized

1. Azure Data Factory (ADF)

  • Facilitates data ingestion pipelines.
  • Transfers raw data from GitHub to Azure Data Lake Storage Gen2 (ADLS Gen2).

2. Azure Data Lake Storage Gen2 (ADLS Gen2)

  • Centralized storage for raw and transformed data.
  • Organized into folders: raw-data and transformed-data.

3. Azure Databricks

  • An Apache Spark-powered platform for processing and transforming data.
  • Handles schema validation, data cleaning, and transformations.

4. Azure Synapse Analytics

  • A data warehousing solution for structured analysis.
  • Hosts transformed data in SQL pools for querying and analysis.

5. Power BI

  • A visualization tool for creating interactive dashboards.
  • Planned integration for insights such as medal distributions, gender participation, and team performance.

πŸš€ ETL Workflow

Step 1: Data Ingestion

  • Used ADF pipelines to import raw data from GitHub.

  • Stored data in the raw-data folder of ADLS Gen2.

    image

    image

Step 2: Data Transformation

  • Processed data in Azure Databricks using PySpark:
    • Loaded CSV files into DataFrames.

    • Conducted schema validation and handled missing values.

    • Stored cleaned data in the transformed-data folder of ADLS Gen2.

      image image image image

Step 3: Data Warehousing

  • Linked transformed data to Azure Synapse Analytics.

  • Created structured tables in a SQL pool for querying and analysis.

    image

Step 4: Visualization (Planned)

  • Initiating Power BI integration for interactive dashboards.
  • Expected reports include insights on medal distributions, participation trends, and team performances.

πŸ’‘ Key Takeaways

  • Integrated Solution: Seamless use of Azure services for efficient data processing.
  • Scalability: Design ensures performance at scale with ADLS Gen2 and Synapse Analytics.
  • Efficiency: PySpark in Databricks ensures fast and reliable data transformations.
  • Analytics-Ready Data: Outputs are prepared for immediate querying and visualization.

πŸ”§ Project Implementation

Requirements

  • An active Microsoft Azure account.
  • Familiarity with ADF, ADLS Gen2, Databricks, Synapse Analytics, and Power BI.
  • Access to the Tokyo 2021 Olympics dataset.

Setup Instructions

  1. Clone this repository.
  2. Configure Azure resources as described in the ETL workflow.
  3. Use the provided PySpark scripts for data processing in Databricks.
  4. Link transformed data to Synapse Analytics for querying.
  5. (Optional) Create Power BI dashboards for visualization.

✍️ Potential Enhancements

Production-Ready Additions

  • Incorporate robust monitoring and logging in ADF pipelines.
  • Implement Role-Based Access Control (RBAC) for data security.
  • Optimize transformations for better performance.

Advanced Transformations

  • Conduct outlier detection and advanced data cleaning.
  • Apply feature engineering for richer analytics.

Dashboard Completion

  • Finalize Power BI dashboards for actionable insights.
  • Add detailed visualizations for medal counts, participation trends, and team performance.

πŸ“ Conclusion

This project highlights the potential of Microsoft Azure services for building efficient and scalable ETL pipelines. With minor enhancements, it can transition from a learning exercise to a production-ready solution.


πŸ‘€ Author

Feel free to contribute or suggest improvements!



πŸŽ₯ Credits

This project was inspired and guided by the excellent tutorial series created by Darshil Parmar.
Check out his YouTube channel for the full tutorial series: Darshil Parmar - YouTube.

Tutorial video for this project: Tokyo 2021 Olympics ETL Pipeline Tutorial.


Releases

No releases published

Packages

No packages published