This project implements an ETL pipeline on Microsoft Azure that ingests, transforms, stores, and visualizes data from the Tokyo 2021 Olympics. It is a learning-oriented project: the goal is to explore the Azure tooling end to end and to lay a foundation for scalable data engineering workflows.
The dataset includes five CSV files related to the Tokyo 2021 Olympics:
- Athletes: Information on participants.
- Coaches: Details of coaching staff.
- EntriesGender: Gender-based participant statistics.
- Medals: Medal distribution by event and country.
- Teams: Information on participating teams.
- Azure Data Factory (ADF): Facilitates data ingestion pipelines.
- Transfers raw data from GitHub to Azure Data Lake Storage Gen2 (ADLS Gen2).
- Azure Data Lake Storage Gen2 (ADLS Gen2): Centralized storage for raw and transformed data.
- Organized into folders: `raw-data` and `transformed-data`.
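The folder layout above can be addressed from Spark with `abfss://` URIs. The snippet below is a minimal sketch of how such paths might be built; the storage account name, container name, and lowercase `.csv` file names are assumptions for illustration, not values taken from this project.

```python
# Hypothetical names: replace with the storage account and container you create.
STORAGE_ACCOUNT = "tokyoolympicsdata"
CONTAINER = "tokyo-olympic-data"

# The five datasets described above.
DATASETS = ["Athletes", "Coaches", "EntriesGender", "Medals", "Teams"]


def adls_path(folder, dataset):
    """Build the abfss:// URI for one dataset inside the given folder."""
    # The file naming (lowercase + .csv) is an assumption.
    return (
        f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"
        f"{folder}/{dataset.lower()}.csv"
    )


raw_paths = {name: adls_path("raw-data", name) for name in DATASETS}
transformed_paths = {name: adls_path("transformed-data", name) for name in DATASETS}
```

Keeping the path logic in one helper means the raw and transformed locations cannot drift apart when a folder or account is renamed.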
- Azure Databricks: An Apache Spark-powered platform for processing and transforming data.
- Handles schema validation, data cleaning, and transformations.
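As a rough illustration of the schema validation and cleaning step, the sketch below compares the column types Spark infers against an expected mapping and then drops duplicate rows. The column names for EntriesGender and the helper names `schema_problems` and `load_and_clean` are assumptions for this example; the actual notebooks in this project may differ.

```python
# Assumed schema for the EntriesGender file; adjust to the real header row.
EXPECTED_ENTRIES_GENDER = {
    "Discipline": "string",
    "Female": "int",
    "Male": "int",
    "Total": "int",
}


def schema_problems(expected, actual):
    """Compare two {column: type} mappings and describe every mismatch."""
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"{name}: expected {dtype}, found {actual[name]}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


def load_and_clean(spark, path, expected):
    """Read a CSV with Spark, validate its schema, and drop duplicate rows.

    `spark` is an existing SparkSession, such as the one a Databricks
    notebook provides.
    """
    df = spark.read.csv(path, header=True, inferSchema=True)
    # df.dtypes is a list of (column, type) pairs such as ("Female", "int").
    problems = schema_problems(expected, dict(df.dtypes))
    if problems:
        raise ValueError("; ".join(problems))
    return df.dropDuplicates()
```

In a notebook this might be called as `load_and_clean(spark, path, EXPECTED_ENTRIES_GENDER)`. Failing fast on a schema mismatch keeps bad files out of the transformed folder.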
- Azure Synapse Analytics: A data warehousing solution for structured analysis.
- Hosts transformed data in SQL pools for querying and analysis.
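To give a feel for the kind of query the SQL pool is meant to serve, here is a small sketch that ranks teams by medals. It uses an in-memory SQLite database as a local stand-in for Synapse, and the table name, column names, and rows are made-up placeholders rather than real Olympic results.

```python
import sqlite3

# Local stand-in for a Synapse SQL pool; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE medals (team TEXT, gold INTEGER, silver INTEGER, bronze INTEGER)"
)
# Placeholder rows for illustration only, not real medal counts.
conn.executemany(
    "INSERT INTO medals VALUES (?, ?, ?, ?)",
    [("Team A", 2, 1, 0), ("Team B", 1, 3, 2), ("Team C", 0, 0, 1)],
)

# Rank teams by gold medals first, then by total medals.
QUERY = """
SELECT team, gold + silver + bronze AS total_medals
FROM medals
ORDER BY gold DESC, total_medals DESC
"""
rows = conn.execute(QUERY).fetchall()
conn.close()
```

The same `SELECT` shape should carry over to a Synapse table once the transformed medals data is loaded, though the SQL dialect details there may differ.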
- Power BI: A visualization tool for creating interactive dashboards.
- Planned integration for insights such as medal distributions, gender participation, and team performance.
- Processed data in Azure Databricks using PySpark.
- Linked transformed data to Azure Synapse Analytics.
- Created structured tables in a SQL pool for querying and analysis.
- Power BI integration for interactive dashboards is in progress.
- Expected reports include insights on medal distributions, participation trends, and team performances.
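A participation report needs a derived metric such as the share of female entries per discipline. The sketch below shows one way this might be computed, first as a plain Python rule and then as the equivalent PySpark column. The `Female` and `Total` column names and both function names are assumptions for illustration.

```python
def female_share(female, total):
    """Return the fraction of female entries, or None when total is missing or zero."""
    if not total:
        return None
    return female / total


def add_female_share(df):
    """Apply the same rule to a Spark DataFrame, adding a FemaleShare column."""
    # Imported lazily so the plain-Python helper works without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(
        "FemaleShare",
        # when() without otherwise() yields null for rows where Total is not positive.
        F.when(F.col("Total") > 0, F.col("Female") / F.col("Total")),
    )
```

Computing the ratio before the data reaches Power BI keeps the dashboard logic simple, at the cost of one more column in the transformed table.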
- Integrated Solution: Every stage, from ingestion to visualization, runs on Azure services that connect to one another directly.
- Scalability: ADLS Gen2 and Synapse Analytics are designed to scale with data volume, so the same layout can handle larger datasets.
- Efficiency: PySpark in Databricks distributes transformations across a Spark cluster.
- Analytics-Ready Data: Outputs are stored in a form that can be queried and visualized without further preparation.
- An active Microsoft Azure account.
- Familiarity with ADF, ADLS Gen2, Databricks, Synapse Analytics, and Power BI.
- Access to the Tokyo 2021 Olympics dataset.
- Clone this repository.
- Configure Azure resources as described in the ETL workflow.
- Use the provided PySpark scripts for data processing in Databricks.
- Link transformed data to Synapse Analytics for querying.
- (Optional) Create Power BI dashboards for visualization.
- Incorporate robust monitoring and logging in ADF pipelines.
- Implement Role-Based Access Control (RBAC) for data security.
- Optimize transformations for better performance.
- Conduct outlier detection and advanced data cleaning.
- Apply feature engineering for richer analytics.
- Finalize Power BI dashboards for actionable insights.
- Add detailed visualizations for medal counts, participation trends, and team performance.
This project shows how Microsoft Azure services can be combined into an efficient and scalable ETL pipeline. With the enhancements listed above, such as monitoring, access control, and finished dashboards, it can move from a learning exercise toward a production-ready solution.
- Name: Sivaprasad V
- LinkedIn: https://www.linkedin.com/in/sivaprasad-v-213011213?utm_source=share&utm_campaign=share_via&utm_content=profile&utm_medium=android_app
Feel free to contribute or suggest improvements!
This project was inspired and guided by the excellent tutorial series created by Darshil Parmar.
Check out his YouTube channel for the full tutorial series: Darshil Parmar - YouTube.
Tutorial video for this project: Tokyo 2021 Olympics ETL Pipeline Tutorial.