Welcome to my Azure End-to-End Data Engineering Project! In this project, I used various Azure tools and technologies to ingest, transform, and serve data efficiently. Let's dive into what I've built, step by step.
This project handles data in three major steps:
- Ingestion: Collect raw data and store it.
- Transformation: Clean, process, and transform the data.
- Serving: Save the transformed data in a serving layer for further analysis.
I used Azure Data Factory, Azure Data Lake Storage (Gen2), Databricks, and Delta Lake for this project.
Here's the overall architecture of the project:

Source Data
- I collected data (in this case, taxi trip data) from an API.
- API link: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- File link: https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet

Ingestion Layer
- I ingested the raw data into Azure Data Lake Gen2.
- I stored the data in Parquet format for efficient storage and processing.
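
To make the ingestion step concrete, here is a minimal Python sketch of what the Data Factory Copy activity does conceptually: download the source Parquet file and land it unchanged in the bronze layer. The connection string, container name, and blob name below are placeholders I chose for illustration, not values from the actual pipeline.

```python
# Illustrative sketch only - the real ingestion is an ADF Copy activity.
# The connection string, container, and blob name are placeholders (assumptions).
import requests
from azure.storage.blob import BlobClient

SOURCE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet"
CONN_STR = "<your-storage-connection-string>"

# Download the raw Parquet file from the NYC TLC CDN.
response = requests.get(SOURCE_URL, timeout=60)
response.raise_for_status()

# Land it as-is in the bronze layer of the data lake.
blob = BlobClient.from_connection_string(
    conn_str=CONN_STR,
    container_name="bronze",
    blob_name="green_tripdata_2023-01.parquet",
)
blob.upload_blob(response.content, overwrite=True)
```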

Transformation Layer
- I used Azure Databricks to clean and transform the data.
- I converted the raw data into a meaningful format using Spark and saved it back to Azure Data Lake Gen2.
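
As a rough illustration of the Databricks step, here is a small PySpark sketch that reads the raw Parquet from the bronze layer, applies a couple of basic cleaning rules, and writes the result back to the lake. The paths and the specific filters are assumptions; the column names come from the public green taxi schema, and `spark` is the session Databricks provides in a notebook.

```python
# Minimal PySpark sketch of the transformation step (paths and rules are assumptions).
# `spark` is the SparkSession predefined in a Databricks notebook.
from pyspark.sql import functions as F

bronze_path = "/mnt/datalake/bronze/green_tripdata_2023-01.parquet"  # placeholder path
silver_path = "/mnt/datalake/silver/green_tripdata"                  # placeholder path

raw_df = spark.read.parquet(bronze_path)

clean_df = (
    raw_df
    # Drop obviously invalid trips (columns follow the green taxi schema).
    .filter((F.col("trip_distance") > 0) & (F.col("fare_amount") >= 0))
    # Derive a trip date column for easier downstream grouping.
    .withColumn("trip_date", F.to_date("lpep_pickup_datetime"))
    .dropDuplicates()
)

clean_df.write.mode("overwrite").parquet(silver_path)
```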

Serving Layer
- The cleaned, transformed data was saved as a Delta Table in the serving layer for analytics and reporting.
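
A minimal sketch of this step, assuming the placeholder silver output from the sketch above and a placeholder gold path and table name, might look like this in a Databricks notebook:

```python
# Sketch of writing the final Delta Table to the gold layer (paths/table name assumed).
gold_path = "/mnt/datalake/gold/green_trips"  # placeholder path

clean_df = spark.read.parquet("/mnt/datalake/silver/green_tripdata")

# Save as Delta and register a table so it can be queried by name.
clean_df.write.format("delta").mode("overwrite").save(gold_path)
spark.sql(f"CREATE TABLE IF NOT EXISTS green_trips USING DELTA LOCATION '{gold_path}'")
```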

Security
- I ensured secure access to all Azure services using Azure Active Directory and role-based access control (RBAC).

- I used Azure Data Factory to extract data from an external API.
- The data was stored as raw Parquet files in the bronze folder of Data Lake Gen2.
- I used Databricks for data processing and transformation.
- The tasks included cleaning the raw data and converting it into a meaningful format using Spark.
- I converted the transformed data into Delta Tables for optimized storage and performance.
- The data was stored in the gold folder of Data Lake Gen2.
- Delta Tables make it easy to query and perform analytics on the data.
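
As an example of how the gold layer can be queried, here is a hypothetical aggregation in a Databricks notebook; the table name `green_trips` and the columns follow the earlier sketches and are assumptions, not the project's actual names.

```python
# Example analytics query against the gold Delta Table (table and column names assumed).
daily_summary = spark.sql("""
    SELECT trip_date,
           COUNT(*)                   AS trips,
           ROUND(AVG(fare_amount), 2) AS avg_fare
    FROM green_trips
    GROUP BY trip_date
    ORDER BY trip_date
""")
daily_summary.show()
```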
- I applied Azure Security features to protect the data.
- Tools used:
- Azure Active Directory for authentication.
- Role-Based Access Control (RBAC) for authorization.
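
One common way to wire this up from Databricks is OAuth with an Azure AD service principal; the sketch below shows the standard ABFS Spark settings with placeholder values. It is only one possible setup, and the IDs, secret scope, and storage account name are assumptions, not values from this project.

```python
# Sketch: authenticate Databricks to ADLS Gen2 with an Azure AD service principal.
# All names, IDs, and the secret scope below are placeholders (assumptions).
storage_account = "<storage-account>"
tenant_id = "<tenant-id>"
client_id = "<app-client-id>"
client_secret = dbutils.secrets.get(scope="<scope>", key="<secret-key>")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
```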

| Tool/Service | Purpose |
|---|---|
| Azure Data Factory | Ingesting data from APIs. |
| Azure Data Lake Gen2 | Storage for raw, transformed, and served data. |
| Databricks | Processing and transforming the data using Spark. |
| Parquet | Storage format for raw and transformed data. |
| Delta Lake | Optimized format for serving clean data. |
| Azure Active Directory | Security and access management. |
Here's how I organized my data in Azure Data Lake Gen2:
DataLakeGen2/
│
├── bronze/ <- Raw data in Parquet format
│
├── silver/ <- Cleaned and transformed data
│
└── gold/ <- Final Delta Tables for analytics
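
In Databricks, these folders are typically referenced with `abfss://` URIs; a small sketch with placeholder container and account names (assumptions, not the project's actual names):

```python
# Placeholder container/account names - adjust to your own storage account.
account = "<storage-account>"
container = "<container>"
base = f"abfss://{container}@{account}.dfs.core.windows.net"

bronze_path = f"{base}/bronze"  # raw Parquet files
silver_path = f"{base}/silver"  # cleaned and transformed data
gold_path   = f"{base}/gold"    # final Delta Tables
```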

Dynamic Pipeline in Data Factory
I created a dynamic pipeline that routes data ingestion based on certain conditions. For example:
- Copy Data 1 for small numbers.
- Copy Data 2 for large numbers.
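
The pipeline JSON isn't reproduced here, but the routing decision made by the If Condition is easy to illustrate. The Python sketch below mirrors that logic; the parameter name and threshold are purely assumptions for illustration.

```python
# Illustration of the If Condition routing (parameter name and threshold are assumptions).
def choose_copy_activity(value: int, threshold: int = 100) -> str:
    """Route small values to 'Copy Data 1' and large values to 'Copy Data 2'."""
    return "Copy Data 1" if value < threshold else "Copy Data 2"

print(choose_copy_activity(7))    # Copy Data 1
print(choose_copy_activity(512))  # Copy Data 2
```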

Efficient Transformation
Using Databricks, I transformed raw data into meaningful, clean data quickly and efficiently.

Serving Layer with Delta Tables
The final output is stored as Delta Tables, which provide fast query performance and are perfect for analytics.

Security First
All data movement and storage are secured using Azureβs built-in security features.
Follow these steps to run the project:
1. Set up Azure Resources:
   - Create Azure Data Lake Gen2, Azure Databricks, and Azure Data Factory.
2. Ingest Data:
   - Use Azure Data Factory to collect data and store it in Data Lake Gen2.
3. Transform Data:
   - Use Databricks to clean and transform the data.
4. Store Data:
   - Write the transformed data as Delta Tables to the serving layer.
5. Analyze the Data:
   - Query the Delta Tables for reporting and insights.
This project shows how you can build an end-to-end data engineering solution using Azure tools. It covers everything from ingesting data to transforming it and saving it securely for analysis.
Feel free to clone this project and try it on your own. Let me know if you have any questions or suggestions!
- Add more transformations in Databricks.
- Use Power BI or Synapse Analytics to visualize the data.
- Automate pipelines further using triggers.