Azure Data Pipeline - (Databricks - Azure Data Factory - Azure Data Lake - Azure Blob Storage - Python) - Earthquake Events and Risks Project

Ingested data using Python within Databricks, pulling it from an API endpoint into the bronze layer of a medallion architecture, which lives in an Azure Data Lake Storage container within a storage account.

Then used Databricks to transform the raw data from the bronze container into the silver container, and then into the gold container, ready to be used directly for business purposes.
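As an illustration of the bronze-to-silver step, here is a minimal PySpark sketch; the paths, storage account name, and column names are placeholders for illustration, not the project's actual values:

```python
# Minimal bronze -> silver sketch (illustrative only; paths and columns are assumptions).
# `spark` is the SparkSession provided by the Databricks notebook environment.
bronze_path = "abfss://bronze@<storageaccountname>.dfs.core.windows.net/earthquake_events/"
silver_path = "abfss://silver@<storageaccountname>.dfs.core.windows.net/earthquake_events/"

raw_df = spark.read.parquet(bronze_path)

clean_df = (
    raw_df
    .dropDuplicates(["id"])                     # remove duplicate events
    .na.drop(subset=["latitude", "longitude"])  # drop rows missing coordinates
)

clean_df.write.mode("overwrite").parquet(silver_path)
```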

I have then brought these notebooks into Azure Data Factory, extracted variables out of the notebooks as Data Factory parameters, and orchestrated the pipeline from Data Factory as well.
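The usual way to hand values between Data Factory and a Databricks notebook is through notebook widgets and the notebook exit value. A minimal sketch follows; the widget names "start_date" and "end_date" are assumptions for illustration, not necessarily the parameters used in this project:

```python
import json

# Read parameters passed in from the Azure Data Factory Databricks notebook activity.
# Widget names are illustrative assumptions.
dbutils.widgets.text("start_date", "")
dbutils.widgets.text("end_date", "")

start_date = dbutils.widgets.get("start_date")
end_date = dbutils.widgets.get("end_date")

# ... ingestion / transformation logic ...

# Return values to Data Factory so downstream activities can read them
# from the notebook activity's runOutput.
dbutils.notebook.exit(json.dumps({"start_date": start_date, "end_date": end_date}))
```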

Resource Setup

This section covers the initial setup of the necessary Azure resources: creating the Databricks workspace, setting up the storage account, creating the containers, and setting up Synapse Analytics.

Creating Databricks Workspace

In Azure, create a new Databricks workspace, choosing the premium pricing tier for this personal project. Review and create the resource, then wait for deployment.

Image

Setting Up Storage Account

Create a storage account while the Databricks workspace is deploying. In the Basics tab, configure the settings, and in the Advanced tab, enable the hierarchical namespace. Review and create, then wait for deployment. Note the additional resource group managed by Databricks.
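If you would rather script this step than click through the portal, a sketch using the azure-mgmt-storage Python SDK is shown below; the subscription ID, resource group, account name, and region are placeholders, not the project's actual values:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

# Assumptions: placeholder subscription, resource group, account name, and region.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
account_name = "<storageaccountname>"   # must be globally unique, lowercase

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    resource_group,
    account_name,
    StorageAccountCreateParameters(
        location="uksouth",            # assumption: pick your own region
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,           # hierarchical namespace, required for ADLS Gen2
    ),
)
account = poller.result()
print(account.name, account.provisioning_state)
```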

Image

Creating Containers

Navigate to the Containers section of the storage account and create three containers: bronze (raw data awaiting cleaning), silver (cleaned data for analytics), and gold (enriched data for business use). An overview of the created containers is shown in the screenshots below.
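The same containers can also be created from Python with the azure-storage-blob package; a minimal sketch, assuming a placeholder storage account name and that your identity has permission on the account:

```python
from azure.core.exceptions import ResourceExistsError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Assumption: placeholder storage account name.
account_url = "https://<storageaccountname>.blob.core.windows.net"
service = BlobServiceClient(account_url, credential=DefaultAzureCredential())

for layer in ("bronze", "silver", "gold"):
    try:
        service.create_container(layer)
        print(f"created container: {layer}")
    except ResourceExistsError:
        print(f"container already exists: {layer}")
```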

Image

Setting Up Synapse Analytics

Create a Synapse Analytics workspace, choosing the storage account created earlier. Create a file system named "synapse-fs" and assign the "Storage Blob Data Contributor" role. In the Security tab, select Microsoft Entra ID-only authentication. Review and create, then wait for deployment.

Image

Databricks Configuration

This section details configuring the Databricks environment for data processing, including launching a compute instance, creating a credential, and setting up external locations.

Launching Compute Instance

Launch the Databricks workspace and start a compute instance. Select single node, uncheck Photon acceleration, and choose Standard_DS3_v2 (14 GB, 4 cores). Set auto-termination after 30 minutes to avoid unnecessary costs, bearing in mind that cluster spin-up takes around 15 minutes. Wait for the cluster to be created.
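The same cluster can be defined through the Databricks Clusters API instead of the UI; a sketch is below, with the workspace URL, token, and runtime version string as placeholders rather than the project's actual values:

```python
import requests

# Assumptions: placeholder workspace URL and personal access token.
workspace_url = "https://<databricks-instance>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "earthquake-pipeline",
    "spark_version": "13.3.x-scala2.12",   # pick a current LTS runtime in your workspace
    "node_type_id": "Standard_DS3_v2",      # 14 GB, 4 cores
    "num_workers": 0,                       # single node
    "runtime_engine": "STANDARD",           # Photon acceleration off
    "autotermination_minutes": 30,          # auto-terminate to control cost
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```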

Image

Creating Credential

Navigate to Catalog > External Data > Credentials > Create Credential. Choose Azure managed identity and name it "earthquakePipeline". Find the access connector ID, either by searching in Azure or by locating it in the Databricks-managed resource under managed identities. Paste the ID and create, then verify that two credentials are present (the default one and the new one).

Image

Creating External Locations

Navigate to Catalog > External Data > External Locations > Create External Location. Create locations for bronze, silver, and gold, using each container's URL and the new credential. An error may appear at first because the containers are empty; use the force update option for each location. An overview of the created external locations is shown in the screenshots below.
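The external locations can also be created from a notebook with Unity Catalog SQL instead of the UI; a sketch, with the storage account name as a placeholder and the location names chosen here only for illustration:

```python
# Sketch: create the three external locations with Unity Catalog SQL.
# The storage account name is a placeholder, not the project's real account.
storage_account = "<storageaccountname>"

for layer in ("bronze", "silver", "gold"):
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_ext
        URL 'abfss://{layer}@{storage_account}.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL earthquakePipeline)
    """)
```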

Image

Notebooks Creation and Execution

This section covers setting up access to the storage account and creating notebooks for each layer of the medallion architecture.

Setting Up Access for Storage Account

In the storage account, go to Access Control (IAM) > Role Assignments. Add a new role assignment for "Storage Blob Data Contributor", which allows read, write, and delete access to blob containers. Assign it to the managed identity "unity-catalog-access-connector", pasting the resource ID from the Databricks access connector. Review and assign, confirming that both role assignments are in place.

Image

Creating Notebooks

In the Databricks workspace, create notebooks for the bronze, silver, and gold layers; a minimal sketch of the bronze ingestion is given after this overview.

Bronze notebook: ingest raw data from the USGS API (API Documentation - Earthquake Catalog) and store it in Parquet format.

Silver notebook: clean and normalize the data, remove duplicates, and handle missing values.

Gold notebook: aggregate and enrich the data, e.g. add country codes using the "reverse_geocoder" package.

Install the required libraries on the cluster and wait around 5 minutes for the installation to complete.
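For orientation before the notebook screenshots below, here is a minimal sketch of what the bronze ingestion can look like; the date range, output path, storage account name, and column selection are assumptions for illustration, not the project's actual notebook code:

```python
import requests

# USGS Earthquake Catalog API -- see the API documentation linked above.
url = "https://earthquake.usgs.gov/fdsnws/event/1/query"
params = {
    "format": "geojson",
    "starttime": "2024-01-01",   # assumption: example date range
    "endtime": "2024-01-02",
}

response = requests.get(url, params=params, timeout=60)
response.raise_for_status()
features = response.json()["features"]

# Flatten the GeoJSON features into rows; the column choice here is illustrative.
rows = [
    {
        "id": f["id"],
        "time": f["properties"]["time"],
        "magnitude": f["properties"]["mag"],
        "place": f["properties"]["place"],
        "longitude": f["geometry"]["coordinates"][0],
        "latitude": f["geometry"]["coordinates"][1],
        "depth": f["geometry"]["coordinates"][2],
    }
    for f in features
]

# Write to the bronze container (placeholder storage account name).
bronze_path = "abfss://bronze@<storageaccountname>.dfs.core.windows.net/earthquake_events/"
spark.createDataFrame(rows).write.mode("append").parquet(bronze_path)
```

For the gold enrichment, the reverse_geocoder package's search function takes a (latitude, longitude) pair, so something like rg.search((lat, lon))[0]["cc"] returns the country code, which can be wrapped in a UDF and applied to the silver data.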

BRONZE:

Image

SILVER: