This project demonstrates the development of a robust batch processing pipeline designed to handle large-scale data ingestion and transformation. Using Azure Data Factory (ADF), the pipeline ingests data from multiple sources, applies transformations with Dataflows (powered by Spark), and loads the processed data into an Azure Data Lake Storage Gen2 lakehouse with incremental loading capabilities.
- Azure Data Factory: Orchestrates the pipeline.
- Azure Data Lake Storage Gen2: Stores raw and processed data in a lakehouse structure.
- Dataflows (Spark): Handles transformations like cleansing, deduplication, and aggregation.
- Azure CLI: Automates deployment and configuration.
- GitHub: Hosts configuration files, scripts, and documentation.
Before starting, ensure you have:
- An active Azure subscription.
- Azure CLI installed (Installation Guide).
- Basic knowledge of JSON (for ADF pipeline configs).
- Access to sample data sources (e.g., CSV files, SQL database).
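For example, you can verify the CLI prerequisite and sign in before Step 1 (the subscription value below is a placeholder for your own):

```bash
# Confirm the Azure CLI is installed, sign in, and select the subscription to deploy into
az --version
az login
az account set --subscription "<your-subscription-name-or-id>"
```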
Follow these steps to deploy and run the pipeline. Each step includes Azure CLI commands (where applicable), detailed explanations, and screenshot placeholders for visual clarity.
Why: A resource group organizes all related Azure resources (e.g., ADF, Data Lake) in one logical container for easier management and cleanup.
Command:
```bash
az group create --name BatchPipelineRG --location eastus
```
Explanation:
- `BatchPipelineRG` is a unique name for the resource group.
- `eastus` is chosen for proximity (adjust based on your region for latency optimization).
Why: The lakehouse architecture relies on Data Lake Storage Gen2 for hierarchical namespaces and scalability, ideal for raw and processed data storage.
Command:
```bash
az storage account create \
  --name batchpipelinestorage \
  --resource-group BatchPipelineRG \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --hns true
```
Explanation:
- `batchpipelinestorage` is a globally unique name.
- `Standard_LRS` (Locally Redundant Storage) balances cost and reliability.
- `--hns true` enables hierarchical namespace for lakehouse compatibility.
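As an optional check, you can confirm the hierarchical namespace really is enabled on the new account:

```bash
# Should print "true" if the account is lakehouse-ready (hierarchical namespace on)
az storage account show \
  --name batchpipelinestorage \
  --resource-group BatchPipelineRG \
  --query isHnsEnabled \
  --output tsv
```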
Why: Containers segment data (e.g., raw, staging, processed) for organization and access control.
Command:
```bash
az storage fs create \
  --account-name batchpipelinestorage \
  --name raw \
  --auth-mode login

az storage fs create \
  --account-name batchpipelinestorage \
  --name staging \
  --auth-mode login

az storage fs create \
  --account-name batchpipelinestorage \
  --name processed \
  --auth-mode login
```
Explanation:
- Three containers: `raw` (source data), `staging` (intermediate), and `processed` (final output).
- `--auth-mode login` uses your Azure CLI credentials for authentication.
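A quick way to verify that all three containers exist:

```bash
# List the file systems (containers) in the storage account
az storage fs list \
  --account-name batchpipelinestorage \
  --auth-mode login \
  --output table
```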
Why: ADF orchestrates the pipeline, connecting data sources, transformations, and sinks.
Command:
```bash
az datafactory create \
  --name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --location eastus
```
Explanation:
- `BatchP-ADF` is a unique name for the factory.
- Placed in the same region as other resources to minimize latency.
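Note: the `az datafactory` command group ships as a CLI extension; if the command above isn't recognized, install the extension first:

```bash
# One-time install of the Data Factory extension for the Azure CLI
az extension add --name datafactory
```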
Why: A linked service connects ADF to the Data Lake for data movement and storage.
Manual Step:
- Open ADF in the Azure Portal.
- Go to "Author" > "Manage" > "New Linked Service."
- Select "Azure Data Lake Storage Gen2."
- Enter:
  - Name: `DataLakeLinkedService`
  - Storage Account: `batchpipelinestorage`
  - Authentication: Use "System Assigned Managed Identity" for security.
- Test the connection and save.
Explanation:
- Managed Identity avoids hardcoding credentials, enhancing security.
- This links ADF to the lakehouse for read/write operations.
Screenshot: Show the "Linked Services" tab with `DataLakeLinkedService` configured.
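For reference, a scripted rough equivalent of this step (a minimal sketch: `AzureBlobFS` is the linked-service type used for ADLS Gen2, and the assumption here is that omitting explicit credentials makes the factory fall back to its managed identity):

```bash
# Create the ADLS Gen2 linked service from the CLI
az datafactory linked-service create \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --name DataLakeLinkedService \
  --properties '{
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://batchpipelinestorage.dfs.core.windows.net"
    }
  }'
```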
I ran into an error while testing the connection: it said I needed to enable the auto-resolve integration runtime on the ADF. At the time I could not find where to edit this on the existing factory, and because I had just created the resource, I started over and made a new data factory, this time enabling the auto-resolve integration runtime in the configuration settings under the networking tab.
In short, I had to recreate "BatchP-ADF" to accept the auto-resolve integration runtime.
I connected with a private endpoint, then waited for the approval state to clear.
Once the connection was approved and interactive authoring was enabled, I edited the storage account (ADLS) and added a new role assignment: the Storage Blob Data Contributor role, assigned to my user account and, under system-assigned managed identities, to the ADF resource as the selected member.
Then I reviewed and assigned.
I refreshed the private endpoint on ADF and tested the connection again; this time it was successful.
Afterwards I clicked "Create" to create the linked service.
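The role assignment above can also be scripted. A sketch, assuming the factory's system-assigned identity is the principal that needs access to the lake:

```bash
# Look up the data factory's system-assigned identity and the storage account's resource ID
PRINCIPAL_ID=$(az datafactory show \
  --name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --query identity.principalId --output tsv)

STORAGE_ID=$(az storage account show \
  --name batchpipelinestorage \
  --resource-group BatchPipelineRG \
  --query id --output tsv)

# Grant the factory read/write access to the lake
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Storage Blob Data Contributor" \
  --scope "$STORAGE_ID"
```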
Why: The pipeline defines the workflow: ingest → transform → load.
Manual Step:
- In ADF, go to "Author" > "Pipelines" > "New Pipeline."
- Name it `BatchProcessingPipeline`.
- Add activities (detailed in Step 7).
Explanation:
- The pipeline is the core orchestration logic, chaining tasks in sequence.
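If you prefer to script this step as well, the CLI can create an empty pipeline shell that you then flesh out in the ADF designer (a sketch; the `--pipeline` JSON is just a placeholder with no activities yet):

```bash
# Create an empty pipeline; activities are added in the following steps
az datafactory pipeline create \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --name BatchProcessingPipeline \
  --pipeline '{"activities": []}'
```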
Why: Ingesting from multiple sources (e.g., CSV, SQL) simulates real-world heterogeneity.
Manual Step:
- Add a "Copy Data" activity to the pipeline.
- Source: Upload a sample CSV to
raw
container (e.g., via Azure Portal or CLI). - Sink: Set to
staging
container viaDataLakeLinkedService
. - Configure mappings if needed.
CLI (Optional):
```bash
az storage blob upload \
  --account-name batchpipelinestorage \
  --container-name raw \
  --name sample_data.csv \
  --file ./sample_data.csv \
  --auth-mode login
```
Explanation:
- The "Copy Data" activity moves data without transformation, preserving the raw state.
- Sample CSV mimics a common enterprise data format.
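If you don't have a data source handy, a tiny hand-made CSV is enough to exercise the pipeline. The columns below are purely illustrative, and the duplicated row gives the deduplication step in the next section something to remove:

```bash
# Generate a small sample file to upload to the raw container
cat > sample_data.csv <<'EOF'
id,name,amount,updated_at
1,alpha,10.50,2024-01-01T08:00:00Z
2,beta,7.25,2024-01-01T09:30:00Z
2,beta,7.25,2024-01-01T09:30:00Z
EOF
```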
Why: Dataflows (Spark-based) handle complex transformations like cleansing and deduplication scalably.
Manual Step:
- In ADF, go to "Author" > "Dataflows" > "New Dataflow."
- Name it
TransformData
. - Source:
staging
container. - Add transformations (e.g., remove duplicates, filter nulls).
- Sink:
processed
container. - Link the Dataflow to the pipeline after the "Copy Data" activity.
Explanation:
- Spark ensures scalability for large datasets.
- Transformations improve data quality, a key Azure Data Engineer skill.
I renamed the auto-generated datasets `DelimitedText1` and `DelimitedText2` to `raw_data` and `staging_data`.
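After a successful run you can confirm from the CLI that the Dataflow wrote output to the lake:

```bash
# List what the Dataflow wrote to the processed container
az storage fs file list \
  --account-name batchpipelinestorage \
  --file-system processed \
  --auth-mode login \
  --output table
```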
Why: Incremental loading processes only new/changed data, optimizing performance.
Manual Step:
- Add a "Lookup" activity before "Copy Data" to check the last processed timestamp (stored in a metadata file).
- Use a "Filter" activity to select only new rows based on the timestamp.
- Update the metadata file post-run with the latest timestamp.
Explanation:
- Reduces redundant processing, critical for large-scale systems.
- Metadata-driven logic is an industry best practice.
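As a concrete illustration of the metadata-driven pattern (all names here are hypothetical, not something ADF creates for you): keep a small watermark file in the lake, let the Lookup activity read it at the start of each run, and have the Filter compare each row's timestamp against it, e.g. with a condition along the lines of `@greater(item().updated_at, activity('LookupWatermark').output.firstRow.lastProcessedTimestamp)`.

```bash
# Hypothetical watermark file used by the Lookup activity (names are illustrative)
cat > watermark.json <<'EOF'
{
  "lastProcessedTimestamp": "2024-01-01T00:00:00Z"
}
EOF

# Store it in the lake so the pipeline can read it and update it after each run
az storage blob upload \
  --account-name batchpipelinestorage \
  --container-name staging \
  --name metadata/watermark.json \
  --file ./watermark.json \
  --auth-mode login
```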
Why: Validation ensures reliability; monitoring tracks performance.
Manual Step:
- In ADF, click "Debug" to test the pipeline.
- Go to "Monitor" to view run status and logs.
Explanation:
- Debugging catches errors early.
- Logs provide metrics (e.g., runtime, data volume) for optimization.
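Beyond Debug runs in the UI, you can also trigger and inspect runs from the CLI (a sketch using the datafactory extension commands):

```bash
# Kick off a pipeline run and capture its run ID
RUN_ID=$(az datafactory pipeline create-run \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --name BatchProcessingPipeline \
  --query runId --output tsv)

# Check the run's status, duration, and any error message
az datafactory pipeline-run show \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --run-id "$RUN_ID"
```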