This project demonstrates the development of a robust batch processing pipeline designed to handle large-scale data ingestion and transformation. Using Azure Data Factory (ADF), the pipeline ingests data from multiple sources, applies transformations with Dataflows (powered by Spark), and loads the processed data into an Azure Data Lake Storage Gen2 lakehouse with incremental loading capabilities.
- Azure Data Factory: Orchestrates the pipeline.
- Azure Data Lake Storage Gen2: Stores raw and processed data in a lakehouse structure.
- Dataflows (Spark): Handles transformations like cleansing, deduplication, and aggregation.
- Azure CLI: Automates deployment and configuration.
- GitHub: Hosts configuration files, scripts, and documentation.
Before starting, ensure you have:
- An active Azure subscription.
- Azure CLI installed (Installation Guide).
- Basic knowledge of JSON (for ADF pipeline configs).
- Access to sample data sources (e.g., CSV files, SQL database).
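For example, you can verify the CLI prerequisite and sign in before Step 1 (the subscription value below is a placeholder for your own):

```bash
# Confirm the Azure CLI is installed, sign in, and select the subscription to deploy into
az --version
az login
az account set --subscription "<your-subscription-name-or-id>"
```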
Follow these steps to deploy and run the pipeline. Each step includes Azure CLI commands (where applicable), detailed explanations, and screenshot placeholders for visual clarity.
Why: A resource group organizes all related Azure resources (e.g., ADF, Data Lake) in one logical container for easier management and cleanup.
Command:
```bash
az group create --name BatchPipelineRG --location eastus
```
Explanation:
- `BatchPipelineRG` is a unique name for the resource group.
- `eastus` is chosen for proximity (adjust based on your region for latency optimization).
Why: The lakehouse architecture relies on Data Lake Storage Gen2 for hierarchical namespaces and scalability, ideal for raw and processed data storage.
Command:
```bash
az storage account create \
  --name batchpipelinestorage \
  --resource-group BatchPipelineRG \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --hns true
```
Explanation:
- `batchpipelinestorage` is a globally unique name.
- `Standard_LRS` (Locally Redundant Storage) balances cost and reliability.
- `--hns true` enables hierarchical namespace for lakehouse compatibility.
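As an optional check, you can confirm the hierarchical namespace really is enabled on the new account:

```bash
# Should print "true" if the account is lakehouse-ready (hierarchical namespace on)
az storage account show \
  --name batchpipelinestorage \
  --resource-group BatchPipelineRG \
  --query isHnsEnabled \
  --output tsv
```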
Why: Containers segment data (e.g., raw, staging, processed) for organization and access control.
Command:
```bash
az storage fs create \
  --account-name batchpipelinestorage \
  --name raw \
  --auth-mode login

az storage fs create \
  --account-name batchpipelinestorage \
  --name staging \
  --auth-mode login

az storage fs create \
  --account-name batchpipelinestorage \
  --name processed \
  --auth-mode login
```
Explanation:
- Three containers: `raw` (source data), `staging` (intermediate), and `processed` (final output).
- `--auth-mode login` uses your Azure CLI credentials for authentication.
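A quick way to verify that all three containers exist:

```bash
# List the file systems (containers) in the storage account
az storage fs list \
  --account-name batchpipelinestorage \
  --auth-mode login \
  --output table
```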
Why: ADF orchestrates the pipeline, connecting data sources, transformations, and sinks.
Command:
```bash
az datafactory create \
  --name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --location eastus
```
Explanation:
- `BatchP-ADF` is a unique name for the factory.
- Placed in the same region as other resources to minimize latency.
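Note: the `az datafactory` command group ships as a CLI extension; if the command above isn't recognized, install the extension first:

```bash
# One-time install of the Data Factory extension for the Azure CLI
az extension add --name datafactory
```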
Why: A linked service connects ADF to the Data Lake for data movement and storage.
Manual Step:
- Open ADF in the Azure Portal.
- Go to "Author" > "Manage" > "New Linked Service."
- Select "Azure Data Lake Storage Gen2."
- Enter:
  - Name: `DataLakeLinkedService`
  - Storage Account: `batchpipelinestorage`
  - Authentication: Use "System Assigned Managed Identity" for security.
- Test the connection and save.
Explanation:
- Managed Identity avoids hardcoding credentials, enhancing security.
- This links ADF to the lakehouse for read/write operations.
Screenshot: Show the "Linked Services" tab with `DataLakeLinkedService` configured.
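For reference, a scripted rough equivalent of this step (a minimal sketch: `AzureBlobFS` is the linked-service type used for ADLS Gen2, and the assumption here is that omitting explicit credentials makes the factory fall back to its managed identity):

```bash
# Create the ADLS Gen2 linked service from the CLI
az datafactory linked-service create \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --name DataLakeLinkedService \
  --properties '{
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://batchpipelinestorage.dfs.core.windows.net"
    }
  }'
```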
I ran into an error while testing the connection: it said I needed to enable the auto-resolve integration runtime on the ADF. At the time I could not find where to edit this on the existing factory, and because I had just created the resource, I started over and made a new data factory, this time enabling the auto-resolve integration runtime in the configuration settings under the networking tab.
In short, I had to recreate "BatchP-ADF" to accept the auto-resolve integration runtime.
I connected with a private endpoint, then waited for the approval state to clear.
Once the connection was approved and interactive authoring was enabled, I edited the storage account (ADLS) and added a new role assignment: the Storage Blob Data Contributor role, assigned to my user account and, under system-assigned managed identities, to the ADF resource as the selected member.
Then I reviewed and assigned.
I refreshed the private endpoint on ADF and tested the connection again; this time it was successful.
Afterwards I clicked "Create" to create the linked service.
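The role assignment above can also be scripted. A sketch, assuming the factory's system-assigned identity is the principal that needs access to the lake:

```bash
# Look up the data factory's system-assigned identity and the storage account's resource ID
PRINCIPAL_ID=$(az datafactory show \
  --name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --query identity.principalId --output tsv)

STORAGE_ID=$(az storage account show \
  --name batchpipelinestorage \
  --resource-group BatchPipelineRG \
  --query id --output tsv)

# Grant the factory read/write access to the lake
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Storage Blob Data Contributor" \
  --scope "$STORAGE_ID"
```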
Why: The pipeline defines the workflow: ingest → transform → load.
Manual Step:
- In ADF, go to "Author" > "Pipelines" > "New Pipeline."
- Name it `BatchProcessingPipeline`.
- Add activities (detailed in Step 7).
Explanation:
- The pipeline is the core orchestration logic, chaining tasks in sequence.
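If you prefer to script this step as well, the CLI can create an empty pipeline shell that you then flesh out in the ADF designer (a sketch; the `--pipeline` JSON is just a placeholder with no activities yet):

```bash
# Create an empty pipeline; activities are added in the following steps
az datafactory pipeline create \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --name BatchProcessingPipeline \
  --pipeline '{"activities": []}'
```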
Why: Ingesting from multiple sources (e.g., CSV, SQL) simulates real-world heterogeneity.
Manual Step:
- Add a "Copy Data" activity to the pipeline.
- Source: Upload a sample CSV to
raw
container (e.g., via Azure Portal or CLI). - Sink: Set to
staging
container viaDataLakeLinkedService
. - Configure mappings if needed.
CLI (Optional):
```bash
az storage blob upload \
  --account-name batchpipelinestorage \
  --container-name raw \
  --name sample_data.csv \
  --file ./sample_data.csv \
  --auth-mode login
```
Explanation:
- The "Copy Data" activity moves data without transformation, preserving the raw state.
- Sample CSV mimics a common enterprise data format.
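If you don't have a data source handy, a tiny hand-made CSV is enough to exercise the pipeline. The columns below are purely illustrative, and the duplicated row gives the deduplication step in the next section something to remove:

```bash
# Generate a small sample file to upload to the raw container
cat > sample_data.csv <<'EOF'
id,name,amount,updated_at
1,alpha,10.50,2024-01-01T08:00:00Z
2,beta,7.25,2024-01-01T09:30:00Z
2,beta,7.25,2024-01-01T09:30:00Z
EOF
```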
Why: Dataflows (Spark-based) handle complex transformations like cleansing and deduplication scalably.
Manual Step:
- In ADF, go to "Author" > "Dataflows" > "New Dataflow."
- Name it
TransformData
. - Source:
staging
container. - Add transformations (e.g., remove duplicates, filter nulls).
- Sink:
processed
container. - Link the Dataflow to the pipeline after the "Copy Data" activity.
Explanation:
- Spark ensures scalability for large datasets.
- Transformations improve data quality, a key Azure Data Engineer skill.
I renamed the auto-generated datasets `DelimitedText1` and `DelimitedText2` to `raw_data` and `staging_data`.
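After a successful run you can confirm from the CLI that the Dataflow wrote output to the lake:

```bash
# List what the Dataflow wrote to the processed container
az storage fs file list \
  --account-name batchpipelinestorage \
  --file-system processed \
  --auth-mode login \
  --output table
```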
Why: Incremental loading processes only new/changed data, optimizing performance.
Manual Step:
- Add a "Lookup" activity before "Copy Data" to check the last processed timestamp (stored in a metadata file).
- Use a "Filter" activity to select only new rows based on the timestamp.
- Update the metadata file post-run with the latest timestamp.
Explanation:
- Reduces redundant processing, critical for large-scale systems.
- Metadata-driven logic is an industry best practice.
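As a concrete illustration of the metadata-driven pattern (all names here are hypothetical, not something ADF creates for you): keep a small watermark file in the lake, let the Lookup activity read it at the start of each run, and have the Filter compare each row's timestamp against it, e.g. with a condition along the lines of `@greater(item().updated_at, activity('LookupWatermark').output.firstRow.lastProcessedTimestamp)`.

```bash
# Hypothetical watermark file used by the Lookup activity (names are illustrative)
cat > watermark.json <<'EOF'
{
  "lastProcessedTimestamp": "2024-01-01T00:00:00Z"
}
EOF

# Store it in the lake so the pipeline can read it and update it after each run
az storage blob upload \
  --account-name batchpipelinestorage \
  --container-name staging \
  --name metadata/watermark.json \
  --file ./watermark.json \
  --auth-mode login
```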
Why: Validation ensures reliability; monitoring tracks performance.
Manual Step:
- In ADF, click "Debug" to test the pipeline.
- Go to "Monitor" to view run status and logs.
Explanation:
- Debugging catches errors early.
- Logs provide metrics (e.g., runtime, data volume) for optimization.
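Beyond Debug runs in the UI, you can also trigger and inspect runs from the CLI (a sketch using the datafactory extension commands):

```bash
# Kick off a pipeline run and capture its run ID
RUN_ID=$(az datafactory pipeline create-run \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --name BatchProcessingPipeline \
  --query runId --output tsv)

# Check the run's status, duration, and any error message
az datafactory pipeline-run show \
  --factory-name BatchP-ADF \
  --resource-group BatchPipelineRG \
  --run-id "$RUN_ID"
```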