This repository contains an end-to-end Python ETL pipeline deployed to Azure Kubernetes Service (AKS). It scrapes key financial data on the constituents of various iShares ETFs, transforms the data, and loads it into an Azure SQL Database. The pipeline is containerized with Docker, uses Azure Managed Identity for passwordless authentication, and routes scraping requests through the Bright Data proxy.
## Table of Contents

- Overview
- Core Workflow
- Architecture
- Prerequisites
- Getting Started
- Dockerfile Details
- Scheduled Execution
- Azure Resources
- CI/CD with Azure Pipelines
- Usage
- Project Structure
- Contributing
- License
- Contact
## Core Workflow

- Initialization: `main.py` sets up logging and leverages Azure Workload Identity (Managed Identity) for passwordless authentication to Azure services; no stored credentials are required.
- Scraping: `ishare.py` retrieves ETF constituent tickers and metadata from iShares, routing HTTP requests through the Bright Data proxy (see the sketch after this list).
- Transformation: `transformer.py` and `utils.py` normalize, enrich, and validate the scraped data.
- Loading: `mssql.py` connects to Azure SQL Database using Azure AD authentication (via Managed Identity) and upserts the transformed records.
- Containerization: The Dockerfile builds a Docker image based on Python 3.13.2, installs dependencies, and sets the ETL command.
- Deployment: The image is pushed to Azure Container Registry (ACR) and deployed on AKS.
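For orientation, here is a minimal sketch of the scraping step's proxy routing, assuming the `requests` library and the `BRIGHTDATA_*` variables documented under Environment Variables below; `fetch_via_proxy` is a hypothetical name, not necessarily what `ishare.py` defines:

```python
import os

import requests


def fetch_via_proxy(url: str) -> requests.Response:
    """Fetch a URL through the Bright Data proxy (illustrative sketch)."""
    # Build the proxy URL from the BRIGHTDATA_* environment variables.
    proxy = (
        f"http://{os.environ['BRIGHTDATA_USER']}:{os.environ['BRIGHTDATA_PASSWD']}"
        f"@{os.environ['BRIGHTDATA_PROXY']}:{os.environ['BRIGHTDATA_PORT']}"
    )
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},  # route both schemes through the proxy
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors early
    return response
```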
## Architecture

- Scraping Module: Fetches data via the Bright Data proxy (`ishare.py`).
- Transformation Module: Cleans and formats the data (`transformer.py`, `utils.py`).
- Database Module: Interfaces with Azure SQL Database (`mssql.py`).
- Orchestration: `core.py` coordinates module execution, authenticating through Managed Identity (a sketch follows this list).
- Container & CI/CD: Built with Docker and deployed via Azure Pipelines.
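A minimal sketch of the orchestration flow, under the assumption that each module exposes one entry function; the names below are illustrative, not the repository's actual API:

```python
# Illustrative orchestration sketch; these imports and function names
# are assumptions about the modules described above, not the real API.
from logger import setup_logging
from ishare import scrape_constituents
from transformer import transform
from mssql import upsert_records


def run_etl(etf_tickers: list[str]) -> None:
    setup_logging()
    for etf in etf_tickers:
        raw = scrape_constituents(etf)  # extract: fetch via the Bright Data proxy
        records = transform(raw)        # transform: normalize, enrich, validate
        upsert_records(records)         # load: upsert into Azure SQL
```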
## Prerequisites

- Python 3.13.2+
- Docker
- Azure CLI
- `kubectl`
- Azure subscription with permissions to create:
  - AKS cluster
  - Azure SQL Database
  - Azure Container Registry
## Getting Started

Clone the repository:

```bash
git clone https://github.com/your_org/shares-components.git
cd shares-components
```

Then review the configuration files:

- `settings.py`: Update the Azure SQL and ACR connection strings.
- `logger.py`: Configure log levels and handlers.
- `azure-pipelines.yaml`: Adjust the registry connection and image repository variables as needed.
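For reference, `settings.py` as a configuration loader might look roughly like the sketch below; the variable names follow the table in the next section, while the defaults and structure are assumptions:

```python
import os

# Hypothetical configuration loader; variable names match the table
# below, but the defaults and structure are illustrative assumptions.
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
OUTPUT_TABLE = os.environ["OUTPUT_TABLE"]

BRIGHTDATA_USER = os.environ["BRIGHTDATA_USER"]
BRIGHTDATA_PASSWD = os.environ["BRIGHTDATA_PASSWD"]
BRIGHTDATA_PROXY = os.environ.get("BRIGHTDATA_PROXY", "brd.superproxy.io")
BRIGHTDATA_PORT = int(os.environ.get("BRIGHTDATA_PORT", "33335"))

MSSQL_AD_LOGIN = os.environ.get("MSSQL_AD_LOGIN", "true").lower() == "true"
MSSQL_SERVER = os.environ["MSSQL_SERVER"]
MSSQL_DATABASE = os.environ["MSSQL_DATABASE"]
```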
The pipeline relies on these environment variables, typically injected via Kubernetes Secrets or ConfigMaps:

| Variable | Description |
|---|---|
| `LOG_LEVEL` | Logging verbosity (e.g., `INFO`, `DEBUG`) |
| `OUTPUT_TABLE` | Target table in the Azure SQL database |
| `BRIGHTDATA_USER` | Bright Data API username |
| `BRIGHTDATA_PASSWD` | Bright Data API password |
| `BRIGHTDATA_PROXY` | Proxy host (e.g., `brd.superproxy.io`) |
| `BRIGHTDATA_PORT` | Proxy port (e.g., `33335`) |
| `MSSQL_AD_LOGIN` | Enable Azure AD authentication (`true`) |
| `MSSQL_SERVER` | Azure SQL server hostname |
| `MSSQL_DATABASE` | Azure SQL database name |
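With the `MSSQL_*` variables above, the passwordless connection in `mssql.py` likely follows the standard `azure-identity` + `pyodbc` access-token pattern; this is a sketch of that pattern, not necessarily the repository's exact code:

```python
import os
import struct

import pyodbc
from azure.identity import DefaultAzureCredential

# msodbcsql connection attribute for passing an Azure AD access token.
SQL_COPT_SS_ACCESS_TOKEN = 1256


def connect() -> pyodbc.Connection:
    # Acquire an Azure AD token for Azure SQL and pack it in the
    # format the ODBC driver expects (length-prefixed UTF-16-LE bytes).
    token = DefaultAzureCredential().get_token("https://database.windows.net/.default")
    raw = token.token.encode("utf-16-le")
    token_struct = struct.pack(f"<I{len(raw)}s", len(raw), raw)

    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={os.environ['MSSQL_SERVER']};"
        f"DATABASE={os.environ['MSSQL_DATABASE']}"
    )
    return pyodbc.connect(conn_str, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct})
```

On AKS, `DefaultAzureCredential` picks up the workload identity configured on the pod, which is what makes this flow passwordless.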
## Dockerfile Details

- Base Image: `python:3.13.2-slim-bullseye`
- System Dependencies: Includes the Microsoft ODBC driver (`msodbcsql18`), `mssql-tools`, and the Unix ODBC headers.
- Python Dependencies: Installs packages from `requirements.txt` without caching.
- User Context: Creates a non-root `client` user for security.
- Command: Executes `python main.py`.
## Scheduled Execution

This ETL pipeline is deployed on AKS as a Kubernetes CronJob that runs automatically:

- CronJob Name: `std-etl-ishares-components`
- Namespace: `etl`
- Schedule: `0 11 * * 0` (11:00 AM every Sunday)
- Time Zone: CET
- History Limits: `successfulJobsHistoryLimit: 5`, `failedJobsHistoryLimit: 1`
- Security & Resources:
  - Runs as a non-root user (UID/GID 1000)
  - Read-only root filesystem
  - Privilege escalation disabled
  - Resource requests: 500m CPU, 256Mi RAM
  - Resource limits: 4000m CPU, 2048Mi RAM
## Azure Resources

- AKS Cluster: Hosts the Kubernetes deployment.
- Azure SQL Database: Stores the normalized ETF data.
- ACR: Stores the Docker images.
## CI/CD with Azure Pipelines

The `azure-pipelines.yaml` defines a pipeline that:

- Builds a Docker image.
- Pushes the image to ACR, tagging it both `latest` and with a timestamped build tag.

Key variables and steps in `azure-pipelines.yaml`:

```yaml
variables:
  dockerRegistryServiceConnection: 'pa-azure-container-registry'
  imageRepository: 'shares-components'
  dockerfilePath: '$(Build.SourcesDirectory)/Dockerfile'
  tag: $[format('{0:yyyy}.{0:MM}{0:dd}', pipeline.startTime)]
  buildId: $(Build.BuildId)

steps:
  - task: Docker@2
    displayName: Build and Push Docker Image
    inputs:
      containerRegistry: $(dockerRegistryServiceConnection)
      repository: $(imageRepository)
      command: buildAndPush
      dockerfile: $(dockerfilePath)
      tags: |
        latest
        $(tag).$(buildId)
```
## Usage

Local run (development/debugging):

```bash
export AZURE_SQL_...  # set the necessary environment variables
python main.py
```

AKS deployment: Update your Kubernetes manifest or Helm chart to reference the ACR image `shares-components:latest`.
## Project Structure

```
├── azure-pipelines.yaml   # CI/CD definitions
├── Dockerfile             # Container image: Python 3.13.2, ODBC drivers, requirements
├── main.py                # Entry point: parses args, starts the ETL
├── core.py                # Core orchestration logic
├── ishare.py              # Scraping module via Bright Data proxy
├── transformer.py         # Data normalization functions
├── utils.py               # Helper utilities
├── mssql.py               # Azure SQL DB connector & queries
├── helper.py              # Shared helper functions
├── settings.py            # Configuration loader
├── logger.py              # Logging setup
└── README.md              # Project documentation
```
## Contributing

1. Fork the repository.
2. Create a feature branch: `git checkout -b feature/YourFeature`
3. Commit your changes: `git commit -m 'Add feature'`
4. Push the branch: `git push origin feature/YourFeature`
5. Open a Pull Request.

Please follow the project's code style and add tests where applicable.
## License

This project is licensed under the Apache License 2.0. See LICENSE for details.
## Contact

For questions or feedback, please open an issue or reach out to the maintainer at clem@pvot.io.