
iShares Components

This repository contains an end-to-end Python ETL pipeline deployed to Azure Kubernetes Service (AKS). It scrapes key financial data on the constituents of various iShares ETFs, transforms the data, and loads it into an Azure SQL Database. It is containerized with Docker, uses Azure Managed Identity for passwordless authentication, and routes scraping requests through the Bright Data proxy.


Table of Contents

  • Core Workflow
  • Architecture
  • Prerequisites
  • Getting Started
  • Environment Variables
  • Dockerfile Details
  • Scheduled Execution
  • Azure Resources
  • CI/CD with Azure Pipelines
  • Usage
  • Project Structure
  • Contributing
  • License
  • Contact

Core Workflow

  1. Initialization: main.py sets up logging and leverages Azure Workload Identity (Managed Identity) for passwordless authentication to Azure services—no stored credentials required.
  2. Scraping: ishare.py retrieves ETF constituent tickers and metadata from iShares, routing HTTP requests through the Bright Data proxy (see the sketch after this list).
  3. Transformation: transformer.py and utils.py normalize, enrich, and validate the scraped data.
  4. Loading: mssql.py connects to Azure SQL Database using Azure AD authentication (via Managed Identity) and performs upserts of transformed records.
  5. Containerization: Dockerfile builds a Docker image based on Python 3.13.2, installs dependencies, and sets the ETL command.
  6. Deployment: Image is pushed to Azure Container Registry (ACR) and deployed on AKS.
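
To make step 2 concrete, the proxy routing amounts to building an authenticated proxy URL from the Bright Data environment variables documented below. This is an illustrative sketch, not the code in ishare.py, and the endpoint URL is a placeholder:

import os

import requests

# Build an authenticated Bright Data proxy URL from the documented env vars.
proxy_url = "http://{}:{}@{}:{}".format(
    os.environ["BRIGHTDATA_USER"],
    os.environ["BRIGHTDATA_PASSWD"],
    os.environ["BRIGHTDATA_PROXY"],
    os.environ["BRIGHTDATA_PORT"],
)
proxies = {"http": proxy_url, "https": proxy_url}

# Placeholder endpoint; ishare.py targets the real iShares product URLs.
response = requests.get("https://www.ishares.com/...", proxies=proxies, timeout=30)
response.raise_for_status()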

Architecture

  1. Scraping Module: Fetches data via Bright Data proxy (ishare.py).
  2. Transformation Module: Cleans and formats data (transformer.py, utils.py).
  3. Database Module: Interfaces with Azure SQL DB (mssql.py); see the connection sketch after this list.
  4. Orchestration: core.py coordinates execution of the scraping, transformation, and database modules.
  5. Container & CI/CD: Built with Docker and deployed via Azure Pipelines.

Prerequisites

  • Python 3.13.2+
  • Docker
  • Azure CLI
  • Azure subscription with permissions to create:
    • AKS cluster
    • Azure SQL Database
    • Azure Container Registry
  • kubectl

Getting Started

Clone the Repository

git clone https://github.com/pvotio/ishares-components.git
cd ishares-components

Configuration

  1. settings.py: Update Azure SQL and ACR connection strings.
  2. logger.py: Configure log levels and handlers.
  3. azure-pipelines.yaml: Adjust registry connection and image repository variables as needed.

Environment Variables

The pipeline relies on these environment variables, typically injected via Kubernetes secrets or ConfigMaps:

Variable            Description
------------------  ----------------------------------------
LOG_LEVEL           Logging verbosity (e.g., INFO, DEBUG)
OUTPUT_TABLE        Target table in Azure SQL DB
BRIGHTDATA_USER     Bright Data API username
BRIGHTDATA_PASSWD   Bright Data API password
BRIGHTDATA_PROXY    Proxy host (e.g., brd.superproxy.io)
BRIGHTDATA_PORT     Proxy port (e.g., 33335)
MSSQL_AD_LOGIN      Enable Azure AD authentication (true)
MSSQL_SERVER        Azure SQL server hostname
MSSQL_DATABASE      Azure SQL database name
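
How settings.py consumes these is implementation-specific; a minimal sketch, assuming a hypothetical _require helper but using the exact variable names from the table above, might read:

import os

def _require(name: str) -> str:
    # Fail fast at startup if a required variable is missing.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
OUTPUT_TABLE = _require("OUTPUT_TABLE")
BRIGHTDATA_USER = _require("BRIGHTDATA_USER")
BRIGHTDATA_PASSWD = _require("BRIGHTDATA_PASSWD")
BRIGHTDATA_PROXY = _require("BRIGHTDATA_PROXY")
BRIGHTDATA_PORT = _require("BRIGHTDATA_PORT")
MSSQL_AD_LOGIN = os.environ.get("MSSQL_AD_LOGIN", "true").lower() == "true"
MSSQL_SERVER = _require("MSSQL_SERVER")
MSSQL_DATABASE = _require("MSSQL_DATABASE")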

Dockerfile Details

  • Base Image: python:3.13.2-slim-bullseye
  • System Dependencies: Includes ODBC driver (msodbcsql18), mssql-tools, and Unix ODBC headers.
  • Python Dependencies: Installs packages from requirements.txt without cache.
  • User Context: Creates a non-root client user for security.
  • Command: Executes python main.py.
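
A Dockerfile consistent with these details might look like the following sketch; the Microsoft repository setup shown is one common way to install msodbcsql18 on Debian 11 (bullseye), and the repository's actual file may differ:

FROM python:3.13.2-slim-bullseye

# Register the Microsoft package repo, then install the ODBC driver, tools, and headers.
RUN apt-get update && apt-get install -y --no-install-recommends curl gnupg2 \
 && curl -fsSL https://packages.microsoft.com/keys/microsoft.asc \
      | gpg --dearmor -o /usr/share/keyrings/microsoft.gpg \
 && echo "deb [signed-by=/usr/share/keyrings/microsoft.gpg] https://packages.microsoft.com/debian/11/prod bullseye main" \
      > /etc/apt/sources.list.d/mssql-release.list \
 && apt-get update \
 && ACCEPT_EULA=Y apt-get install -y --no-install-recommends msodbcsql18 mssql-tools unixodbc-dev \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies without the pip cache to keep the image small.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Run as a non-root user.
RUN useradd --create-home client
USER client

CMD ["python", "main.py"]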

Scheduled Execution

This ETL pipeline is deployed on AKS as a Kubernetes CronJob to run automatically:

  • CronJob Name: std-etl-ishares-components
  • Namespace: etl
  • Schedule: 0 11 * * 0 (11:00 AM every Sunday)
  • TimeZone: CET
  • History Limits:
    • successfulJobsHistoryLimit: 5
    • failedJobsHistoryLimit: 1
  • Security & Resources:
    • Runs as non-root user (UID/GID 1000)
    • Read-only root filesystem
    • Privilege escalation disabled
    • Resource requests: 500m CPU, 256Mi RAM
    • Resource limits: 4000m CPU, 2048Mi RAM
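
Put together, a manifest matching these settings would look roughly like this (the container name and registry address are placeholders):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: std-etl-ishares-components
  namespace: etl
spec:
  schedule: "0 11 * * 0"
  timeZone: "CET"
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
          containers:
            - name: etl  # placeholder container name
              image: <acr-name>.azurecr.io/shares-components:latest
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
              resources:
                requests:
                  cpu: 500m
                  memory: 256Mi
                limits:
                  cpu: 4000m
                  memory: 2048Mi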

Azure Resources

  • AKS Cluster: Hosts the Kubernetes deployment.
  • Azure SQL DB: Stores normalized ETF data.
  • ACR: Stores Docker images.

CI/CD with Azure Pipelines

The azure-pipelines.yaml defines a pipeline that:

  1. Builds a Docker image.
  2. Pushes the image to ACR with tags latest and timestamped builds.

Key variables in azure-pipelines.yaml:

variables:
  dockerRegistryServiceConnection: 'pa-azure-container-registry'
  imageRepository: 'shares-components'
  dockerfilePath: '$(Build.SourcesDirectory)/Dockerfile'
  tag: $[format('{0:yyyy}.{0:MM}{0:dd}', pipeline.startTime)]
  buildId: $(Build.BuildId)
steps:
  - task: Docker@2
    displayName: Build and Push Docker Image
    inputs:
      containerRegistry: $(dockerRegistryServiceConnection)
      repository: $(imageRepository)
      command: buildAndPush
      dockerfile: $(dockerfilePath)
      tags: |
        latest
        $(tag).$(buildId)
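
With these variables, each build is pushed under two tags: latest and a date-plus-build-ID tag such as 2025.0615.12345 (the format expression yields yyyy.MMdd from the pipeline start time, and $(buildId) appends the unique build number).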

Usage

Local Run (development/debugging):

export MSSQL_SERVER=...  # plus MSSQL_DATABASE, OUTPUT_TABLE, and BRIGHTDATA_* vars (see Environment Variables)
python main.py

AKS Deployment: Update your Kubernetes manifest or Helm chart to reference the ACR image shares-components:latest.
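
For example, with kubectl (assuming the CronJob's container is named etl, as in the placeholder manifest above):

kubectl -n etl set image cronjob/std-etl-ishares-components \
  etl=<acr-name>.azurecr.io/shares-components:latest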

Project Structure

├── azure-pipelines.yaml      # CI/CD definitions
├── Dockerfile                # Container image: Python 3.13.2, ODBC drivers, requirements
├── main.py                   # Entry point, parses args, starts ETL
├── core.py                   # Core orchestration logic
├── ishare.py                 # Scraping module via Bright Data proxy
├── transformer.py            # Data normalization functions
├── utils.py                  # Helper utilities
├── mssql.py                  # Azure SQL DB connector & queries
├── helper.py                 # Shared helper functions
├── settings.py               # Configuration loader
├── logger.py                 # Logging setup
└── README.md                 # Project documentation

Contributing

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature/YourFeature
  3. Commit your changes: git commit -m 'Add feature'
  4. Push to the branch: git push origin feature/YourFeature
  5. Create a Pull Request.

Please follow the project's code style and add tests where applicable.

License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Contact

For questions or feedback, please open an issue or reach out to the maintainer at clem@pvot.io.
