This project implements a data pipeline as a proof of concept (POC) on an open-source data warehouse stack. It serves as an alternative to popular Big Tech cloud data warehouses such as Databricks, Snowflake, or BigQuery.
It uses the following open source components:
- Apache Airflow for orchestration
- ClickHouse for warehousing
- dbt for data transformation
The project is designed to be deployed on Scaleway European Cloud infrastructure using Kubernetes and Terraform.
It uses the following Scaleway services:
- Kubernetes for container orchestration
- Object Storage for data storage
- Secrets Manager for managing sensitive information
[IMPORTANT] This project is a proof of concept and not intended for production use. It is meant to demonstrate the capabilities of an open source data warehouse stack.
The data pipeline example collects weather forecast data from a public API, processes it, and stores it in ClickHouse for further analysis.
The pipeline consists of the following components:
- Weather Data Collector: Python service that fetches weather forecasts from a public API
- Ingestion Layer: Classes that handle data extraction and loading into S3 storage
- Storage Client: Manages interaction with Scaleway Object Storage for raw data persistence
- ClickHouse Client: Handles connections and data writing to ClickHouse
- Airflow DAG: Orchestrates the entire pipeline with scheduled execution
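To make the flow concrete, the sketch below shows roughly how these components could be wired together for a single location. The class names, method names, and key layout are illustrative assumptions, not the project's actual API; the module paths follow the structure described later in this README.

```python
# Illustrative sketch only: class and method names are assumptions,
# not the project's actual API.
from data_pipeline.clients.weather import WeatherClient        # hypothetical class name
from data_pipeline.clients.storage import StorageClient        # hypothetical class name
from data_pipeline.models.location import Location             # hypothetical class name


def collect_and_store(city: str, lat: float, lon: float) -> str:
    """Fetch a forecast for one location and persist the raw JSON to object storage."""
    location = Location(city=city, latitude=lat, longitude=lon)

    weather = WeatherClient()
    forecast = weather.get_forecast(location)  # raw JSON payload from the public API

    storage = StorageClient()  # S3-compatible client for Scaleway Object Storage
    key = f"raw/weather/{city}.json"           # hypothetical key layout
    storage.upload_json(key, forecast)
    return key
```

Loading into ClickHouse is not done here; that happens downstream via the dbt models described below.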
The src/ directory contains all Python source code for the data pipeline:
- data_pipeline/: Root package for all pipeline components
  - ingestion/: Data ingestion services
    - weather_data_collector.py: Core module for retrieving weather forecasts
  - clients/: API clients for external services
    - weather.py: Client for the weather forecast API
    - storage.py: Client for S3-compatible object storage
    - clickhouse.py: Client for ClickHouse database operations
    - base.py: Base client classes and interfaces
  - models/: Data models for pipeline entities
    - location.py: Represents geographical locations for weather data
    - scaleway_storage.py: Models for interacting with cloud storage
    - clickhouse.py: Data structures for ClickHouse integration
  - configs/: Configuration settings and constants
  - data/: Static data files (e.g., dutch_cities.txt)
  - utils/: Helper functions and utilities
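As an illustration of the models package, a location entity might be a simple dataclass along the lines of the sketch below; the field names are assumptions and the real location.py may differ.

```python
# Hypothetical sketch of models/location.py; actual fields and methods may differ.
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A geographical location for which weather forecasts are collected."""
    city: str
    latitude: float
    longitude: float

    @property
    def coordinates(self) -> tuple[float, float]:
        # Convenience accessor for passing coordinates to the weather API client.
        return (self.latitude, self.longitude)
```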
The pipeline uses Airflow for orchestration, with DAGs defined in the dags/ directory. Weather data is first collected and stored as raw JSON in Scaleway Object Storage before being processed and loaded into ClickHouse using the dbt models described in the next section.
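Because Scaleway Object Storage is S3-compatible, the raw JSON can be written with any standard S3 client. The sketch below uses boto3 against a Scaleway endpoint; the bucket name, key layout, region, and credential handling are assumptions for illustration only.

```python
# Sketch: writing a raw forecast payload to Scaleway Object Storage via boto3.
# Bucket name, key layout, and region are assumptions for illustration.
import json
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.nl-ams.scw.cloud",  # Scaleway S3-compatible endpoint (example region)
    region_name="nl-ams",
    aws_access_key_id="SCW_ACCESS_KEY",          # in practice injected via Secrets Manager
    aws_secret_access_key="SCW_SECRET_KEY",
)

forecast = {"city": "Amsterdam", "hourly": {"temperature_2m": [12.3, 11.8]}}
s3.put_object(
    Bucket="weather-raw",                          # hypothetical bucket name
    Key="raw/weather/amsterdam/2024-01-01.json",   # hypothetical key layout
    Body=json.dumps(forecast).encode("utf-8"),
    ContentType="application/json",
)
```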
The project uses dbt (data build tool) to transform raw weather data into structured, analytics-ready datasets following a layered approach:
- raw_weather: Ingests JSON weather data directly from S3 storage, preserving the raw content and adding a load timestamp. This model uses ClickHouse's native S3 functions to read data from the object storage bucket.
- stg_weather: Transforms raw JSON data into a structured format by extracting specific weather attributes such as temperature, precipitation, and wind speed. This incremental model processes only data added since the last run and extracts city information, coordinates, and hourly weather metrics.
- mart_daily_weather_summary: Provides aggregated daily weather statistics by city. This model calculates min/max/avg temperatures, wind speeds, and precipitation totals to support analytics use cases.
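Once the mart model has been built, the daily summary can be queried like any other ClickHouse table. The snippet below is a sketch using the clickhouse-connect Python driver; the hostname, credentials, and column names mirror the description above but are assumptions about the actual schema.

```python
# Sketch: querying the daily weather mart from Python with clickhouse-connect.
# Host, credentials, and column names are assumptions for illustration.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.example.internal",  # hypothetical service hostname
    username="default",
    password="",
)

rows = client.query(
    """
    SELECT city, day, min_temperature, max_temperature, total_precipitation
    FROM mart_daily_weather_summary
    ORDER BY day DESC
    LIMIT 7
    """
).result_rows

for city, day, t_min, t_max, precip in rows:
    print(f"{day} {city}: {t_min} to {t_max} °C, {precip} mm")
```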
- dags/: Airflow DAGs for weather data collection
- dbt/: Data transformation models using dbt
- k8s/: Kubernetes configuration for deployment - see K8s README
- src/: Python source code for the data pipeline
- terraform/: Infrastructure as code for Scaleway setup - see Terraform README
- Python 3.11 or later
- Docker and Docker Compose
- UV package manager
Clone the repository:

```bash
git clone <repository-url>
cd open-source-data-warehouse-poc
```
To deploy the open-source warehouse, we use Scaleway infrastructure with Kubernetes and Terraform.
- Infrastructure: Deploy cloud resources with Terraform - see Terraform README
- Kubernetes: Deploy applications on K8s cluster - see K8s README
- Scaleway CLI installed and configured
- Terraform installed
- kubectl installed
- Helm v3.x installed
- Docker installed
The weather data collection pipeline runs on a schedule defined in the Airflow DAG. You can also trigger it manually from the Airflow UI.
dbt models transform the raw weather data into analytics-ready tables. To run the dbt models:

```bash
cd dbt
dbt run
```
After deploying the Airflow DAG, you can trigger the data pipeline manually or wait for the scheduled runs. The DAG will execute the following tasks:
- collect_weather_data: Collect weather data from the public API and store it in Scaleway Object Storage
- transform_weather_data: Load the raw data from S3 into ClickHouse and apply transformations using dbt models
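A minimal version of such a DAG might look like the sketch below. The DAG id, schedule, task callables, and dbt invocation path are illustrative assumptions; the actual DAG in dags/ may differ.

```python
# Illustrative DAG sketch; the actual DAG in dags/ may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _collect_weather_data(**context):
    # Placeholder for the ingestion logic that fetches forecasts and
    # writes raw JSON to Scaleway Object Storage.
    ...


with DAG(
    dag_id="weather_pipeline",       # hypothetical DAG id
    schedule="@daily",               # assumed schedule (Airflow 2.4+ parameter name)
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    collect_weather_data = PythonOperator(
        task_id="collect_weather_data",
        python_callable=_collect_weather_data,
    )

    # Runs the dbt models that load raw S3 data into ClickHouse and transform it.
    transform_weather_data = BashOperator(
        task_id="transform_weather_data",
        bash_command="cd /opt/airflow/dbt && dbt run",  # path is an assumption
    )

    collect_weather_data >> transform_weather_data
```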
The project is a proof of concept and not intended for production use. However, if you plan to build from here, consider the following improvements:
- Separate infrastructure, Kubernetes manifests, ingestion, and transformation code into different repositories.
- Re-evaluate the secret management strategy, possibly using an external secrets manager with the helm-secrets plugin.
- Use an artifact registry for the ingestion Python code.
- Enable TLS/SSL and network security policies.
- Check the Airflow Production Guidelines.
- Use a managed Postgres instance as the Airflow metastore.
- Sync DAGs from a Git repository instead of building them into the Docker image.
- Store logs in Scaleway Object Storage.
- Add Redis for caching and as a task queue.
- Adjust computation resources based on workload requirements.
- Add a third-party ClickHouse interface for a better user experience (see options).
- Add ArgoCD for continuous deployment and monitoring of deployments.
- Add an open-source catalog such as Unity Catalog or Lakekeeper for governance and discovery.