Open source data warehouse

This project implements a data pipeline as a proof of concept (POC) on an open source data warehouse stack. It serves as an alternative to popular Big Tech cloud data warehouses such as Databricks, Snowflake, or BigQuery.

It uses the following open source components:

  • Apache Airflow for orchestration
  • ClickHouse for warehousing
  • dbt for data transformation

The project is designed to be deployed on Scaleway European Cloud infrastructure using Kubernetes and Terraform.

It uses the following Scaleway services:

  • Kubernetes for container orchestration
  • Object Storage for data storage
  • Secrets Manager for managing sensitive information

[IMPORTANT] This project is a proof of concept and not intended for production use. It is meant to demonstrate the capabilities of an open source data warehouse stack.

Architecture Overview


Data Pipeline

The data pipeline example collects weather forecast data from a public API, processes it, and stores it in ClickHouse for further analysis.

Ingestion

The pipeline consists of the following components:

  • Weather Data Collector: Python service that fetches weather forecasts from a public API
  • Ingestion Layer: Classes that handle data extraction and loading into S3 storage
  • Storage Client: Manages interaction with Scaleway Object Storage for raw data persistence
  • ClickHouse Client: Handles connections and data writing to ClickHouse
  • Airflow DAG: Orchestrates the entire pipeline with scheduled execution
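
To make the first components concrete, the sketch below shows a minimal version of the collect-and-store step: fetch a forecast from a public API and persist the raw JSON in S3-compatible Object Storage. The API endpoint, bucket name, object key layout, and environment variable names are illustrative assumptions; the project's actual implementation lives in the client classes under src/data_pipeline/ingestion/.

# Minimal sketch of the collect-and-store step (illustrative only).
# The forecast API, bucket name, and credential variables below are
# assumptions; the real logic is wrapped by the ingestion clients.
import json
import os
from datetime import datetime, timezone

import boto3
import requests

FORECAST_API = "https://api.open-meteo.com/v1/forecast"  # assumed public weather API


def fetch_forecast(latitude: float, longitude: float) -> dict:
    """Fetch an hourly forecast for one location and return the raw JSON."""
    response = requests.get(
        FORECAST_API,
        params={
            "latitude": latitude,
            "longitude": longitude,
            "hourly": "temperature_2m,precipitation,wind_speed_10m",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def store_raw_json(payload: dict, city: str) -> str:
    """Write the raw payload to S3-compatible Scaleway Object Storage."""
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.nl-ams.scw.cloud",  # assumed Scaleway region endpoint
        aws_access_key_id=os.environ["SCW_ACCESS_KEY"],
        aws_secret_access_key=os.environ["SCW_SECRET_KEY"],
    )
    key = f"raw/weather/{city}/{datetime.now(timezone.utc):%Y-%m-%dT%H%M%S}.json"
    s3.put_object(Bucket="weather-raw", Key=key, Body=json.dumps(payload))
    return key


if __name__ == "__main__":
    forecast = fetch_forecast(latitude=52.37, longitude=4.89)  # Amsterdam
    print(store_raw_json(forecast, city="amsterdam"))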

The src/ directory contains all Python source code for the data pipeline:

  • data_pipeline/: Root package for all pipeline components
    • ingestion/: Data ingestion services
      • weather_data_collector.py: Core module for retrieving weather forecasts
      • clients/: API clients for external services
        • weather.py: Client for the weather forecast API
        • storage.py: Client for S3-compatible object storage
        • clickhouse.py: Client for ClickHouse database operations
        • base.py: Base client classes and interfaces
      • models/: Data models for pipeline entities
        • location.py: Represents geographical locations for weather data
        • scaleway_storage.py: Models for interacting with cloud storage
        • clickhouse.py: Data structures for ClickHouse integration
      • configs/: Configuration settings and constants
      • data/: Static data files (e.g., dutch_cities.txt)
      • utils/: Helper functions and utilities

The pipeline uses Airflow for orchestration, with DAGs defined in the dags/ directory. Weather data is first collected and stored as raw JSON in Scaleway Object Storage before being processed and loaded into ClickHouse using the dbt models described in the next section.
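
A minimal DAG for this flow might look like the sketch below. The dag_id, schedule, import path, and dbt project location are assumptions for illustration; the actual definition is in dags/.

# Illustrative DAG sketch; the project's real DAG lives in dags/.
# dag_id, schedule, import paths, and the dbt path are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Hypothetical import of the collector entry point described above.
from data_pipeline.ingestion.weather_data_collector import collect_weather_data

with DAG(
    dag_id="weather_pipeline",
    schedule="@daily",  # assumes Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    collect = PythonOperator(
        task_id="collect_weather_data",
        python_callable=collect_weather_data,
    )
    transform = BashOperator(
        task_id="transform_weather_data",
        bash_command="cd /opt/airflow/dbt && dbt run",  # assumed dbt project path
    )
    collect >> transform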

Transformation

The project uses dbt (data build tool) to transform raw weather data into structured, analytics-ready datasets following a layered approach:

Raw Layer

  • raw_weather: Ingests JSON weather data directly from S3 storage, preserving the raw content and adding a load timestamp. This model uses ClickHouse's native S3 functions to read data from the object storage bucket.

Staging Layer

  • stg_weather: Transforms raw JSON data into a structured format by extracting specific weather attributes such as temperature, precipitation, and wind speed. This incremental model processes only new data since the last run and extracts city information, coordinates, and hourly weather metrics.

Mart Layer

  • mart_daily_weather_summary: Provides aggregated daily weather statistics by city. This model calculates min/max/avg temperatures, wind speeds, and precipitation totals to support analytics use cases.
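
Taken together, the three layers boil down to the simplified queries sketched below. In the project they are expressed as dbt models; here they are shown as plain ClickHouse SQL executed through the clickhouse-connect Python client purely for illustration. The bucket URL, credentials, JSON keys, and column names are assumptions.

# Simplified sketch of the three dbt layers as plain ClickHouse SQL.
# In the project these live as dbt models; bucket URL, credentials,
# JSON keys, and column names below are assumptions.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Raw layer: read JSON objects straight from object storage with the s3 table function.
raw_sql = """
SELECT json AS raw_content, now() AS loaded_at
FROM s3(
    'https://weather-raw.s3.nl-ams.scw.cloud/raw/weather/*/*.json',
    'ACCESS_KEY', 'SECRET_KEY',
    'JSONAsString', 'json String'
)
"""

# Staging layer: extract typed attributes from the raw JSON payload.
stg_sql = """
SELECT
    JSONExtractString(raw_content, 'city')     AS city,
    JSONExtractFloat(raw_content, 'latitude')  AS latitude,
    JSONExtractFloat(raw_content, 'longitude') AS longitude,
    JSONExtract(raw_content, 'hourly', 'temperature_2m', 'Array(Float64)') AS temperatures
FROM raw_weather
"""

# Mart layer: aggregate hourly metrics into daily statistics per city.
mart_sql = """
SELECT
    city,
    toDate(forecast_time) AS forecast_date,
    min(temperature)      AS min_temperature,
    max(temperature)      AS max_temperature,
    avg(temperature)      AS avg_temperature,
    sum(precipitation)    AS total_precipitation
FROM stg_weather
GROUP BY city, forecast_date
"""

# Example: inspect the daily summary once the models have been built with dbt.
print(client.query(mart_sql).result_rows[:5])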

Project Structure

  • dags/: Airflow DAGs for weather data collection
  • dbt/: Data transformation models using dbt
  • k8s/: Kubernetes configuration for deployment - see K8s README
  • src/: Python source code for the data pipeline
  • terraform/: Infrastructure as code for Scaleway setup - see Terraform README

Local Development

Prerequisites

  • Python 3.11 or later
  • Docker and Docker Compose
  • uv package manager

Setup

Clone the repository:

git clone <repository-url>
cd open-source-data-warehouse-poc
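
Install the Python dependencies. Assuming the project is managed with uv through a pyproject.toml at the repository root (an assumption based on the prerequisites above), a single command is typically enough:

uv sync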

Deployment

The open source warehouse is deployed on Scaleway's infrastructure using Kubernetes and Terraform.

Quick Start

  1. Infrastructure: Deploy cloud resources with Terraform - see Terraform README
  2. Kubernetes: Deploy applications on K8s cluster - see K8s README

Prerequisites

  • Scaleway CLI installed and configured
  • Terraform installed
  • kubectl installed
  • Helm v3.x installed
  • Docker installed

Running the Pipeline

The weather data collection pipeline runs on a schedule defined in the Airflow DAG. You can also trigger it manually from the Airflow UI.

Data Transformation

dbt models are used for transforming the raw weather data into useful analytics tables. To run the dbt models:

cd dbt
dbt run

Airflow DAG run

After deploying the Airflow DAG, you can trigger the data pipeline manually or wait for the scheduled runs. The DAG will execute the following tasks:

  • collect_weather_data: Collect weather data from the public API and store it in Scaleway Object Storage
  • transform_weather_data: Load the raw data from S3 into ClickHouse and apply transformations using dbt models

(Screenshot of the DAG run in the Airflow UI: images/airflow.png)

Production Guidelines

The project is a proof of concept and not intended for production use. However, if you plan to build from here, consider the following improvements:

Generic improvements

  • Separate infrastructure, k8s manifests, ingestion code, and transformation code into different repositories.
  • Re-evaluate the secret management strategy, for example by using an external secrets manager together with the helm-secrets plugin.
  • Use an artifact registry for the ingestion Python code.
  • Enable TLS/SSL and network security policies.

Airflow improvements

  • Follow the official Airflow Production Guidelines.
  • Use a managed Postgres instance as the Airflow metadata database.
  • Sync DAGs from a Git repository instead of building them into the Docker image.
  • Store logs in Scaleway Object Storage.
  • Add Redis for caching and as a task-queue broker.

ClickHouse improvements

  • Adjust computation resources based on workload requirements.
  • Add a third-party ClickHouse interface for a better user experience (see the available options).

Additional components

  • Add ArgoCD for continuous deployment and monitoring of deployments.
  • Add an open-source catalog such as Unity Catalog or LakeKeeper for governance and discovery.
