Data Forge — a modern data stack playground to practice flows and best practices, not just tools. Spark, Trino, Kafka, Iceberg, ClickHouse, Airflow, MinIO, Superset — all wired together locally with Docker Compose.

🔥 Data Forge — Data Engineering Playground

Your modern data stack playground. Spin up core components of a real data platform and practice end‑to‑end workflows locally.

Instead of just reading about "data lakes" or "lakehouses," you actually get to run them. Think of it as a gym for data engineers without cloud bills or production risk.


🎯 What's Inside

Data Forge includes a complete modern data stack with industry-standard tools:

🗄️ Storage & Catalog

  • MinIO → S3-compatible object storage for data lakes
  • Hive Metastore → Centralized metadata catalog for tables and schemas

⚡ Compute Engines

  • Trino → Interactive SQL query engine for federated analytics
  • Apache Spark → Distributed processing for batch and streaming workloads

🌊 Streaming & CDC

  • Apache Kafka → Event streaming platform
  • Schema Registry → Schema evolution and compatibility
  • Debezium → Change data capture from databases

🗃️ Databases

  • PostgreSQL → Primary OLTP database (source system)
  • ClickHouse → Columnar analytics database (sink)

🔄 Orchestration

  • Apache Airflow 3 → Workflow orchestration

📊 Visualization & Exploration

  • Apache Superset → Modern BI and data visualization
  • JupyterLab → Interactive data science environment

🏭 Data Generation

  • Data generator → Realistic data generation (see infra/data-generator/README.md)

🚀 Quick Start

Prerequisites

  • Docker 20.10+
  • Docker Compose 2.0+
  • 8GB+ RAM recommended
  • 20GB+ disk space for all services
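
The version minimums above can be checked before cloning. The sketch below is one way to do it, not part of the project: `version_ge` and `check_min` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (GNU coreutils, BusyBox, and recent macOS do).

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a detected version meets a required minimum.
# $1 = label, $2 = detected version, $3 = required minimum.
check_min() {
    if version_ge "$2" "$3"; then
        echo "ok: $1 $2 (>= $3)"
    else
        echo "too old or missing: $1 '$2' (need >= $3)"
        return 1
    fi
}

# Usage (requires Docker to be installed), left as comments so that
# loading this file has no side effects:
#   check_min "Docker"  "$(docker version --format '{{.Client.Version}}')" 20.10
#   check_min "Compose" "$(docker compose version --short)" 2.0
```

RAM and disk space are not checked here; the way to query them differs between Linux, macOS, and Docker Desktop.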

1. Clone & Configure

git clone https://github.com/fortiql/data-forge.git
cd data-forge

# Copy environment template
cp .env.example .env

# Review and adjust settings
nano .env
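
When the template gains new variables later, a local `.env` can fall behind. The sketch below lists variables present in `.env.example` but missing from `.env`. It assumes both files use plain `KEY=value` lines; `env_keys` and `missing_env_keys` are hypothetical names, not scripts shipped with the repository.

```shell
#!/bin/sh
# Hypothetical helper: print the variable names defined in an env file.
# Only lines of the form KEY=value are considered; comments are skipped.
env_keys() {
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# Print names defined in the template ($1) but absent from the local file ($2).
missing_env_keys() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    env_keys "$1" > "$tmp_a"
    env_keys "$2" > "$tmp_b"
    # comm -23 keeps lines that appear only in the first (sorted) file.
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Usage, from the repository root:
#   missing_env_keys .env.example .env
```

Empty output means every variable named in the template is also set locally.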

2. Start Core Services

# Start essential data stack (MinIO, Postgres, ClickHouse, etc.)
docker compose --profile core up -d

# Check service status; re-run until every service reports healthy
docker compose ps
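
`docker compose ps` reports status once and returns, so waiting means re-running it by hand. The sketch below shows a generic polling loop instead. `retry_until` is a hypothetical helper, and the probe in the usage comment assumes the MinIO console port listed under "Access the Stack".

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds.
# $1 = maximum attempts, $2 = seconds between attempts, rest = the command.
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1   # gave up: the command never succeeded
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage: poll the MinIO console for up to 30 x 5 seconds.
#   retry_until 30 5 curl -fsS -o /dev/null http://localhost:9001
#
# Recent Compose v2 releases can also block until containers are
# running or healthy:
#   docker compose --profile core up -d --wait
```

The `--wait` form is shorter but depends on the services defining health checks; the polling loop works against any command that exits non-zero until the service is ready.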

3. Add Compute & Orchestration

# Add Airflow for orchestration
docker compose --profile airflow up -d

# Add exploration tools
docker compose --profile explore up -d

# Add realistic data generation
docker compose --profile datagen up -d
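
The profiles above can also be enabled in a single invocation, because `docker compose` accepts `--profile` more than once. The sketch below only builds and prints that command; `compose_up_cmd` is a hypothetical helper name, and the profile names are the ones used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: build one `docker compose` command that enables
# every profile passed as an argument. It prints the command (a dry run)
# rather than executing it.
compose_up_cmd() {
    cmd="docker compose"
    for profile in "$@"; do
        cmd="$cmd --profile $profile"
    done
    echo "$cmd up -d"
}

# Usage: print the command, inspect it, then run it yourself.
#   compose_up_cmd core airflow explore datagen
```

Starting profiles one at a time, as in the steps above, is gentler on a machine near the 8GB RAM recommendation; the combined form is convenient once the stack is known to fit.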

4. Access the Stack

Service       | URL                   | Default Login
Kafka UI      | http://localhost:8082 | No auth
Airflow       | http://localhost:8085 | airflow / airflow
Superset      | http://localhost:8089 | admin / admin
MinIO Console | http://localhost:9001 | minio / minio123
Trino         | http://localhost:8080 | No auth
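
A quick way to confirm the endpoints in the table above are reachable is to probe each one with `curl`. This is a minimal sketch, assuming the default ports have not been changed in `.env`; `services` and `check_services` are hypothetical helper names.

```shell
#!/bin/sh
# Service names and URLs copied from the table above, one "name|url" per line.
services() {
    cat <<'EOF'
Kafka UI|http://localhost:8082
Airflow|http://localhost:8085
Superset|http://localhost:8089
MinIO Console|http://localhost:9001
Trino|http://localhost:8080
EOF
}

# Probe each URL and print "up" or "down" per service.
# Any HTTP response counts as "up" (including redirects to a login page);
# only connection failures and timeouts count as "down".
check_services() {
    services | while IFS='|' read -r name url; do
        if curl -s -o /dev/null --max-time 5 "$url"; then
            echo "up   $name ($url)"
        else
            echo "down $name ($url)"
        fi
    done
}

# Usage, once the relevant profiles are running:
#   check_services
```

A "down" result for Airflow, Superset, or Kafka UI may simply mean the matching profile has not been started yet.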

🧩 Architecture Profiles

See docs/architecture.md for profile details and commands.


📚 Learning Path

Follow docs/learning-path.md for a concise, runnable sequence of notebooks.


🛠️ Development

See docs/development.md for project layout, env vars, and contribution tips.

🤝 Contributing

  • Open issues or PRs with clear scope and steps.
  • Follow the docs style: docs/guidelines.md.
  • Test changes with the relevant compose profiles.

📄 License

MIT — see LICENSE. See service docs for third‑party licenses.


🌟 Resources

Service docs (direct links):


🙏 Acknowledgments

Built on the shoulders of open‑source communities: Apache (Airflow, Spark, Kafka, Superset), Trino, ClickHouse, MinIO, Jupyter, Redis.


The project name "Forge" fits: it's a place where raw metal (data) is hammered into something structured and useful, with you as the smith learning the craft. ⚒️


🏭 Data Generation

See infra/data-generator/README.md for usage and configuration.
