Your modern data stack playground. Spin up core components of a real data platform and practice end‑to‑end workflows locally.
Instead of just reading about "data lakes" or "lakehouses," you actually run them. Think of it as a gym for data engineers: no cloud bills, no production risk.
Data Forge includes a complete modern data stack with industry-standard tools:
- MinIO → S3-compatible object storage for data lakes
- Hive Metastore → Centralized metadata catalog for tables and schemas
- Trino → Interactive SQL query engine for federated analytics
- Apache Spark → Distributed processing for batch and streaming workloads
- Apache Kafka → Event streaming platform
- Schema Registry → Schema evolution and compatibility
- Debezium → Change data capture from databases
- PostgreSQL → Primary OLTP database (source system)
- ClickHouse → Columnar analytics database (sink)
- Apache Airflow 3 → Workflow orchestration
- Apache Superset → Modern BI and data visualization
- JupyterLab → Interactive data science environment
- Data Generator → Realistic retail data producer for Kafka topics and Postgres tables (see infra/data-generator/README.md)
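The services above come up in groups via Docker Compose profiles (used in the quickstart below). A quick way to see which services each profile enables; `profile_services` is our own helper sketch, not part of Data Forge, and it assumes you run it from the repo root where the compose file lives:

```bash
# Show which services each Docker Compose profile enables.
# profile_services is a convenience helper (not part of Data Forge);
# assumes docker-compose.yml at the repo root.
profile_services() {
  if command -v docker >/dev/null 2>&1; then
    docker compose --profile "$1" config --services 2>/dev/null \
      || echo "(compose file not found)"
  else
    echo "(docker not installed)"
  fi
}

for p in core airflow explore datagen; do
  echo "== $p =="
  profile_services "$p"
done
```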
- Docker 20.10+
- Docker Compose 2.0+
- 8GB+ RAM recommended
- 20GB+ disk space for all services
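A quick preflight check against those requirements; the `ver_ge` helper is our own sketch (not part of Data Forge) and assumes GNU `sort -V` is available:

```bash
#!/usr/bin/env bash
# Preflight check for the prerequisites above.
# ver_ge is a small helper (not part of Data Forge): true if $1 >= $2,
# comparing dotted version strings via GNU sort -V.
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if command -v docker >/dev/null 2>&1; then
  dv=$(docker --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)
  ver_ge "$dv" 20.10 && echo "Docker $dv: OK" || echo "Docker $dv: need 20.10+"
else
  echo "Docker: not installed"
fi

if docker compose version >/dev/null 2>&1; then
  cv=$(docker compose version --short | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)
  ver_ge "$cv" 2.0 && echo "Compose $cv: OK" || echo "Compose $cv: need 2.0+"
else
  echo "Compose v2: not available"
fi
```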
```bash
git clone https://github.com/fortiql/data-forge.git
cd data-forge

# Copy environment template
cp .env.example .env

# Review and adjust settings
nano .env

# Start essential data stack (MinIO, Postgres, ClickHouse, etc.)
docker compose --profile core up -d

# Wait for services to be healthy
docker compose ps

# Add Airflow for orchestration
docker compose --profile airflow up -d

# Add exploration tools
docker compose --profile explore up -d

# Add realistic data generation
docker compose --profile datagen up -d
```
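`docker compose ps` only snapshots status at one moment. If you want a script to block until a service's HTTP endpoint actually answers, a small polling helper works; `wait_for_url` is our own sketch, not part of Data Forge:

```bash
# wait_for_url URL [TIMEOUT_SECONDS] — poll until URL responds or time out.
# A convenience sketch, not part of Data Forge.
wait_for_url() {
  url=$1; timeout=${2:-120}; elapsed=0
  until curl -fs -o /dev/null --max-time 2 "$url" 2>/dev/null; do
    elapsed=$((elapsed + 2))
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 2
  done
  echo "ready: $url"
}

# Example: block until the MinIO console answers (port from the service table)
# wait_for_url http://localhost:9001
```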
| Service | URL | Default Login |
|---|---|---|
| Kafka UI | http://localhost:8082 | No auth |
| Airflow | http://localhost:8085 | airflow / airflow |
| Superset | http://localhost:8089 | admin / admin |
| MinIO Console | http://localhost:9001 | minio / minio123 |
| Trino | http://localhost:8080 | No auth |
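A one-shot reachability sweep over those endpoints; `check` is our own helper (not part of Data Forge), and it only verifies that the port answers, not that the login works:

```bash
# check URL — print UP/DOWN depending on whether the endpoint responds.
# Convenience sketch; ports taken from the service table above.
check() {
  if curl -fs -o /dev/null --max-time 3 "$1" 2>/dev/null; then
    echo "UP   $1"
  else
    echo "DOWN $1"
  fi
}

for url in http://localhost:8082 http://localhost:8085 \
           http://localhost:8089 http://localhost:9001 http://localhost:8080; do
  check "$url"
done
```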
See docs/architecture.md for profile details and commands.
Follow docs/learning-path.md for a concise, runnable sequence of notebooks.
See docs/development.md for project layout, env vars, and contribution tips.
- Open issues or PRs with clear scope and steps.
- Follow the docs style: docs/guidelines.md.
- Test changes with the relevant compose profiles.
MIT — see LICENSE. See service docs for third‑party licenses.
- Docs entrypoint: docs/
- Architecture: docs/architecture.md
- Learning Path: docs/learning-path.md
- Development: docs/development.md
- Troubleshooting: docs/troubleshooting.md
- Service docs index: docs/services.md
- Guidelines (please read): docs/guidelines.md
Service docs (direct links):
- MinIO: infra/minio/README.md
- Trino: infra/trino/README.md
- Spark: infra/spark/README.md
- Airflow: infra/airflow/README.md
- ClickHouse: infra/clickhouse/README.md
- Kafka: infra/kafka/README.md
- Schema Registry: infra/schema-registry/README.md
- Hive Metastore: infra/hive-metastore/README.md
- JupyterLab: infra/jupyterlab/README.md
- Superset: infra/superset/README.md
- Postgres: infra/postgres/README.md
- Redis: infra/redis/README.md
- Debezium: infra/debezium/README.md
- Kafka UI: infra/kafka-ui/README.md
Built on the shoulders of open‑source communities: Apache (Airflow, Spark, Kafka, Trino), ClickHouse, MinIO, Jupyter, Superset, Redis.
The project name "Forge" fits: it's a place where raw metal (data) is hammered into something structured and useful, with you as the smith learning the craft. ⚒️
See infra/data-generator/README.md for usage and configuration.