🔥 Data Forge — Data Engineering Playground

Your modern data stack playground. Spin up core components of a real data platform and practice end‑to‑end workflows locally.

Instead of just reading about "data lakes" or "lakehouses," you actually get to run them. Think of it as a gym for data engineers without cloud bills or production risk.

🎯 What's Inside

Data Forge includes a complete modern data stack with industry-standard tools:

🗄️ Storage & Catalog

MinIO → S3-compatible object storage for data lakes
Hive Metastore → Centralized metadata catalog for tables and schemas

⚡ Compute Engines

Trino → Interactive SQL query engine for federated analytics
Apache Spark → Distributed processing for batch and streaming workloads

🌊 Streaming & CDC

Apache Kafka → Event streaming platform
Schema Registry → Schema evolution and compatibility
Debezium → Change data capture from databases

🗃️ Databases

PostgreSQL → Primary OLTP database (source system)
ClickHouse → Columnar analytics database (sink)

🔄 Orchestration

Apache Airflow 3 → Workflow orchestration

📊 Visualization & Exploration

Apache Superset → Modern BI and data visualization
JupyterLab → Interactive data science environment

🏭 Data Generation

Data Generator → Realistic retail data producer for Kafka topics and Postgres tables (see infra/data-generator/README.md)

🚀 Quick Start

Prerequisites

Docker 20.10+
Docker Compose 2.0+
8GB+ RAM recommended
20GB+ disk space for all services
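To confirm the version requirements above before starting, a small check like the following may help. It is a sketch, not part of the repository: the `version_ge` helper is hypothetical, and it assumes your `sort` supports `-V` (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query Docker when the CLI is actually installed.
if command -v docker >/dev/null 2>&1; then
  docker_v="$(docker version --format '{{.Client.Version}}' 2>/dev/null || true)"
  compose_v="$(docker compose version --short 2>/dev/null || true)"
  compose_v="${compose_v#v}"   # some releases print a leading "v"

  version_ge "$docker_v" 20.10 || echo "Docker 20.10+ expected, found: ${docker_v:-none}"
  version_ge "$compose_v" 2.0 || echo "Docker Compose 2.0+ expected, found: ${compose_v:-none}"
fi
```

No output means both versions meet the minimums listed above. RAM and disk space are not checked here.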

1. Clone & Configure

git clone https://github.com/fortiql/data-forge.git
cd data-forge

# Copy environment template
cp .env.example .env

# Review and adjust settings
nano .env
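If `.env.example` gains new variables after you created your `.env`, the two files can drift apart. The sketch below lists keys that exist in the template but not in your local file. The helper names `env_keys` and `missing_env_keys` are hypothetical, and the script assumes simple `KEY=value` lines.

```shell
#!/bin/sh
# env_keys FILE: print the variable names defined as KEY=value in FILE.
env_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# missing_env_keys TEMPLATE ENVFILE: print keys present in TEMPLATE
# but absent from ENVFILE, one per line.
missing_env_keys() {
  env_keys "$1" | while IFS= read -r key; do
    grep -q "^${key}=" "$2" || printf '%s\n' "$key"
  done
}

# Compare the template with the local file when both exist.
if [ -f .env.example ] && [ -f .env ]; then
  missing_env_keys .env.example .env
fi
```

Empty output suggests your `.env` defines every key the template does.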

2. Start Core Services

# Start essential data stack (MinIO, Postgres, ClickHouse, etc.)
docker compose --profile core up -d

# Wait for services to be healthy
docker compose ps
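Rather than rerunning `docker compose ps` by hand, you could poll until a check succeeds. The following is a minimal sketch: `wait_until` is a hypothetical helper, not something the repository ships, and the example check in the comment is only an assumption about what "ready" means for your setup.

```shell
#!/bin/sh
# wait_until TIMEOUT INTERVAL CMD...: rerun CMD until it succeeds,
# giving up once TIMEOUT seconds have elapsed (returns 1 in that case).
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_elapsed=0
  until "$@"; do
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      return 1
    fi
    sleep "$wu_interval"
    wu_elapsed=$((wu_elapsed + wu_interval))
  done
}

# Example (assumption: the MinIO Console answers on port 9001 once it is up):
#   wait_until 120 5 curl -s -o /dev/null http://localhost:9001 \
#     && echo "MinIO Console is reachable"
```

A successful HTTP response only shows that the port answers. `docker compose ps` remains the place to read the container health status.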

3. Add Compute & Orchestration

# Add Airflow for orchestration
docker compose --profile airflow up -d

# Add exploration tools
docker compose --profile explore up -d

# Add realistic data generation
docker compose --profile datagen up -d
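Docker Compose also reads a comma-separated list of profiles from the `COMPOSE_PROFILES` environment variable, so several profiles can be enabled in one command. The sketch below builds that list. `join_profiles` is a hypothetical helper, and the profile names are the ones used in the steps above.

```shell
#!/bin/sh
# join_profiles NAME...: print the arguments joined by commas,
# the format Docker Compose expects in COMPOSE_PROFILES.
join_profiles() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Example: start everything from steps 2 and 3 at once.
#   COMPOSE_PROFILES="$(join_profiles core airflow explore datagen)" docker compose up -d
```

Starting profiles one at a time, as shown above, is still easier on a machine close to the RAM minimum.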

4. Access the Stack

Service	URL	Default Login
Kafka UI	http://localhost:8082	No auth
Airflow	http://localhost:8085	`airflow` / `airflow`
Superset	http://localhost:8089	`admin` / `admin`
MinIO Console	http://localhost:9001	`minio` / `minio123`
Trino	http://localhost:8080	No auth
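To see at a glance which of these endpoints are answering, a quick probe with `curl` is one option. This is a sketch under assumptions: `service_status` is a hypothetical helper, and it treats any HTTP response (including a login redirect) as "up".

```shell
#!/bin/sh
# service_status NAME URL: report whether URL answers an HTTP request
# within three seconds.
service_status() {
  if curl -s -o /dev/null --max-time 3 "$2"; then
    printf '%s: up\n' "$1"
  else
    printf '%s: unreachable\n' "$1"
  fi
}

# URLs taken from the table above.
service_status "Kafka UI" http://localhost:8082
service_status "Airflow" http://localhost:8085
service_status "Superset" http://localhost:8089
service_status "MinIO Console" http://localhost:9001
service_status "Trino" http://localhost:8080
```

A service shows as unreachable if its profile has not been started yet, so compare the result against the profiles you brought up.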

🧩 Architecture Profiles

See docs/architecture.md for profile details and commands.

📚 Learning Path

Follow docs/learning-path.md for a concise, runnable sequence of notebooks.

🛠️ Development

See docs/development.md for project layout, env vars, and contribution tips.

🤝 Contributing

Open issues or PRs with clear scope and steps.
Follow the docs style: docs/guidelines.md.
Test changes with the relevant compose profiles.

📄 License

MIT — see LICENSE. See service docs for third‑party licenses.

🌟 Resources

Docs entrypoint: docs/
Architecture: docs/architecture.md
Learning Path: docs/learning-path.md
Development: docs/development.md
Troubleshooting: docs/troubleshooting.md
Service docs index: docs/services.md
Guidelines (please read): docs/guidelines.md

Service docs (direct links):

MinIO: infra/minio/README.md
Trino: infra/trino/README.md
Spark: infra/spark/README.md
Airflow: infra/airflow/README.md
ClickHouse: infra/clickhouse/README.md
Kafka: infra/kafka/README.md
Schema Registry: infra/schema-registry/README.md
Hive Metastore: infra/hive-metastore/README.md
JupyterLab: infra/jupyterlab/README.md
Superset: infra/superset/README.md
Postgres: infra/postgres/README.md
Redis: infra/redis/README.md
Debezium: infra/debezium/README.md
Kafka UI: infra/kafka-ui/README.md

🙏 Acknowledgments

Built on the shoulders of open‑source communities: Apache (Airflow, Spark, Kafka, Superset), Trino, ClickHouse, MinIO, Jupyter, Redis.

The project name "Forge" fits: it's a place where raw metal (data) is hammered into something structured and useful, with you as the smith learning the craft. ⚒️

🏭 Data Generation

See infra/data-generator/README.md for usage and configuration.
