This repository sets up a complete data lake and analytics environment using Docker Compose. It integrates services such as Kafka, Trino, Spark, MinIO, Elasticsearch, Kibana, and AWS Glue Jupyter notebooks for seamless data ingestion, querying, and analytics.
- Removed Zookeeper and transitioned to KRaft Mode for Kafka to simplify the setup and improve performance.
- Removed Hive Metastore to streamline the architecture and reduce dependencies.
- Description: Distributed streaming platform for building real-time data pipelines, now using KRaft mode (no Zookeeper required).
- Ports: `9092` (Internal Broker), `29092` (External Broker)
- Description: Web interface for managing and monitoring Kafka clusters.
- URL: http://localhost:8083
- Port: `8083`
- Description: Distributed search and analytics engine.
- URL: http://localhost:9201
- Ports: `9201` (HTTP API), `9301` (Transport Layer)
- Description: Visualization tool for Elasticsearch.
- URL: http://localhost:5602
- Port: `5602`
- Description: High-performance object storage compatible with AWS S3.
- Console URL: http://localhost:9001
- Ports: `9000` (S3 API), `9001` (MinIO Console)
- Credentials:
  - Access Key: `minioadmin`
  - Secret Key: `minioadmin`
- Description: Distributed SQL query engine for big data.
- URL: http://localhost:8080
- Port: `8080`
- Description: AWS Glue environment for running data transformation scripts.
- URL: http://localhost:8888
- Port: `8888`
- Install Docker
- Install Docker Compose
# Navigate to the project directory
cd path/to/your/repository
# Start all services
docker-compose up -d
# Stop all services
docker-compose down

# Rebuild and restart all services
docker-compose up --build -d
Create a Topic:
docker exec -it kafka kafka-topics.sh --create --topic test_topic --bootstrap-server localhost:9092
Produce a Message:
docker exec -it kafka kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test_topic
Type your message and press Enter.
Consume a Message:
docker exec -it kafka kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test_topic --from-beginning
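The produce/consume steps above can also be driven from Python. This is a minimal sketch, assuming the `kafka-python` package is installed and the broker is reachable on the ports listed above (`9092` internal, `29092` external); the topic name and event fields are illustrative only.

```python
import json

def encode_event(event: dict) -> bytes:
    """Serialize an event dict to the UTF-8 JSON bytes stored in a Kafka topic."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

def decode_event(raw: bytes) -> dict:
    """Deserialize bytes read from a Kafka topic back into a dict."""
    return json.loads(raw.decode("utf-8"))

# Hypothetical usage with kafka-python (requires a running broker):
# from kafka import KafkaProducer, KafkaConsumer
# producer = KafkaProducer(bootstrap_servers="localhost:29092")
# producer.send("test_topic", encode_event({"user": "alice", "action": "login"}))
# producer.flush()
#
# consumer = KafkaConsumer("test_topic", bootstrap_servers="localhost:29092",
#                          auto_offset_reset="earliest")
# for msg in consumer:
#     print(decode_event(msg.value))
```

Keeping serialization in small helpers like these makes the payload format testable without a broker.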
Access Trino CLI:
docker exec -it trino trino
Run a Query in Trino:
SHOW CATALOGS;
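The same query can be issued from Python instead of the CLI. A sketch, assuming the `trino` client package is installed (`pip install trino`) and the coordinator is on `localhost:8080` as listed above; the `user` value is an arbitrary placeholder, since Trino only needs a username here, not a password.

```python
def connection_params(host="localhost", port=8080, user="admin"):
    """Connection settings for the Trino Python client (user is illustrative)."""
    return {"host": host, "port": port, "user": user}

# Hypothetical usage (requires the Trino container to be running):
# import trino
# conn = trino.dbapi.connect(**connection_params())
# cur = conn.cursor()
# cur.execute("SHOW CATALOGS")
# print(cur.fetchall())
```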
Visit http://localhost:8888 to use AWS Glue Jupyter Notebooks for data transformation.
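From a notebook, transformed data can be written to MinIO through its S3-compatible API. A minimal sketch, assuming `boto3` is installed and MinIO is at `http://localhost:9000` with the default `minioadmin`/`minioadmin` credentials from this README; the bucket and key names are illustrative.

```python
import json

def minio_client_config(endpoint="http://localhost:9000",
                        access_key="minioadmin",
                        secret_key="minioadmin"):
    """Connection settings for an S3-compatible client pointed at MinIO."""
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }

# Hypothetical usage with boto3 (requires the MinIO container to be running):
# import boto3
# s3 = boto3.client("s3", **minio_client_config())
# s3.create_bucket(Bucket="datalake")
# s3.put_object(Bucket="datalake", Key="raw/events.json",
#               Body=json.dumps({"id": 1}).encode("utf-8"))
```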
- MinIO Data: `minio_data`
- Postgres Data: `postgres_data`
- Elastic Data: `elastic_data`
These volumes ensure your data persists even if containers are restarted.
- Access Key: `minioadmin`
- Secret Key: `minioadmin`
docker logs kafka # Replace 'kafka' with any service name to check logs
Ensure all services are connected to the `iceberg_net` network:
docker network inspect iceberg_net
This project is licensed under the MIT License.
Enjoy building your data lake and analytics environment! 🚀