This project implements a real-time data streaming pipeline that ingests data from an external API, orchestrates workflows using Apache Airflow, and leverages Apache Kafka for reliable message queuing and decoupling between data producers and consumers. Incoming data is first stored in PostgreSQL for backup or auditing purposes, then published to Kafka topics. Kafka integrates with the Confluent Schema Registry to ensure consistent message formats, and its operations are monitored via the Kafka Control Center. Apache Spark, running in a distributed setup with master and worker nodes, subscribes to the Kafka topics to process, transform, and enrich the streaming data. The processed data is then stored in Cassandra, a high-performance NoSQL database optimized for real-time reads and writes. ZooKeeper manages the Kafka brokers, ensuring high availability and coordination. The entire system is containerized using Docker, enabling easy deployment, environment consistency, and scalability.
This project implements a complete ETL (Extract, Transform, Load) pipeline where:
- API data is pulled by Airflow DAGs.
- Airflow writes data to PostgreSQL (backup) and pushes it to Kafka topics (the ingestion DAG is sketched after this list).
- Kafka broadcasts the data to Spark for processing.
- Spark processes, transforms and pushes final results to Cassandra for querying or dashboarding.
- Schema Registry ensures all Kafka messages conform to an agreed format.
- Control Center and ZooKeeper manage and monitor the Kafka ecosystem.
- Everything runs on Docker, simplifying deployment.
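To make the ingestion step concrete, here is a minimal sketch of what a DAG like `dags/kafka_stream.py` could look like: it fetches a profile from the RandomUser API, flattens a few fields, and publishes to the `users_created` topic. The DAG id, schedule, broker address, and flattening logic are assumptions for illustration, not the repository's exact code.

```python
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer  # kafka-python


def stream_users():
    """Fetch one user profile from the RandomUser API and publish it to Kafka."""
    res = requests.get("https://randomuser.me/api/").json()["results"][0]
    user = {
        "first_name": res["name"]["first"],
        "last_name": res["name"]["last"],
        "email": res["email"],
        # ...remaining fields flattened to match the Cassandra schema
    }
    # Broker address inside the Docker network is an assumption; adjust to your listener config.
    producer = KafkaProducer(bootstrap_servers="broker:29092")
    producer.send("users_created", json.dumps(user).encode("utf-8"))
    producer.flush()


with DAG(
    dag_id="user_automation",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="* * * * *",       # roughly one message per minute
    catchup=False,
) as dag:
    PythonOperator(task_id="stream_data_from_api", python_callable=stream_users)
```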
- Apache Airflow - Workflow orchestration and scheduling
- PostgreSQL - Relational backup and source of truth for incoming data
- Apache Kafka - Real-time data streaming
- Apache Cassandra - NoSQL database for storage
- Apache Spark - Distributed data processing (cluster ready)
- Docker - Containerization and orchestration
- Python - Data processing and consumer logic
- Docker and Docker Compose
- 8GB+ RAM recommended
- Ports 8082, 9021, 8083, 9042 available
```bash
git clone https://github.com/Peter-Opapa/kafka_airflow_cassandra_pipeline.git
cd kafka_airflow_cassandra_pipeline
docker-compose up -d
```
All services need time to initialize. Monitor with:
```bash
docker-compose ps
```
| Service | URL | Purpose |
|---|---|---|
| Airflow | http://localhost:8082 | Pipeline orchestration |
| Kafka Control Center | http://localhost:9021 | Stream monitoring |
| Spark Master | http://localhost:8083 | Cluster management |
Airflow Credentials:
- Username: `admin`
- Password: `yk3DNHKWbWCHnzQV`
- Airflow UI: Monitor DAG runs, task execution, and pipeline health
- Kafka Control Center: Real-time monitoring of Kafka topics, consumers, and message flow
- Spark Master UI: Monitor Spark jobs, executors, and cluster resources
- Data Generation: Airflow DAG fetches user profiles from the RandomUser API
- Message Publishing: User data is published to the Kafka topic `users_created`
- Stream Processing: Python consumer processes messages in real time
- Data Storage: Processed records are stored in the Cassandra `created_users` table
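For context, a message on `users_created` is expected to carry flat fields mirroring the Cassandra columns listed further below; an illustrative record (field names follow the schema, values are invented):

```python
# Illustrative payload only — values are made up.
sample_user = {
    "first_name": "Jane",
    "last_name": "Doe",
    "gender": "female",
    "email": "jane.doe@example.com",
    "username": "janedoe42",
    "phone": "555-0100",
    "address": "1 Example Street, Springfield",
    "post_code": "12345",
    "dob": "1990-05-17T00:00:00Z",
    "registered_date": "2020-01-01T12:00:00Z",
    "picture": "https://randomuser.me/api/portraits/women/1.jpg",
}
```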
```
.
├── docker-compose.yml       # Infrastructure orchestration
├── requirements.txt         # Python dependencies
├── dags/
│   └── kafka_stream.py      # Airflow DAG for data ingestion
├── src/
│   ├── simple_stream.py     # Kafka-to-Cassandra consumer
│   └── debug_pipeline.ps1   # Pipeline debugging script
├── docs/
│   └── images/              # UI screenshots and diagrams
├── logs/                    # Airflow logs
├── plugins/                 # Airflow plugins
└── script/                  # Initialization scripts
```
```sql
CREATE TABLE created_users (
    id UUID PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    gender TEXT,
    email TEXT,
    username TEXT,
    phone TEXT,
    address TEXT,
    post_code TEXT,
    dob TIMESTAMP,
    registered_date TIMESTAMP,
    picture TEXT,
    timestamp TIMESTAMP
);
```
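The verification commands later in this README query `spark_streams.created_users`, so the table above is assumed to live in a `spark_streams` keyspace. A minimal sketch of creating both with the Python `cassandra-driver` (the replication settings are an assumption, not taken from the repository):

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Cassandra is exposed on localhost:9042 by the compose setup
cluster = Cluster(["localhost"], port=9042)
session = cluster.connect()

# Single-node replication is an assumption suited to this Docker setup
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS spark_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS spark_streams.created_users (
        id UUID PRIMARY KEY,
        first_name TEXT, last_name TEXT, gender TEXT, email TEXT,
        username TEXT, phone TEXT, address TEXT, post_code TEXT,
        dob TIMESTAMP, registered_date TIMESTAMP,
        picture TEXT, timestamp TIMESTAMP
    );
""")
```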
The data consumer runs automatically, but you can also run it manually:
```bash
# Copy consumer to Spark container
docker cp src/simple_stream.py spark-master:/simple_stream.py

# Install dependencies
docker exec spark-master pip install kafka-python cassandra-driver

# Run consumer
docker exec spark-master python /simple_stream.py
```
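For reference, the core of a Kafka-to-Cassandra consumer such as `src/simple_stream.py` typically looks something like the sketch below. It assumes `kafka-python` and `cassandra-driver` (the dependencies installed above) and only writes a few columns for brevity; it is an illustration, not the repository's exact code.

```python
import json
import uuid

from kafka import KafkaConsumer          # kafka-python
from cassandra.cluster import Cluster    # cassandra-driver

# Subscribe to users_created; the broker address assumes the consumer runs
# inside the Docker network (adjust to localhost:9092 when run from the host).
consumer = KafkaConsumer(
    "users_created",
    bootstrap_servers="broker:29092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# "cassandra" resolves inside the Docker network
session = Cluster(["cassandra"], port=9042).connect("spark_streams")
insert = session.prepare(
    "INSERT INTO created_users (id, first_name, last_name, email) VALUES (?, ?, ?, ?)"
)

for message in consumer:
    user = message.value
    session.execute(insert, (
        uuid.uuid4(),
        user.get("first_name"),
        user.get("last_name"),
        user.get("email"),
    ))
    print(f"Stored {user.get('email')}")
```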
```bash
# Verify Kafka messages
docker exec broker kafka-console-consumer --bootstrap-server localhost:9092 --topic users_created --from-beginning --max-messages 5

# Check Cassandra data
docker exec cassandra cqlsh -e "SELECT COUNT(*) FROM spark_streams.created_users;"

# View sample records
docker exec cassandra cqlsh -e "SELECT first_name, last_name, email FROM spark_streams.created_users LIMIT 5;"
```
```bash
docker-compose ps
docker-compose logs [service-name]
```
| Service | Container | Port | Description |
|---|---|---|---|
| Zookeeper | zookeeper | 2181 | Kafka coordination |
| Kafka Broker | broker | 9092, 29092 | Message streaming |
| Schema Registry | schema-registry | 8081 | Schema management |
| Control Center | control-center | 9021 | Kafka monitoring |
| Airflow | airflow-webserver | 8082 | Workflow management |
| Postgres | postgres | 5432 | Airflow metadata |
| Spark Master | spark-master | 8083 | Cluster coordination |
| Spark Worker | spark-worker | 8084 | Task execution |
| Cassandra | cassandra | 9042 | Data storage |
- Port Conflicts: Ensure ports 8082, 9021, 8083, 9042 are free
- Memory Issues: Increase Docker memory allocation to 8GB+
- Startup Time: Allow 2-3 minutes for all services to initialize
- Network Issues: All services use the `kafka_confluent` Docker network
```bash
# Check all containers
docker-compose ps

# View service logs
docker-compose logs airflow-webserver
docker-compose logs broker
docker-compose logs cassandra

# Restart specific service
docker-compose restart [service-name]

# Clean restart
docker-compose down
docker-compose up -d
```
Run the debugging script to verify end-to-end flow:
```bash
# Windows PowerShell
.\src\debug_pipeline.ps1

# Linux/Mac (convert script)
bash src/debug_pipeline.sh
```
- Throughput: ~1 message/minute (configurable in Airflow DAG)
- Latency: <1 second from Kafka to Cassandra
- Scalability: Horizontally scalable with additional Spark workers
- Reliability: Automatic retry logic and error handling (see the sketch below)
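The retry behaviour is not spelled out in this README; a minimal sketch of how a consumer-side retry around the Cassandra write could look (the retry count and delay are assumptions):

```python
import time


def execute_with_retry(session, statement, params, retries=3, delay=2.0):
    """Retry a Cassandra write a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return session.execute(statement, params)
        except Exception as exc:  # e.g. driver timeouts or unavailable nodes
            if attempt == retries:
                raise
            print(f"Write failed ({exc}); retrying in {delay}s "
                  f"(attempt {attempt}/{retries})")
            time.sleep(delay)
```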
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Apache Airflow Documentation
- Apache Kafka Documentation
- Apache Cassandra Documentation
- Confluent Platform Documentation
⭐ If you find this project helpful, please give it a star!