This project combines several tools: Apache Kafka, Apache Airflow, Apache Spark, PostgreSQL, and Docker.
The data pipeline consists of three main stages:
- **Data Streaming:** Data is first streamed from an external API into a Kafka topic, simulating real-time ingestion into the system (a minimal producer sketch follows this list).
- **Data Processing:** A Spark job consumes the data from the Kafka topic, processes it, and saves the results into a PostgreSQL database (see the Spark sketch below).
- **Orchestration with Airflow:** The entire workflow, including the Kafka streaming task and the Spark processing job, is orchestrated with Apache Airflow (see the example DAG below).
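As a rough illustration of the streaming stage, the sketch below polls a hypothetical external API and publishes each record to a Kafka topic using `kafka-python`. The API URL, topic name, broker address, and polling window are placeholders, not this project's actual configuration.

```python
import json
import time

import requests
from kafka import KafkaProducer

API_URL = "https://example.com/api/records"  # placeholder external API
TOPIC = "users_created"                       # hypothetical topic name

# Serialize Python dicts to JSON bytes before sending them to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def stream_once() -> None:
    """Fetch one record from the API and publish it to the Kafka topic."""
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    producer.send(TOPIC, response.json())


if __name__ == "__main__":
    # Poll the API for a short window to simulate real-time ingestion.
    deadline = time.time() + 60
    while time.time() < deadline:
        stream_once()
        time.sleep(1)
    producer.flush()
```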
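The processing stage could look roughly like the following PySpark Structured Streaming job: it reads the topic, parses the JSON payload, and writes each micro-batch to PostgreSQL over JDBC via `foreachBatch`. The schema, table name, and connection settings are illustrative assumptions, and the sketch presumes the Kafka connector and the PostgreSQL JDBC driver are on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka_to_postgres").getOrCreate()

# Hypothetical payload schema; the real topic may carry different fields.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("email", StringType()),
])

# Consume the Kafka topic as a streaming source.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "users_created")               # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes; decode them and parse the JSON payload.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("data"))
    .select("data.*")
)


def write_batch(batch_df, batch_id):
    """Append one micro-batch to a PostgreSQL table over JDBC."""
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/pipeline")  # placeholder
        .option("dbtable", "users")
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )


query = (
    parsed.writeStream.foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/users")
    .start()
)
query.awaitTermination()
```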
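Orchestration might be wired up with a DAG along these lines, where a `PythonOperator` runs the API-to-Kafka streaming function and a `SparkSubmitOperator` (from the `apache-airflow-providers-apache-spark` package) submits the processing job. The DAG id, schedule, task ids, and application path are assumptions, not the project's real definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def stream_from_api_to_kafka():
    """Placeholder for the producer logic shown in the streaming sketch above."""
    ...


with DAG(
    dag_id="data_pipeline",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stream_task = PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_from_api_to_kafka,
    )

    spark_task = SparkSubmitOperator(
        task_id="process_stream_with_spark",
        application="/opt/airflow/jobs/spark_stream.py",  # placeholder path
        conn_id="spark_default",
    )

    # Run the streaming task first, then the Spark processing job.
    stream_task >> spark_task
```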
All components are containerized and managed with Docker and Docker Compose, which keeps the setup reproducible, portable, and easy to scale (a trimmed-down compose sketch follows).
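A stripped-down `docker-compose.yml` for such a stack might resemble the sketch below; the service names, image tags, and credentials are illustrative and omit most of the configuration a working setup needs.

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0

  broker:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres        # placeholder credentials
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: pipeline

  spark-master:
    image: bitnami/spark:3.5

  airflow-webserver:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
```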