This project combines several tools: Apache Kafka, Apache Airflow, Apache Spark, PostgreSQL, and Docker.
The data pipeline consists of three main stages:
- **Data Streaming:** Data is first streamed from an external API into a Kafka topic, simulating real-time ingestion into the system (a minimal producer sketch follows this list).
- **Data Processing:** A Spark job consumes the data from the Kafka topic, processes it, and saves the results into a PostgreSQL database (see the Spark sketch below).
- **Orchestration with Airflow:** The entire workflow, including the Kafka streaming task and the Spark processing job, is orchestrated with Apache Airflow (see the example DAG below).
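As a rough illustration of the streaming stage, the sketch below polls a hypothetical external API and publishes each record to a Kafka topic using `kafka-python`. The API URL, topic name, broker address, and polling window are placeholders, not this project's actual configuration.

```python
import json
import time

import requests
from kafka import KafkaProducer

API_URL = "https://example.com/api/records"  # placeholder external API
TOPIC = "users_created"                       # hypothetical topic name

# Serialize Python dicts to JSON bytes before sending them to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def stream_once() -> None:
    """Fetch one record from the API and publish it to the Kafka topic."""
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    producer.send(TOPIC, response.json())


if __name__ == "__main__":
    # Poll the API for a short window to simulate real-time ingestion.
    deadline = time.time() + 60
    while time.time() < deadline:
        stream_once()
        time.sleep(1)
    producer.flush()
```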
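The processing stage could look roughly like the following PySpark Structured Streaming job: it reads the topic, parses the JSON payload, and writes each micro-batch to PostgreSQL over JDBC via `foreachBatch`. The schema, table name, and connection settings are illustrative assumptions, and the sketch presumes the Kafka connector and the PostgreSQL JDBC driver are on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka_to_postgres").getOrCreate()

# Hypothetical payload schema; the real topic may carry different fields.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("email", StringType()),
])

# Consume the Kafka topic as a streaming source.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "users_created")               # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes; decode them and parse the JSON payload.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("data"))
    .select("data.*")
)


def write_batch(batch_df, batch_id):
    """Append one micro-batch to a PostgreSQL table over JDBC."""
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/pipeline")  # placeholder
        .option("dbtable", "users")
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )


query = (
    parsed.writeStream.foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/users")
    .start()
)
query.awaitTermination()
```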
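Orchestration might be wired up with a DAG along these lines, where a `PythonOperator` runs the API-to-Kafka streaming function and a `SparkSubmitOperator` (from the `apache-airflow-providers-apache-spark` package) submits the processing job. The DAG id, schedule, task ids, and application path are assumptions, not the project's real definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def stream_from_api_to_kafka():
    """Placeholder for the producer logic shown in the streaming sketch above."""
    ...


with DAG(
    dag_id="data_pipeline",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stream_task = PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_from_api_to_kafka,
    )

    spark_task = SparkSubmitOperator(
        task_id="process_stream_with_spark",
        application="/opt/airflow/jobs/spark_stream.py",  # placeholder path
        conn_id="spark_default",
    )

    # Run the streaming task first, then the Spark processing job.
    stream_task >> spark_task
```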
All components are containerized and managed with Docker and Docker Compose, which keeps the setup reproducible, portable, and easy to scale (a trimmed-down compose sketch follows).
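A stripped-down `docker-compose.yml` for such a stack might resemble the sketch below; the service names, image tags, and credentials are illustrative and omit most of the configuration a working setup needs.

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0

  broker:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres        # placeholder credentials
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: pipeline

  spark-master:
    image: bitnami/spark:3.5

  airflow-webserver:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
```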