
Product Recall Streaming Pipeline

Kafka Airflow PySpark Docker

This project combines Kafka, Apache Airflow, PySpark, PostgreSQL, and Docker into a single streaming pipeline: product recall events are streamed from an external API into Kafka, processed with PySpark, and stored in PostgreSQL, with the whole workflow orchestrated by Airflow.


Overview

The data pipeline consists of three main stages:

  1. Data Streaming:
    Data is initially streamed from an external API into a Kafka topic. This simulates real-time data ingestion into the system (see the producer sketch below).

  2. Data Processing:
    A Spark job consumes the data from the Kafka topic, processes it, and saves the results into a PostgreSQL database (see the Spark sketch below).


  3. Orchestration with Airflow:
    The entire workflow, including the Kafka streaming task and the Spark processing job, is orchestrated using Apache Airflow (see the DAG sketch below).

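The streaming stage can be sketched with the kafka-python client: poll the recall API and publish each record to a topic. The endpoint URL, topic name, payload key, and polling interval below are placeholders, not values taken from this repository.

```python
import json
import time

import requests
from kafka import KafkaProducer  # kafka-python

# Placeholder values -- adjust to the actual recall API endpoint and broker address.
API_URL = "https://example.com/product-recalls"
KAFKA_BOOTSTRAP = "localhost:9092"
TOPIC = "product_recalls"  # assumed topic name

producer = KafkaProducer(
    bootstrap_servers=KAFKA_BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def stream_once() -> None:
    """Fetch the latest recall records and publish each one to Kafka."""
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    # "records" is an assumed payload key; adjust to the API's actual response shape.
    for record in response.json().get("records", []):
        producer.send(TOPIC, value=record)
    producer.flush()

if __name__ == "__main__":
    # Poll the API periodically to approximate a continuous stream.
    while True:
        stream_once()
        time.sleep(60)
```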
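The processing stage can be sketched with PySpark Structured Streaming: read from the Kafka topic, decode the JSON payload, and write each micro-batch to PostgreSQL over JDBC. The schema fields, topic name, and connection details are illustrative assumptions rather than the repository's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("product-recall-processing")
    # The Kafka and PostgreSQL connector JARs must be on the classpath,
    # e.g. via spark-submit --packages.
    .getOrCreate()
)

# Illustrative schema -- the real recall payload has more fields.
schema = StructType([
    StructField("reference", StringType()),
    StructField("product_name", StringType()),
    StructField("risk", StringType()),
    StructField("date_published", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "product_recalls")  # assumed topic name
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes; decode and flatten the JSON payload.
recalls = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
       .select("data.*")
)

def write_to_postgres(batch_df, batch_id):
    """Write each micro-batch to PostgreSQL over JDBC (connection details are placeholders)."""
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/recalls")
        .option("dbtable", "product_recalls")
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

query = (
    recalls.writeStream
    .foreachBatch(write_to_postgres)
    .option("checkpointLocation", "/tmp/recall_checkpoints")
    .start()
)
query.awaitTermination()
```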
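Finally, a minimal sketch of how the two tasks might be wired into an Airflow DAG: a PythonOperator runs the Kafka producer, then a BashOperator submits the Spark job. The module name kafka_stream, file path, package versions, and schedule are hypothetical and may not match the DAG in this repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Hypothetical import: the producer function from the streaming sketch above.
from kafka_stream import stream_once

with DAG(
    dag_id="product_recall_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # assumed schedule
    catchup=False,
) as dag:

    stream_to_kafka = PythonOperator(
        task_id="stream_recalls_to_kafka",
        python_callable=stream_once,
    )

    process_with_spark = BashOperator(
        task_id="process_recalls_with_spark",
        # Paths and package versions are placeholders.
        bash_command=(
            "spark-submit "
            "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,"
            "org.postgresql:postgresql:42.7.3 "
            "/opt/airflow/jobs/process_recalls.py"
        ),
    )

    # Stream first, then process: Airflow enforces the ordering.
    stream_to_kafka >> process_with_spark
```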

Deployment

All components are containerized and managed using Docker and docker-compose, ensuring easy setup, portability, and scalability.
