I took this project on after reading about Uber and deciding to build a ride-sharing event pipeline as if it were for my own startup. The project implements a real-time data pipeline for processing ride-sharing events using Apache Kafka, Apache Spark, Apache Pinot (for streaming analytics), Apache Superset (for visualizations), and Delta Lake, orchestrated with Apache Airflow. (It still needs improvement.)
The pipeline consists of the following components (see assets/data-pipeline-graph.md). A brief overview:
- Kafka Producers
  - ride_event_producer.py: Generates ride events in Avro format
  - ride_event_json.py: Generates ride events in JSON format
  - ride_summary_producer.py: Produces ride summary events
- Apache Spark Processing
  - Uses Spark Structured Streaming to consume Kafka topics
  - Processes and transforms ride event data
  - Writes data to Delta Lake tables
- Delta Lake Storage
  - Provides ACID transactions and versioning
  - Enables time travel capabilities
  - Stores processed ride data in an optimized format
- Streaming to Apache Pinot (OLAP)
  - Streams events from Kafka into Apache Pinot for OLAP queries
- Data Visualization with Apache Superset
  - Connects Apache Pinot to Superset for dashboards (I am big on analytics, too)
- Orchestration with Airflow
  - Apache Airflow schedules and orchestrates the Kafka producers
  - Schedules the streaming write from Kafka to Delta Lake (a minimal DAG sketch follows this list)
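To make the orchestration concrete, here is a minimal sketch of what the two DAGs (kafka_producer_dag with a PythonOperator, kafka_to_delta with a BashOperator running spark-submit) might look like. The imported callable, file paths, schedules, and package versions are assumptions, not the repo's actual code:

```python
# Minimal sketch, assuming the producer code in dags/packages exposes a
# produce_ride_events() callable; the real DAGs in this repo may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from packages.ride_event_json import produce_ride_events  # assumed entry point

with DAG(
    dag_id="kafka_producer_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # illustrative schedule
    catchup=False,
) as producer_dag:
    PythonOperator(
        task_id="produce_ride_events",
        python_callable=produce_ride_events,
    )

with DAG(
    dag_id="kafka_to_delta",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually from the UI
    catchup=False,
) as spark_dag:
    BashOperator(
        task_id="spark_kafka_to_delta",
        # Package coordinates and the script path are illustrative; match your
        # Spark/Delta versions and your Airflow volume mounts.
        bash_command=(
            "spark-submit "
            "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,"
            "io.delta:delta-spark_2.12:3.1.0 "
            "/opt/airflow/dags/spark_delta/kafka_to_delta.py"
        ),
    )
```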
-- Setup Overview:
Services are set up in docker-compose.yml; edit it to your liking, in particular the apache-superset service.
- Make sure Docker is running, then in your terminal run:
  docker build -t custom-airflow .
  docker compose up -d
  docker exec airflow-cli airflow users create --username USERNAME --firstname FIRST_NAME --lastname LASTNAME --email airflow@email.com --role Admin --password ******
- Run your Airflow DAGs on localhost:8080 and trigger them:
  - kafka_producer_dag: uses the PythonOperator
  - kafka_to_delta: uses the BashOperator to launch the Spark job
- Set up Apache Pinot on localhost:9000: open the Swagger API (or use the Docker CLI) to call AddSchema and AddTable; you will find schema.json and table.json in the pinot_setup folder (see the sketch after this list for doing the same through the controller's REST API).
- Set up Apache Superset: first install the Apache Pinot driver (check here), then open localhost:8088, add the Pinot connection configuration, and you are good to go with visualizations (a quick connection check is sketched below).
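As an alternative to the Swagger UI mentioned in the Pinot step above, the schema and table can be registered through the Pinot controller's REST API. A minimal sketch, assuming the controller is exposed on localhost:9000 and the JSON files live in pinot_setup:

```python
# Register the Pinot schema and table via the controller REST API.
# Controller URL and file locations are assumptions based on this repo's layout.
import json

import requests

CONTROLLER = "http://localhost:9000"

with open("pinot_setup/schema.json") as f:
    schema = json.load(f)
with open("pinot_setup/table.json") as f:
    table = json.load(f)

resp = requests.post(f"{CONTROLLER}/schemas", json=schema)
resp.raise_for_status()
print("Schema added:", resp.text)

resp = requests.post(f"{CONTROLLER}/tables", json=table)
resp.raise_for_status()
print("Table added:", resp.text)
```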
-- Feel free to customize the Dockerfile and docker-compose.yml to run things the way you want; also review the Apache Superset configuration before spinning up Docker.
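Before wiring Pinot into Superset, it can help to confirm the connection from Python using the same pinotdb driver Superset relies on. A minimal check, where the broker port (8099) and the ride_events table name are assumptions:

```python
# Quick connectivity check against the Pinot broker with the pinotdb driver.
# Host, port, and table name are assumptions; adjust to your docker-compose setup.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM ride_events")  # table name is an assumption
for row in cur:
    print(row)
```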
The ride event schema contains the following fields (a producer sketch follows the list):
- ride_id (string)
- city (string)
- ride_type (string)
- start_time (timestamp)
- end_time (timestamp)
- duration_minutes (integer)
- distance_miles (float)
- fare_usd (float)
- driver_id (string)
- rider_id (string)
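For reference, here is a minimal sketch of a JSON producer emitting events with these fields, in the spirit of ride_event_json.py; the broker address, topic name, value ranges, and city list are all illustrative assumptions:

```python
# Sketch of a JSON ride-event producer matching the fields above.
# Broker address, topic, and the random generation details are assumptions.
import json
import random
import time
import uuid
from datetime import datetime, timedelta, timezone

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def random_ride_event() -> dict:
    start = datetime.now(timezone.utc)
    duration = random.randint(5, 60)
    return {
        "ride_id": str(uuid.uuid4()),
        "city": random.choice(["Lagos", "Nairobi", "Accra"]),
        "ride_type": random.choice(["standard", "premium", "pool"]),
        "start_time": start.isoformat(),
        "end_time": (start + timedelta(minutes=duration)).isoformat(),
        "duration_minutes": duration,
        "distance_miles": round(random.uniform(0.5, 25.0), 2),
        "fare_usd": round(random.uniform(3.0, 60.0), 2),
        "driver_id": f"driver-{random.randint(1, 500)}",
        "rider_id": f"rider-{random.randint(1, 5000)}",
    }


if __name__ == "__main__":
    while True:
        producer.send("ride-events", value=random_ride_event())
        producer.flush()
        time.sleep(1)
```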
The project is organized into a directory structure that facilitates the management and execution of data processing workflows using Apache Airflow. Below is an overview of the key directories and their purposes:
airflow
└── dags
├── bronze
├── packages
├── random_delta
└── spark_delta
- dags: This directory contains Directed Acyclic Graphs (DAGs) that define the workflows for data processing. Each DAG is a collection of tasks that are executed in a specific order.
- bronze: This folder is intended for the initial stage of data processing, often referred to as the "bronze" layer. It typically contains raw data ingested from various sources before any transformations are applied.
- packages: This directory contains the code for producing and consuming Kafka messages. It supports both Avro and JSON formats, allowing for flexible data interchange between different components of the pipeline. The files in this directory handle the serialization and deserialization of data, ensuring that it can be processed correctly by downstream systems.
- random_delta: This folder contains local code that facilitates the transfer of data from Kafka to Delta Lake. It includes scripts and utilities that manage the ingestion and transformation of data, ensuring that it is stored in an optimized format for querying and analysis.
- spark_delta: This directory contains the kafka_to_delta.py script, which runs the Spark job that processes data from Kafka and writes it to Delta Lake. The script leverages Spark's Structured Streaming capabilities to handle real-time data processing, ensuring that data is continuously ingested and made available for analytics (a minimal sketch follows).
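As referenced above, this is a minimal sketch of what a Kafka-to-Delta Structured Streaming job can look like; the topic name, paths, and session config are assumptions, and the actual kafka_to_delta.py may differ:

```python
# Sketch: read the ride-events topic, parse the JSON payload against the ride
# schema listed earlier, and append to a Delta table. Paths and options are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructType, TimestampType)

spark = (
    SparkSession.builder.appName("kafka_to_delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

ride_schema = (
    StructType()
    .add("ride_id", StringType())
    .add("city", StringType())
    .add("ride_type", StringType())
    .add("start_time", TimestampType())
    .add("end_time", TimestampType())
    .add("duration_minutes", IntegerType())
    .add("distance_miles", DoubleType())
    .add("fare_usd", DoubleType())
    .add("driver_id", StringType())
    .add("rider_id", StringType())
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "ride-events")
    .option("startingOffsets", "earliest")
    .load()
)

rides = raw.select(
    from_json(col("value").cast("string"), ride_schema).alias("ride")
).select("ride.*")

(
    rides.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/ride_events")
    .start("/tmp/delta/ride_events")
    .awaitTermination()
)
```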
-- Monitoring:
- Monitor Kafka topics using:
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ride-events --from-beginning
- Check Delta Lake tables using Spark SQL (see the sketch after this list)
- View processing metrics in the Spark UI
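As referenced in the checklist above, here is a minimal sketch for spot-checking the Delta output with Spark SQL, including the time-travel read mentioned in the Delta Lake section; the table path and version number are assumptions:

```python
# Spot-check the Delta table with Spark SQL and Delta time travel.
# The path below matches the earlier sketch and is an assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta_checks")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Latest state of the table, queried through a temp view.
spark.read.format("delta").load("/tmp/delta/ride_events") \
    .createOrReplaceTempView("ride_events")
spark.sql("SELECT city, count(*) AS rides FROM ride_events GROUP BY city").show()

# Time travel: read the table as of an earlier version.
spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/tmp/delta/ride_events") \
    .show(5)
```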
-- Contributing:
- Fork the repository
- Create a feature branch
- Submit a pull request
-- Next Steps:
- Testing Coverage
- Monitoring through Grafana, DataDog
- Data Quality Checks
- More interesting data and more Kafka topics/events
To make this production-ready, things like governance, security, cloud deployment, and CI/CD would need to be put in place; this was just a fun side project. Next time I would also try out different technologies such as Prefect, Dagster, Flink, S3 or MinIO, and Terraform, and spice it up with monitoring and data quality checks, among other cool things. An important note is to learn more about event processing and schema evolution, which Delta Lake addresses with autoMerge (see below).
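For reference, the autoMerge behaviour mentioned above is a single Spark config on the session (shown here on a bare session; the Delta packages still need to be on the classpath, as in the earlier sketches):

```python
# Enable Delta's autoMerge so MERGE operations can evolve the target schema;
# for plain streaming/batch appends, the per-write option("mergeSchema", "true")
# serves a similar purpose.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("automerge_demo").getOrCreate()
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```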
The goal was to reduce my skill issues; mission accomplished.