I took this project on after reading about Uber and deciding to build a ride-sharing event pipeline as if it were for my own startup. The project implements a real-time data pipeline for processing ride-sharing events using Apache Kafka, Apache Spark, Apache Pinot (for streaming analytics), Apache Superset (for visualizations), and Delta Lake, orchestrated with Apache Airflow. (It still needs improvement.)
The pipeline consists of the following components (see assets/data-pipeline-graph.md). A brief overview:
- Kafka Producers
  - ride_event_producer.py: Generates ride events in Avro format
  - ride_event_json.py: Generates ride events in JSON format
  - ride_summary_producer.py: Produces ride summary events
- Apache Spark Processing
  - Uses Spark Structured Streaming to consume Kafka topics
  - Processes and transforms ride event data
  - Writes data to Delta Lake tables
- Delta Lake Storage
  - Provides ACID transactions and versioning
  - Enables time travel capabilities
  - Stores processed ride data in an optimized format
- Streaming to Apache Pinot (OLAP)
  - Streams events from Kafka into Apache Pinot for OLAP queries
- Data Visualization with Apache Superset
  - Connects Apache Pinot to Superset for dashboards (I am big on analytics, too)
- Orchestration with Airflow
  - Apache Airflow schedules and orchestrates the Kafka producers
  - Schedules the streaming write from Kafka to Delta Lake (a minimal DAG sketch follows this list)
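To make the orchestration concrete, here is a minimal sketch of what the two DAGs (kafka_producer_dag with a PythonOperator, kafka_to_delta with a BashOperator running spark-submit) might look like. The imported callable, file paths, schedules, and package versions are assumptions, not the repo's actual code:

```python
# Minimal sketch, assuming the producer code in dags/packages exposes a
# produce_ride_events() callable; the real DAGs in this repo may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from packages.ride_event_json import produce_ride_events  # assumed entry point

with DAG(
    dag_id="kafka_producer_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # illustrative schedule
    catchup=False,
) as producer_dag:
    PythonOperator(
        task_id="produce_ride_events",
        python_callable=produce_ride_events,
    )

with DAG(
    dag_id="kafka_to_delta",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually from the UI
    catchup=False,
) as spark_dag:
    BashOperator(
        task_id="spark_kafka_to_delta",
        # Package coordinates and the script path are illustrative; match your
        # Spark/Delta versions and your Airflow volume mounts.
        bash_command=(
            "spark-submit "
            "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,"
            "io.delta:delta-spark_2.12:3.1.0 "
            "/opt/airflow/dags/spark_delta/kafka_to_delta.py"
        ),
    )
```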
-- Setup Overview:
Services are set up in docker-compose.yml; edit it to your liking, in particular the apache-superset service.
- Make sure Docker is running, then in your terminal run:
  docker build -t custom-airflow .
  docker compose up -d
  docker exec airflow-cli airflow users create --username USERNAME --firstname FIRST_NAME --lastname LASTNAME --email airflow@email.com --role Admin --password ******
- Run your Airflow DAGs on localhost:8080 and trigger them:
  - kafka_producer_dag: uses the PythonOperator
  - kafka_to_delta: uses the BashOperator to launch the Spark job
- Set up Apache Pinot on localhost:9000: open the Swagger API (or use the Docker CLI) to call AddSchema and AddTable; you will find schema.json and table.json in the pinot_setup folder (see the sketch after this list for doing the same through the controller's REST API).
- Set up Apache Superset: first install the Apache Pinot driver (check here), then open localhost:8088, add the Pinot connection configuration, and you are good to go with visualizations (a quick connection check is sketched below).
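As an alternative to the Swagger UI mentioned in the Pinot step above, the schema and table can be registered through the Pinot controller's REST API. A minimal sketch, assuming the controller is exposed on localhost:9000 and the JSON files live in pinot_setup:

```python
# Register the Pinot schema and table via the controller REST API.
# Controller URL and file locations are assumptions based on this repo's layout.
import json

import requests

CONTROLLER = "http://localhost:9000"

with open("pinot_setup/schema.json") as f:
    schema = json.load(f)
with open("pinot_setup/table.json") as f:
    table = json.load(f)

resp = requests.post(f"{CONTROLLER}/schemas", json=schema)
resp.raise_for_status()
print("Schema added:", resp.text)

resp = requests.post(f"{CONTROLLER}/tables", json=table)
resp.raise_for_status()
print("Table added:", resp.text)
```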
-- Feel free to customize the Dockerfile and docker-compose.yml to run things the way you want; also review the Apache Superset configuration before spinning up Docker.
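Before wiring Pinot into Superset, it can help to confirm the connection from Python using the same pinotdb driver Superset relies on. A minimal check, where the broker port (8099) and the ride_events table name are assumptions:

```python
# Quick connectivity check against the Pinot broker with the pinotdb driver.
# Host, port, and table name are assumptions; adjust to your docker-compose setup.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM ride_events")  # table name is an assumption
for row in cur:
    print(row)
```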
The ride event schema contains the following fields (a producer sketch follows the list):
- ride_id (string)
- city (string)
- ride_type (string)
- start_time (timestamp)
- end_time (timestamp)
- duration_minutes (integer)
- distance_miles (float)
- fare_usd (float)
- driver_id (string)
- rider_id (string)
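For reference, here is a minimal sketch of a JSON producer emitting events with these fields, in the spirit of ride_event_json.py; the broker address, topic name, value ranges, and city list are all illustrative assumptions:

```python
# Sketch of a JSON ride-event producer matching the fields above.
# Broker address, topic, and the random generation details are assumptions.
import json
import random
import time
import uuid
from datetime import datetime, timedelta, timezone

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def random_ride_event() -> dict:
    start = datetime.now(timezone.utc)
    duration = random.randint(5, 60)
    return {
        "ride_id": str(uuid.uuid4()),
        "city": random.choice(["Lagos", "Nairobi", "Accra"]),
        "ride_type": random.choice(["standard", "premium", "pool"]),
        "start_time": start.isoformat(),
        "end_time": (start + timedelta(minutes=duration)).isoformat(),
        "duration_minutes": duration,
        "distance_miles": round(random.uniform(0.5, 25.0), 2),
        "fare_usd": round(random.uniform(3.0, 60.0), 2),
        "driver_id": f"driver-{random.randint(1, 500)}",
        "rider_id": f"rider-{random.randint(1, 5000)}",
    }


if __name__ == "__main__":
    while True:
        producer.send("ride-events", value=random_ride_event())
        producer.flush()
        time.sleep(1)
```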
The project is organized into a directory structure that facilitates the management and execution of data processing workflows using Apache Airflow. Below is an overview of the key directories and their purposes:
airflow
└── dags
├── bronze
├── packages
├── random_delta
└── spark_delta
- dags: This directory contains Directed Acyclic Graphs (DAGs) that define the workflows for data processing. Each DAG is a collection of tasks that are executed in a specific order.
- bronze: This folder is intended for the initial stage of data processing, often referred to as the "bronze" layer. It typically contains raw data ingested from various sources before any transformations are applied.
- packages: This directory contains the code for producing and consuming Kafka messages. It supports both Avro and JSON formats, allowing for flexible data interchange between different components of the pipeline. The files in this directory handle the serialization and deserialization of data, ensuring that it can be processed correctly by downstream systems.
- random_delta: This folder contains local code that facilitates the transfer of data from Kafka to Delta Lake. It includes scripts and utilities that manage the ingestion and transformation of data, ensuring that it is stored in an optimized format for querying and analysis.
- spark_delta: This directory contains the kafka_to_delta.py script, which runs the Spark job that processes data from Kafka and writes it to Delta Lake. The script leverages Spark's Structured Streaming capabilities to handle real-time data processing, ensuring that data is continuously ingested and made available for analytics (a minimal sketch follows).
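As referenced above, this is a minimal sketch of what a Kafka-to-Delta Structured Streaming job can look like; the topic name, paths, and session config are assumptions, and the actual kafka_to_delta.py may differ:

```python
# Sketch: read the ride-events topic, parse the JSON payload against the ride
# schema listed earlier, and append to a Delta table. Paths and options are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructType, TimestampType)

spark = (
    SparkSession.builder.appName("kafka_to_delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

ride_schema = (
    StructType()
    .add("ride_id", StringType())
    .add("city", StringType())
    .add("ride_type", StringType())
    .add("start_time", TimestampType())
    .add("end_time", TimestampType())
    .add("duration_minutes", IntegerType())
    .add("distance_miles", DoubleType())
    .add("fare_usd", DoubleType())
    .add("driver_id", StringType())
    .add("rider_id", StringType())
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "ride-events")
    .option("startingOffsets", "earliest")
    .load()
)

rides = raw.select(
    from_json(col("value").cast("string"), ride_schema).alias("ride")
).select("ride.*")

(
    rides.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/ride_events")
    .start("/tmp/delta/ride_events")
    .awaitTermination()
)
```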
-- Monitoring:
- Monitor Kafka topics using:
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ride-events --from-beginning
- Check Delta Lake tables using Spark SQL (see the sketch after this list)
- View processing metrics in the Spark UI
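As referenced in the checklist above, here is a minimal sketch for spot-checking the Delta output with Spark SQL, including the time-travel read mentioned in the Delta Lake section; the table path and version number are assumptions:

```python
# Spot-check the Delta table with Spark SQL and Delta time travel.
# The path below matches the earlier sketch and is an assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta_checks")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Latest state of the table, queried through a temp view.
spark.read.format("delta").load("/tmp/delta/ride_events") \
    .createOrReplaceTempView("ride_events")
spark.sql("SELECT city, count(*) AS rides FROM ride_events GROUP BY city").show()

# Time travel: read the table as of an earlier version.
spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/tmp/delta/ride_events") \
    .show(5)
```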
-- Contributing:
- Fork the repository
- Create a feature branch
- Submit a pull request
-- Next Steps:
- Testing Coverage
- Monitoring through Grafana, DataDog
- Data Quality Checks
- More interesting data and more Kafka topics/events
To make this production-ready, things like governance, security, cloud deployment, and CI/CD would need to be put in place; this was just a fun side project. Next time I would also try out different technologies such as Prefect, Dagster, Flink, S3 or MinIO, and Terraform, and spice it up with monitoring and data quality checks, among other cool things. An important note is to learn more about event processing and schema evolution, which Delta Lake addresses with autoMerge (see below).
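For reference, the autoMerge behaviour mentioned above is a single Spark config on the session (shown here on a bare session; the Delta packages still need to be on the classpath, as in the earlier sketches):

```python
# Enable Delta's autoMerge so MERGE operations can evolve the target schema;
# for plain streaming/batch appends, the per-write option("mergeSchema", "true")
# serves a similar purpose.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("automerge_demo").getOrCreate()
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```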
The goal was to reduce my skill issues; mission accomplished.