Commit 6c98a44: Update README.md
1 parent 26c83e5
File tree: 1 file changed (+12, -12 lines)

README.md

Lines changed: 12 additions & 12 deletions
@@ -1,11 +1,11 @@
-# portable-etl
-![CI](https://github.com/syedhassaanahmed/portable-etl/actions/workflows/ci.yml/badge.svg)
+# spark-with-engineering-fundamentals
+![CI](https://github.com/syedhassaanahmed/spark-with-engineering-fundamentals/actions/workflows/ci.yml/badge.svg)
 
 ## Why?
-Workload portability is important to manufacturing customers, as it allows them to operate solutions across different environments without the need to re-architect or re-write large sections of code. They can easily move from the Edge to the cloud, depending on their specific requirements. It also enables them to analyze and make real-time decisions at the source of the data and reduces their dependency on a central location for data processing. [Apache Spark](https://spark.apache.org/docs/latest/)'s rich ecosystem of data connectors, availability in [the cloud](https://azure.microsoft.com/en-us/products/databricks) and the Edge ([Docker](https://hub.docker.com/r/apache/spark-py) & [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)), and a [thriving open source community](https://github.com/apache/spark) makes it an ideal candidate for portable ETL workloads.
+[Apache Spark](https://spark.apache.org/docs/latest/) is a popular open-source analytics engine for large-scale data processing. When building Spark applications as part of a production-grade solution, developers need to take care of engineering fundamentals such as the inner dev loop, testing, CI/CD, infra-as-code and observability.
 
 ## What?
-In this sample we'll showcase an E2E data pipeline leveraging Spark's data processing capabilities.
+In this **work-in-progress** sample we'll demonstrate an E2E Spark data pipeline and how to tackle the above-mentioned engineering fundamentals.
 
 ## How?
 ### Cloud
@@ -20,29 +20,29 @@ The ETL workload is represented in a [Databricks Job](https://learn.microsoft.co
 <img src="./docs/cloud-architecture.png">
 </div>
 
-### Edge
-In the Edge version, we provision and orchestrate everything with [Docker Compose](https://docs.docker.com/compose/).
+### Local
+In the local version, we provision and orchestrate everything with [Docker Compose](https://docs.docker.com/compose/).
 
 **Note:** Please use the `docker compose` tool instead of the [older version](https://stackoverflow.com/a/66516826) `docker-compose`.
 
-The pipeline begins with [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator) sending synthetic Time Series data to a [Confluent Community Kafka Server](https://docs.confluent.io/platform/current/platform-quickstart.html#ce-docker-quickstart). A PySpark app then processes the Time Series, applies some metadata and writes the enriched results to a SQL DB hosted in [SQL Server 2022 Linux container](https://learn.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker?view=sql-server-ver16&pivots=cs1-bash). The key point to note here is that the data processing logic is shared between the Cloud and Edge through the `common_lib` Wheel.
+The pipeline begins with [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator) sending synthetic Time Series data to a [Confluent Community Kafka Server](https://docs.confluent.io/platform/current/platform-quickstart.html#ce-docker-quickstart). A PySpark app then processes the Time Series, enriches it with metadata and writes the results to a SQL DB hosted in a [SQL Server 2022 Linux container](https://learn.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker?view=sql-server-ver16&pivots=cs1-bash). The key point is that the data processing logic is shared between the cloud and local versions through the `common_lib` Wheel.
 
 <div align="center">
-<img src="./docs/edge-architecture.png">
+<img src="./docs/local-architecture.png">
 </div>
 
 ## NFRs
 
 ### Tests
-- To validate that the E2E Edge pipeline is working correctly, we can execute the script `smoke-test.sh`. This script will send messages using the IoT Telemetry Simulator and then query the SQL DB to ensure the messages were processed correctly.
+- To validate that the local E2E pipeline is working correctly, we can execute the script `smoke-test.sh`. This script will send messages using the IoT Telemetry Simulator and then query the SQL DB to ensure the messages were processed correctly.
 - Unit tests are available for the `common_lib` Wheel in PyTest.
 - Both types of tests are also executed in the CI pipeline.
 
 ### Observability
-The Edge version of the solution also deploys additional containers for [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/). The Grafana dashboard below, relies on the [Spark 3.0 metrics](https://spark.apache.org/docs/3.0.0/monitoring.html) emitted in the Prometheus format.
+The local version of the solution also deploys additional containers for [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/). The Grafana dashboard below relies on the [Spark 3.0 metrics](https://spark.apache.org/docs/3.0.0/monitoring.html) emitted in the Prometheus format.
 
 <div align="center">
-<img src="./docs/edge-grafana.png">
+<img src="./docs/local-grafana.png">
 </div>
 
 ### Inner Dev Loop
@@ -51,4 +51,4 @@ The Edge version of the solution also deploys additional containers for [Prometh
 ## Team
 - [Alexander Gassmann](https://github.com/Salazander)
 - [Magda Baran](https://github.com/MagdaPaj)
-- [Hassaan Ahmed](https://github.com/syedhassaanahmed)
+- [Hassaan Ahmed](https://github.com/syedhassaanahmed)
