Commit 6c98a44: Update README.md
1 parent 26c83e5
File tree: 1 file changed (+12, -12 lines)

README.md

Lines changed: 12 additions & 12 deletions
@@ -1,11 +1,11 @@
-# portable-etl
-![CI](https://github.com/syedhassaanahmed/portable-etl/actions/workflows/ci.yml/badge.svg)
+# spark-with-engineering-fundamentals
+![CI](https://github.com/syedhassaanahmed/spark-with-engineering-fundamentals/actions/workflows/ci.yml/badge.svg)
 
 ## Why?
-Workload portability is important to manufacturing customers, as it allows them to operate solutions across different environments without the need to re-architect or re-write large sections of code. They can easily move from the Edge to the cloud, depending on their specific requirements. It also enables them to analyze and make real-time decisions at the source of the data and reduces their dependency on a central location for data processing. [Apache Spark](https://spark.apache.org/docs/latest/)'s rich ecosystem of data connectors, availability in [the cloud](https://azure.microsoft.com/en-us/products/databricks) and the Edge ([Docker](https://hub.docker.com/r/apache/spark-py) & [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)), and a [thriving open source community](https://github.com/apache/spark) makes it an ideal candidate for portable ETL workloads.
+[Apache Spark](https://spark.apache.org/docs/latest/) is a popular open-source analytics engine for large-scale data processing. When building Spark applications as part of a production-grade solution, developers need to take care of engineering fundamentals such as the inner dev loop, testing, CI/CD, infra-as-code and observability.
 
 ## What?
-In this sample we'll showcase an E2E data pipeline leveraging Spark's data processing capabilities.
+In this **work-in-progress** sample we'll demonstrate an E2E Spark data pipeline and how to tackle the above-mentioned engineering fundamentals.
 
 ## How?
 ### Cloud
@@ -20,29 +20,29 @@ The ETL workload is represented in a [Databricks Job](https://learn.microsoft.co
 <img src="./docs/cloud-architecture.png">
 </div>
 
-### Edge
-In the Edge version, we provision and orchestrate everything with [Docker Compose](https://docs.docker.com/compose/).
+### Local
+In the local version, we provision and orchestrate everything with [Docker Compose](https://docs.docker.com/compose/).
 
 **Note:** Please use the `docker compose` tool instead of the [older version](https://stackoverflow.com/a/66516826) `docker-compose`.
 
-The pipeline begins with [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator) sending synthetic Time Series data to a [Confluent Community Kafka Server](https://docs.confluent.io/platform/current/platform-quickstart.html#ce-docker-quickstart). A PySpark app then processes the Time Series, applies some metadata and writes the enriched results to a SQL DB hosted in [SQL Server 2022 Linux container](https://learn.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker?view=sql-server-ver16&pivots=cs1-bash). The key point to note here is that the data processing logic is shared between the Cloud and Edge through the `common_lib` Wheel.
+The pipeline begins with [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator) sending synthetic Time Series data to a [Confluent Community Kafka Server](https://docs.confluent.io/platform/current/platform-quickstart.html#ce-docker-quickstart). A PySpark app then processes the Time Series, enriches it with metadata and writes the results to a SQL DB hosted in a [SQL Server 2022 Linux container](https://learn.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker?view=sql-server-ver16&pivots=cs1-bash). The key point is that the data processing logic is shared between the cloud and local versions through the `common_lib` Wheel.
 
 <div align="center">
-<img src="./docs/edge-architecture.png">
+<img src="./docs/local-architecture.png">
 </div>
 
 ## NFRs
 
 ### Tests
-- To validate that the E2E Edge pipeline is working correctly, we can execute the script `smoke-test.sh`. This script will send messages using the IoT Telemetry Simulator and then query the SQL DB to ensure the messages were processed correctly.
+- To validate that the local E2E pipeline is working correctly, we can execute the script `smoke-test.sh`. This script will send messages using the IoT Telemetry Simulator and then query the SQL DB to ensure the messages were processed correctly.
 - Unit tests are available for the `common_lib` Wheel in PyTest.
 - Both types of tests are also executed in the CI pipeline.
 
 ### Observability
-The Edge version of the solution also deploys additional containers for [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/). The Grafana dashboard below, relies on the [Spark 3.0 metrics](https://spark.apache.org/docs/3.0.0/monitoring.html) emitted in the Prometheus format.
+The local version of the solution also deploys additional containers for [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/). The Grafana dashboard below relies on the [Spark 3.0 metrics](https://spark.apache.org/docs/3.0.0/monitoring.html) emitted in the Prometheus format.
 
 <div align="center">
-<img src="./docs/edge-grafana.png">
+<img src="./docs/local-grafana.png">
 </div>
 
 ### Inner Dev Loop
@@ -51,4 +51,4 @@ The Edge version of the solution also deploys additional containers for [Prometh
 ## Team
 - [Alexander Gassmann](https://github.com/Salazander)
 - [Magda Baran](https://github.com/MagdaPaj)
-- [Hassaan Ahmed](https://github.com/syedhassaanahmed)
+- [Hassaan Ahmed](https://github.com/syedhassaanahmed)
