Commit 2b21847

update spark
1 parent cb15cd1 commit 2b21847

4 files changed: +244 −148 lines

Chapter5/spark.ipynb

Lines changed: 91 additions & 55 deletions
@@ -1099,7 +1099,7 @@
 "id": "a2d96783",
 "metadata": {},
 "source": [
-"### PySpark SQL: Enhancing Reusability with Parameterized Queries"
+"### Writing Safer and Cleaner Spark SQL with PySpark's Parameterized Queries"
 ]
 },
 {
@@ -1117,120 +1117,156 @@
 ]
 },
 {
-"cell_type": "markdown",
-"id": "0ddc2bc2",
+"cell_type": "code",
+"execution_count": 12,
+"id": "8056b8af",
 "metadata": {},
+"outputs": [],
 "source": [
-"In PySpark, parametrized queries enable the same query structure to be reused with different inputs, without rewriting the SQL.\n",
+"from pyspark.sql import SparkSession\n",
+"import pandas as pd\n",
+"from datetime import date, timedelta\n",
 "\n",
-"Additionally, they safeguard against SQL injection attacks by treating input data as parameters rather than as executable code."
+"spark = SparkSession.builder.getOrCreate()"
 ]
 },
 {
-"cell_type": "code",
-"execution_count": null,
-"id": "8056b8af",
+"cell_type": "markdown",
+"id": "0ddc2bc2",
 "metadata": {},
-"outputs": [],
 "source": [
-"from pyspark.sql import SparkSession\n",
-"import pandas as pd \n",
+"When working with Spark SQL queries, using regular Python string interpolation can lead to security vulnerabilities and require extra steps like creating temporary views. PySpark offers a better solution with parameterized queries, which:\n",
 "\n",
-"spark = SparkSession.builder.getOrCreate()"
+"- Protect against SQL injection\n",
+"- Allow using DataFrame objects directly in queries\n",
+"- Automatically handle date formatting\n",
+"- Provide a more expressive way to write SQL queries\n",
+"\n",
+"Let's compare the traditional approach with parameterized queries:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 13,
 "id": "cc5f3c19",
 "metadata": {},
 "outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-" \r"
-]
-},
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"+-------+-----+\n",
-"|item_id|price|\n",
-"+-------+-----+\n",
-"|      1|    4|\n",
-"|      2|    2|\n",
-"|      3|    5|\n",
-"|      4|    1|\n",
-"+-------+-----+\n",
+"+-------+-----+----------------+\n",
+"|item_id|price|transaction_date|\n",
+"+-------+-----+----------------+\n",
+"|      1|    4|      2023-01-15|\n",
+"|      2|    2|      2023-02-01|\n",
+"|      3|    5|      2023-03-10|\n",
+"|      4|    1|      2023-04-22|\n",
+"+-------+-----+----------------+\n",
 "\n"
 ]
 }
 ],
 "source": [
 "# Create a Spark DataFrame\n",
-"item_price_pandas = pd.DataFrame({\"item_id\": [1, 2, 3, 4], \"price\": [4, 2, 5, 1]})\n",
+"item_price_pandas = pd.DataFrame({\n",
+"    \"item_id\": [1, 2, 3, 4],\n",
+"    \"price\": [4, 2, 5, 1],\n",
+"    \"transaction_date\": [\n",
+"        date(2023, 1, 15),\n",
+"        date(2023, 2, 1),\n",
+"        date(2023, 3, 10),\n",
+"        date(2023, 4, 22)\n",
+"    ]\n",
+"})\n",
+"\n",
 "item_price = spark.createDataFrame(item_price_pandas)\n",
 "item_price.show()"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "fcfcc76a-5b3e-41b3-819f-14adf8576061",
+"metadata": {},
+"source": [
+"Traditional approach (less secure, requires temp view and wrapping the date in quotes):"
+]
+},
 {
 "cell_type": "code",
-"execution_count": 16,
-"id": "90976e5b",
+"execution_count": 19,
+"id": "451c6d69-8f0d-4b5f-a030-873ed6c5295e",
 "metadata": {},
 "outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-" \r"
-]
-},
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"+-------+-----+\n",
-"|item_id|price|\n",
-"+-------+-----+\n",
-"|      1|    4|\n",
-"+-------+-----+\n",
+"+-------+-----+----------------+\n",
+"|item_id|price|transaction_date|\n",
+"+-------+-----+----------------+\n",
+"|      3|    5|      2023-03-10|\n",
+"|      4|    1|      2023-04-22|\n",
+"+-------+-----+----------------+\n",
 "\n"
 ]
 }
 ],
 "source": [
-"query = \"\"\"SELECT item_id, price \n",
-"FROM {item_price} \n",
-"WHERE item_id = {id_val} \n",
+"item_price.createOrReplaceTempView(\"item_price_view\")\n",
+"transaction_date = \"2023-02-15\"\n",
+"\n",
+"query = f\"\"\"SELECT *\n",
+"FROM item_price_view \n",
+"WHERE transaction_date > '{transaction_date}'\n",
 "\"\"\"\n",
 "\n",
-"spark.sql(query, id_val=1, item_price=item_price).show()"
+"spark.sql(query).show()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "d92eecf2-e753-4d4c-8122-713aa160fd98",
+"metadata": {},
+"source": [
+"PySpark's parameterized query approach (secure, no temp view and quotes needed):"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 17,
-"id": "44634ce8",
+"execution_count": 20,
+"id": "90976e5b",
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"+-------+-----+\n",
-"|item_id|price|\n",
-"+-------+-----+\n",
-"|      2|    2|\n",
-"+-------+-----+\n",
+"+-------+-----+----------------+\n",
+"|item_id|price|transaction_date|\n",
+"+-------+-----+----------------+\n",
+"|      3|    5|      2023-03-10|\n",
+"|      4|    1|      2023-04-22|\n",
+"+-------+-----+----------------+\n",
 "\n"
 ]
 }
 ],
 "source": [
-"spark.sql(query, id_val=2, item_price=item_price).show()"
+"query = \"\"\"SELECT *\n",
+"FROM {item_price} \n",
+"WHERE transaction_date > {transaction_date}\n",
+"\"\"\"\n",
+"\n",
+"spark.sql(query, item_price=item_price, transaction_date=transaction_date).show()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "86a79ac8-70d0-458d-a4ac-4d32e897d5d2",
+"metadata": {},
+"source": [
+"This method allows for easy parameter substitution and direct use of DataFrames, making your Spark SQL queries both safer and more convenient to write and maintain."
 ]
 },
 {
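For reference outside the notebook JSON, here is the comparison this commit introduces, reassembled as a minimal standalone sketch. It is not part of the commit itself, and it assumes a PySpark version recent enough that spark.sql() accepts DataFrames and Python values as keyword arguments for {name} placeholders, which is the API the new cells rely on.

from datetime import date

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data with a date column, as in the updated notebook cell.
item_price = spark.createDataFrame(pd.DataFrame({
    "item_id": [1, 2, 3, 4],
    "price": [4, 2, 5, 1],
    "transaction_date": [
        date(2023, 1, 15),
        date(2023, 2, 1),
        date(2023, 3, 10),
        date(2023, 4, 22),
    ],
}))

transaction_date = "2023-02-15"

# Traditional approach: f-string interpolation. Needs a temp view,
# manual quoting of the date, and is open to SQL injection.
item_price.createOrReplaceTempView("item_price_view")
spark.sql(f"""SELECT *
FROM item_price_view
WHERE transaction_date > '{transaction_date}'
""").show()

# Parameterized approach: the DataFrame and the date are passed as
# keyword arguments and substituted safely; no temp view or quoting.
query = """SELECT *
FROM {item_price}
WHERE transaction_date > {transaction_date}
"""
spark.sql(query, item_price=item_price, transaction_date=transaction_date).show()

Both versions print the two rows dated after 2023-02-15; only the parameterized one stays safe when transaction_date comes from untrusted input.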
