Commit f7e44a3

add unit testing with pyspark
1 parent 85446bd commit f7e44a3

File tree

4 files changed: +282 -1 lines changed


Chapter5/spark.ipynb

Lines changed: 106 additions & 0 deletions
@@ -1353,6 +1353,112 @@
     "result1.show()\n",
     "result2.show()"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9da7e800",
+   "metadata": {},
+   "source": [
+    "### Simplify Unit Testing of SQL Queries with PySpark"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a1f400b",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "!pip install ipytest \"pyspark[sql]\"\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "id": "e1bcfd44",
+   "metadata": {
+    "tags": [
+     "remove-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "import ipytest\n",
+    "ipytest.autoconfig()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "954ea695",
+   "metadata": {},
+   "source": [
+    "Testing your SQL queries helps to ensure that they are correct and functioning as intended.\n",
+    "\n",
+    "PySpark enables users to parameterize queries, which simplifies unit testing of SQL queries. In this example, the `df` and `amount` variables are parameterized to verify whether the `actual_df` matches the `expected_df`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "id": "14d313f8",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      " \r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[32m.\u001b[0m\u001b[32m [100%]\u001b[0m\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%ipytest -qq\n",
+    "import pytest\n",
+    "from pyspark.testing import assertDataFrameEqual\n",
+    "\n",
+    "\n",
+    "@pytest.fixture\n",
+    "def query():\n",
+    "    return \"SELECT * from {df} where price > {amount} AND name LIKE '%Product%';\"\n",
+    "\n",
+    "\n",
+    "def test_query_return_correct_number_of_rows(query):\n",
+    "\n",
+    "    spark = SparkSession.builder.getOrCreate()\n",
+    "\n",
+    "    # Create a sample DataFrame\n",
+    "    df = spark.createDataFrame(\n",
+    "        [\n",
+    "            (\"Product 1\", 10.0, 5),\n",
+    "            (\"Product 2\", 15.0, 3),\n",
+    "            (\"Product 3\", 8.0, 2),\n",
+    "        ],\n",
+    "        [\"name\", \"price\", \"quantity\"],\n",
+    "    )\n",
+    "\n",
+    "    # Execute the query\n",
+    "    actual_df = spark.sql(query, df=df, amount=10)\n",
+    "\n",
+    "    # Assert the result\n",
+    "    expected_df = spark.createDataFrame(\n",
+    "        [\n",
+    "            (\"Product 2\", 15.0, 3),\n",
+    "        ],\n",
+    "        [\"name\", \"price\", \"quantity\"],\n",
+    "    )\n",
+    "    assertDataFrameEqual(actual_df, expected_df)"
+   ]
   }
  ],
  "metadata": {

docs/Chapter5/spark.html

Lines changed: 69 additions & 0 deletions
@@ -518,6 +518,7 @@ <h2> Contents </h2>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#working-with-arrays-made-easier-in-spark-3-5">6.15.4. Working with Arrays Made Easier in Spark 3.5</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-complex-sql-queries-with-pyspark-udfs">6.15.5. Simplify Complex SQL Queries with PySpark UDFs</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#leverage-spark-udfs-for-reusable-complex-logic-in-sql-queries">6.15.6. Leverage Spark UDFs for Reusable Complex Logic in SQL Queries</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-unit-testing-of-sql-queries-with-pyspark">6.15.7. Simplify Unit Testing of SQL Queries with PySpark</a></li>
 </ul>
 </nav>
 </div>
@@ -1452,6 +1453,73 @@ <h2><span class="section-number">6.15.6. </span>Leverage Spark UDFs for Reusable
 </div>
 </div>
 </section>
+<section id="simplify-unit-testing-of-sql-queries-with-pyspark">
+<h2><span class="section-number">6.15.7. </span>Simplify Unit Testing of SQL Queries with PySpark<a class="headerlink" href="#simplify-unit-testing-of-sql-queries-with-pyspark" title="Permalink to this heading">#</a></h2>
+<div class="cell tag_hide-cell docutils container">
+<details class="hide above-input">
+<summary aria-label="Toggle hidden content">
+<span class="collapsed">Show code cell content</span>
+<span class="expanded">Hide code cell content</span>
+</summary>
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>ipytest<span class="w"> </span><span class="s2">&quot;pyspark[sql]&quot;</span>
+</pre></div>
+</div>
+</div>
+</details>
+</div>
+<p>Testing your SQL queries helps to ensure that they are correct and functioning as intended.</p>
+<p>PySpark enables users to parameterize queries, which simplifies unit testing of SQL queries. In this example, the <code class="docutils literal notranslate"><span class="pre">df</span></code> and <code class="docutils literal notranslate"><span class="pre">amount</span></code> variables are parameterized to verify whether the <code class="docutils literal notranslate"><span class="pre">actual_df</span></code> matches the <code class="docutils literal notranslate"><span class="pre">expected_df</span></code>.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">%%</span><span class="k">ipytest</span> -qq
+import pytest
+from pyspark.testing import assertDataFrameEqual
+
+
+@pytest.fixture
+def query():
+    return &quot;SELECT * from {df} where price &gt; {amount} AND name LIKE &#39;%Product%&#39;;&quot;
+
+
+def test_query_return_correct_number_of_rows(query):
+
+    spark = SparkSession.builder.getOrCreate()
+
+    # Create a sample DataFrame
+    df = spark.createDataFrame(
+        [
+            (&quot;Product 1&quot;, 10.0, 5),
+            (&quot;Product 2&quot;, 15.0, 3),
+            (&quot;Product 3&quot;, 8.0, 2),
+        ],
+        [&quot;name&quot;, &quot;price&quot;, &quot;quantity&quot;],
+    )
+
+    # Execute the query
+    actual_df = spark.sql(query, df=df, amount=10)
+
+    # Assert the result
+    expected_df = spark.createDataFrame(
+        [
+            (&quot;Product 2&quot;, 15.0, 3),
+        ],
+        [&quot;name&quot;, &quot;price&quot;, &quot;quantity&quot;],
+    )
+    assertDataFrameEqual(actual_df, expected_df)
+</pre></div>
+</div>
+</div>
+<div class="cell_output docutils container">
+<div class="output stderr highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>
+</pre></div>
+</div>
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span><span class=" -Color -Color-Green">. [100%]</span>
+</pre></div>
+</div>
+</div>
+</div>
+</section>
 </section>

 <script type="text/x-thebe-config">
@@ -1523,6 +1591,7 @@ <h2><span class="section-number">6.15.6. </span>Leverage Spark UDFs for Reusable
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#working-with-arrays-made-easier-in-spark-3-5">6.15.4. Working with Arrays Made Easier in Spark 3.5</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-complex-sql-queries-with-pyspark-udfs">6.15.5. Simplify Complex SQL Queries with PySpark UDFs</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#leverage-spark-udfs-for-reusable-complex-logic-in-sql-queries">6.15.6. Leverage Spark UDFs for Reusable Complex Logic in SQL Queries</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-unit-testing-of-sql-queries-with-pyspark">6.15.7. Simplify Unit Testing of SQL Queries with PySpark</a></li>
 </ul>
 </nav></div>

docs/_sources/Chapter5/spark.ipynb

Lines changed: 106 additions & 0 deletions
@@ -1353,6 +1353,112 @@
     "result1.show()\n",
     "result2.show()"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9da7e800",
+   "metadata": {},
+   "source": [
+    "### Simplify Unit Testing of SQL Queries with PySpark"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a1f400b",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "!pip install ipytest \"pyspark[sql]\"\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "id": "e1bcfd44",
+   "metadata": {
+    "tags": [
+     "remove-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "import ipytest\n",
+    "ipytest.autoconfig()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "954ea695",
+   "metadata": {},
+   "source": [
+    "Testing your SQL queries helps to ensure that they are correct and functioning as intended.\n",
+    "\n",
+    "PySpark enables users to parameterize queries, which simplifies unit testing of SQL queries. In this example, the `df` and `amount` variables are parameterized to verify whether the `actual_df` matches the `expected_df`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "id": "14d313f8",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      " \r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[32m.\u001b[0m\u001b[32m [100%]\u001b[0m\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%ipytest -qq\n",
+    "import pytest\n",
+    "from pyspark.testing import assertDataFrameEqual\n",
+    "\n",
+    "\n",
+    "@pytest.fixture\n",
+    "def query():\n",
+    "    return \"SELECT * from {df} where price > {amount} AND name LIKE '%Product%';\"\n",
+    "\n",
+    "\n",
+    "def test_query_return_correct_number_of_rows(query):\n",
+    "\n",
+    "    spark = SparkSession.builder.getOrCreate()\n",
+    "\n",
+    "    # Create a sample DataFrame\n",
+    "    df = spark.createDataFrame(\n",
+    "        [\n",
+    "            (\"Product 1\", 10.0, 5),\n",
+    "            (\"Product 2\", 15.0, 3),\n",
+    "            (\"Product 3\", 8.0, 2),\n",
+    "        ],\n",
+    "        [\"name\", \"price\", \"quantity\"],\n",
+    "    )\n",
+    "\n",
+    "    # Execute the query\n",
+    "    actual_df = spark.sql(query, df=df, amount=10)\n",
+    "\n",
+    "    # Assert the result\n",
+    "    expected_df = spark.createDataFrame(\n",
+    "        [\n",
+    "            (\"Product 2\", 15.0, 3),\n",
+    "        ],\n",
+    "        [\"name\", \"price\", \"quantity\"],\n",
+    "    )\n",
+    "    assertDataFrameEqual(actual_df, expected_df)"
+   ]
   }
  ],
  "metadata": {

docs/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
