
Commit f2d1646

add pyspark
1 parent fbea331 commit f2d1646

File tree

4 files changed: +417 −1 lines changed


Chapter5/spark.ipynb

Lines changed: 153 additions & 0 deletions
@@ -1459,6 +1459,159 @@
     " )\n",
     " assertDataFrameEqual(actual_df, expected_df)"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f102a2ba",
+   "metadata": {},
+   "source": [
+    "### Update Multiple Columns in Spark 3.3 and Later"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70283a1f",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "!pip install -U \"pyspark[sql]\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa6b1afd",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "from pyspark.sql import SparkSession\n",
+    "\n",
+    "# Create SparkSession\n",
+    "spark = SparkSession.builder.getOrCreate()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "327cc772",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+----------+---+\n",
+      "|first_name|age|\n",
+      "+----------+---+\n",
+      "|     John | 35|\n",
+      "|      Jane| 28|\n",
+      "+----------+---+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from pyspark.sql.functions import col, trim\n",
+    "\n",
+    "# Create a sample DataFrame\n",
+    "data = [(\" John \", 35), (\"Jane\", 28)]\n",
+    "columns = [\"first_name\", \"age\"]\n",
+    "df = spark.createDataFrame(data, columns)\n",
+    "df.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95b37c9a",
+   "metadata": {},
+   "source": [
+    "Prior to PySpark 3.3, adding or updating multiple columns in a Spark DataFrame required chaining multiple `withColumn` calls."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "9e38d06c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+----------+---+------------------+\n",
+      "|first_name|age|age_after_10_years|\n",
+      "+----------+---+------------------+\n",
+      "|      John| 35|                45|\n",
+      "|      Jane| 28|                38|\n",
+      "+----------+---+------------------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Before Spark 3.3\n",
+    "new_df = (df\n",
+    "    .withColumn(\"first_name\", trim(col(\"first_name\")))\n",
+    "    .withColumn(\"age_after_10_years\", col(\"age\") + 10)\n",
+    ")\n",
+    "\n",
+    "new_df.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc42fddd",
+   "metadata": {},
+   "source": [
+    "In PySpark 3.3 and later, you can use the `withColumns` method with a dictionary to add or update multiple columns in a single call. This syntax will feel familiar to pandas users."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "ae122634",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      " \r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+----------+---+------------------+\n",
+      "|first_name|age|age_after_10_years|\n",
+      "+----------+---+------------------+\n",
+      "|      John| 35|                45|\n",
+      "|      Jane| 28|                38|\n",
+      "+----------+---+------------------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "new_df = df.withColumns(\n",
+    "    {\n",
+    "        \"first_name\": trim(col(\"first_name\")),\n",
+    "        \"age_after_10_years\": col(\"age\") + 10,\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "new_df.show()"
+   ]
  }
 ],
 "metadata": {

docs/Chapter5/spark.html

Lines changed: 110 additions & 0 deletions
@@ -234,6 +234,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../Chapter2/dataclasses.html">3.7. Data Classes</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter2/typing.html">3.8. Typing</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter2/pathlib.html">3.9. pathlib</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../Chapter2/pydantic.html">3.10. Pydantic</a></li>
 </ul>
 </li>
 <li class="toctree-l1 has-children"><a class="reference internal" href="../Chapter3/Chapter3.html">4. Pandas</a><input class="toctree-checkbox" id="toctree-checkbox-4" name="toctree-checkbox-4" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-4"><i class="fa-solid fa-chevron-down"></i></label><ul>
@@ -519,6 +520,7 @@ <h2> Contents </h2>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-complex-sql-queries-with-pyspark-udfs">6.15.5. Simplify Complex SQL Queries with PySpark UDFs</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#leverage-spark-udfs-for-reusable-complex-logic-in-sql-queries">6.15.6. Leverage Spark UDFs for Reusable Complex Logic in SQL Queries</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-unit-testing-of-sql-queries-with-pyspark">6.15.7. Simplify Unit Testing of SQL Queries with PySpark</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li>
 </ul>
 </nav>
 </div>
@@ -1520,6 +1522,113 @@ <h2><span class="section-number">6.15.7. </span>Simplify Unit Testing of SQL Que
 </div>
 </div>
 </section>
+<section id="update-multiple-columns-in-spark-3-3-and-later">
+<h2><span class="section-number">6.15.8. </span>Update Multiple Columns in Spark 3.3 and Later<a class="headerlink" href="#update-multiple-columns-in-spark-3-3-and-later" title="Permalink to this heading">#</a></h2>
+<div class="cell tag_hide-cell docutils container">
+<details class="hide above-input">
+<summary aria-label="Toggle hidden content">
+<span class="collapsed">Show code cell content</span>
+<span class="expanded">Hide code cell content</span>
+</summary>
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>-U<span class="w"> </span><span class="s2">&quot;pyspark[sql]&quot;</span>
+</pre></div>
+</div>
+</div>
+</details>
+</div>
+<div class="cell tag_hide-cell docutils container">
+<details class="hide above-input">
+<summary aria-label="Toggle hidden content">
+<span class="collapsed">Show code cell content</span>
+<span class="expanded">Hide code cell content</span>
+</summary>
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
+
+<span class="c1"># Create SparkSession</span>
+<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
+</pre></div>
+</div>
+</div>
+</details>
+</div>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span><span class="p">,</span> <span class="n">trim</span>
+
+<span class="c1"># Create a sample DataFrame</span>
+<span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="s2">&quot; John &quot;</span><span class="p">,</span> <span class="mi">35</span><span class="p">),</span> <span class="p">(</span><span class="s2">&quot;Jane&quot;</span><span class="p">,</span> <span class="mi">28</span><span class="p">)]</span>
+<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;first_name&quot;</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">]</span>
+<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="p">)</span>
+<span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+</pre></div>
+</div>
+</div>
+<div class="cell_output docutils container">
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+----------+---+
+|first_name|age|
++----------+---+
+|     John | 35|
+|      Jane| 28|
++----------+---+
+</pre></div>
+</div>
+</div>
+</div>
+<p>Prior to PySpark 3.3, adding or updating multiple columns in a Spark DataFrame required chaining multiple <code class="docutils literal notranslate"><span class="pre">withColumn</span></code> calls.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Before Spark 3.3</span>
+<span class="n">new_df</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span>
+    <span class="o">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s2">&quot;first_name&quot;</span><span class="p">,</span> <span class="n">trim</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s2">&quot;first_name&quot;</span><span class="p">)))</span>
+    <span class="o">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s2">&quot;age_after_10_years&quot;</span><span class="p">,</span> <span class="n">col</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">)</span> <span class="o">+</span> <span class="mi">10</span><span class="p">)</span>
+<span class="p">)</span>
+
+<span class="n">new_df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+</pre></div>
+</div>
+</div>
+<div class="cell_output docutils container">
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+----------+---+------------------+
+|first_name|age|age_after_10_years|
++----------+---+------------------+
+|      John| 35|                45|
+|      Jane| 28|                38|
++----------+---+------------------+
+</pre></div>
+</div>
+</div>
+</div>
+<p>In PySpark 3.3 and later, you can use the <code class="docutils literal notranslate"><span class="pre">withColumns</span></code> method with a dictionary to add or update multiple columns in a single call. This syntax will feel familiar to pandas users.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">new_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">withColumns</span><span class="p">(</span>
+    <span class="p">{</span>
+        <span class="s2">&quot;first_name&quot;</span><span class="p">:</span> <span class="n">trim</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s2">&quot;first_name&quot;</span><span class="p">)),</span>
+        <span class="s2">&quot;age_after_10_years&quot;</span><span class="p">:</span> <span class="n">col</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">)</span> <span class="o">+</span> <span class="mi">10</span><span class="p">,</span>
+    <span class="p">}</span>
+<span class="p">)</span>
+
+<span class="n">new_df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+</pre></div>
+</div>
+</div>
+<div class="cell_output docutils container">
+<div class="output stderr highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>
+</pre></div>
+</div>
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+----------+---+------------------+
+|first_name|age|age_after_10_years|
++----------+---+------------------+
+|      John| 35|                45|
+|      Jane| 28|                38|
++----------+---+------------------+
+</pre></div>
+</div>
+</div>
+</div>
+</section>
 </section>
 
 <script type="text/x-thebe-config">
@@ -1592,6 +1701,7 @@ <h2><span class="section-number">6.15.7. </span>Simplify Unit Testing of SQL Que
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-complex-sql-queries-with-pyspark-udfs">6.15.5. Simplify Complex SQL Queries with PySpark UDFs</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#leverage-spark-udfs-for-reusable-complex-logic-in-sql-queries">6.15.6. Leverage Spark UDFs for Reusable Complex Logic in SQL Queries</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-unit-testing-of-sql-queries-with-pyspark">6.15.7. Simplify Unit Testing of SQL Queries with PySpark</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li>
 </ul>
 </nav></div>
 