|
234 | 234 | <li class="toctree-l2"><a class="reference internal" href="../Chapter2/dataclasses.html">3.7. Data Classes</a></li>
|
235 | 235 | <li class="toctree-l2"><a class="reference internal" href="../Chapter2/typing.html">3.8. Typing</a></li>
|
236 | 236 | <li class="toctree-l2"><a class="reference internal" href="../Chapter2/pathlib.html">3.9. pathlib</a></li>
|
| 237 | +<li class="toctree-l2"><a class="reference internal" href="../Chapter2/pydantic.html">3.10. Pydantic</a></li> |
237 | 238 | </ul>
|
238 | 239 | </li>
|
239 | 240 | <li class="toctree-l1 has-children"><a class="reference internal" href="../Chapter3/Chapter3.html">4. Pandas</a><input class="toctree-checkbox" id="toctree-checkbox-4" name="toctree-checkbox-4" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-4"><i class="fa-solid fa-chevron-down"></i></label><ul>
|
@@ -519,6 +520,7 @@ <h2> Contents </h2>
|
519 | 520 | <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-complex-sql-queries-with-pyspark-udfs">6.15.5. Simplify Complex SQL Queries with PySpark UDFs</a></li>
|
520 | 521 | <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#leverage-spark-udfs-for-reusable-complex-logic-in-sql-queries">6.15.6. Leverage Spark UDFs for Reusable Complex Logic in SQL Queries</a></li>
|
521 | 522 | <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-unit-testing-of-sql-queries-with-pyspark">6.15.7. Simplify Unit Testing of SQL Queries with PySpark</a></li>
|
| 523 | +<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li> |
522 | 524 | </ul>
|
523 | 525 | </nav>
|
524 | 526 | </div>
|
@@ -1520,6 +1522,113 @@ <h2><span class="section-number">6.15.7. </span>Simplify Unit Testing of SQL Que
|
1520 | 1522 | </div>
|
1521 | 1523 | </div>
|
1522 | 1524 | </section>
|
| 1525 | +<section id="update-multiple-columns-in-spark-3-3-and-later"> |
| 1526 | +<h2><span class="section-number">6.15.8. </span>Update Multiple Columns in Spark 3.3 and Later<a class="headerlink" href="#update-multiple-columns-in-spark-3-3-and-later" title="Permalink to this heading">#</a></h2> |
| 1527 | +<div class="cell tag_hide-cell docutils container"> |
| 1528 | +<details class="hide above-input"> |
| 1529 | +<summary aria-label="Toggle hidden content"> |
| 1530 | +<span class="collapsed">Show code cell content</span> |
| 1531 | +<span class="expanded">Hide code cell content</span> |
| 1532 | +</summary> |
| 1533 | +<div class="cell_input docutils container"> |
| 1534 | +<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>-U<span class="w"> </span><span class="s2">"pyspark[sql]"</span> |
| 1535 | +</pre></div> |
| 1536 | +</div> |
| 1537 | +</div> |
| 1538 | +</details> |
| 1539 | +</div> |
| 1540 | +<div class="cell tag_hide-cell docutils container"> |
| 1541 | +<details class="hide above-input"> |
| 1542 | +<summary aria-label="Toggle hidden content"> |
| 1543 | +<span class="collapsed">Show code cell content</span> |
| 1544 | +<span class="expanded">Hide code cell content</span> |
| 1545 | +</summary> |
| 1546 | +<div class="cell_input docutils container"> |
| 1547 | +<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span> |
| 1548 | + |
| 1549 | +<span class="c1"># Create SparkSession</span> |
| 1550 | +<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span> |
| 1551 | +</pre></div> |
| 1552 | +</div> |
| 1553 | +</div> |
| 1554 | +</details> |
| 1555 | +</div> |
| 1556 | +<div class="cell docutils container"> |
| 1557 | +<div class="cell_input docutils container"> |
| 1558 | +<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span><span class="p">,</span> <span class="n">trim</span> |
| 1559 | + |
| 1560 | +<span class="c1"># Create a sample DataFrame</span> |
| 1561 | +<span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="s2">" John "</span><span class="p">,</span> <span class="mi">35</span><span class="p">),</span> <span class="p">(</span><span class="s2">"Jane"</span><span class="p">,</span> <span class="mi">28</span><span class="p">)]</span> |
| 1562 | +<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"first_name"</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">]</span> |
| 1563 | +<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="p">)</span> |
| 1564 | +<span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| 1565 | +</pre></div> |
| 1566 | +</div> |
| 1567 | +</div> |
| 1568 | +<div class="cell_output docutils container"> |
| 1569 | +<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+----------+---+ |
| 1570 | +|first_name|age| |
| 1571 | ++----------+---+ |
| 1572 | +| John | 35| |
| 1573 | +| Jane| 28| |
| 1574 | ++----------+---+ |
| 1575 | +</pre></div> |
| 1576 | +</div> |
| 1577 | +</div> |
| 1578 | +</div> |
| 1579 | +<p>Prior to PySpark 3.3, adding or updating multiple columns on a Spark DataFrame required chaining multiple <code class="docutils literal notranslate"><span class="pre">withColumn</span></code> calls, one per column.</p>
| 1580 | +<div class="cell docutils container"> |
| 1581 | +<div class="cell_input docutils container"> |
| 1582 | +<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Before Spark 3.3 </span> |
| 1583 | +<span class="n">new_df</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span> |
| 1584 | + <span class="o">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s2">"first_name"</span><span class="p">,</span> <span class="n">trim</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s2">"first_name"</span><span class="p">)))</span> |
| 1585 | + <span class="o">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s2">"age_after_10_years"</span><span class="p">,</span> <span class="n">col</span><span class="p">(</span><span class="s2">"age"</span><span class="p">)</span> <span class="o">+</span> <span class="mi">10</span><span class="p">)</span> |
| 1586 | + <span class="p">)</span> |
| 1587 | + |
| 1588 | +<span class="n">new_df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| 1589 | +</pre></div> |
| 1590 | +</div> |
| 1591 | +</div> |
| 1592 | +<div class="cell_output docutils container"> |
| 1593 | +<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+----------+---+------------------+ |
| 1594 | +|first_name|age|age_after_10_years| |
| 1595 | ++----------+---+------------------+ |
| 1596 | +| John| 35| 45| |
| 1597 | +| Jane| 28| 38| |
| 1598 | ++----------+---+------------------+ |
| 1599 | +</pre></div> |
| 1600 | +</div> |
| 1601 | +</div> |
| 1602 | +</div> |
| 1603 | +<p>In PySpark 3.3 and later, the <code class="docutils literal notranslate"><span class="pre">withColumns</span></code> method accepts a dictionary mapping column names to expressions, letting you add or update multiple columns in a single call. This syntax is also more familiar to pandas users.</p>
| 1604 | +<div class="cell docutils container"> |
| 1605 | +<div class="cell_input docutils container"> |
| 1606 | +<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">new_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">withColumns</span><span class="p">(</span> |
| 1607 | + <span class="p">{</span> |
| 1608 | + <span class="s2">"first_name"</span><span class="p">:</span> <span class="n">trim</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s2">"first_name"</span><span class="p">)),</span> |
| 1609 | + <span class="s2">"age_after_10_years"</span><span class="p">:</span> <span class="n">col</span><span class="p">(</span><span class="s2">"age"</span><span class="p">)</span> <span class="o">+</span> <span class="mi">10</span><span class="p">,</span> |
| 1610 | + <span class="p">}</span> |
| 1611 | +<span class="p">)</span> |
| 1612 | + |
| 1613 | +<span class="n">new_df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| 1614 | +</pre></div> |
| 1615 | +</div> |
| 1616 | +</div> |
| 1617 | +<div class="cell_output docutils container"> |
| 1621 | +<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+----------+---+------------------+ |
| 1622 | +|first_name|age|age_after_10_years| |
| 1623 | ++----------+---+------------------+ |
| 1624 | +| John| 35| 45| |
| 1625 | +| Jane| 28| 38| |
| 1626 | ++----------+---+------------------+ |
| 1627 | +</pre></div> |
| 1628 | +</div> |
| 1629 | +</div> |
| 1630 | +</div> |
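For comparison, the closest pandas equivalent of this one-call dictionary pattern is <code class="docutils literal notranslate"><span class="pre">DataFrame.assign</span></code>. A minimal sketch, reusing the same sample data as the Spark example above:

```python
import pandas as pd

# Same sample data as the Spark example above
pdf = pd.DataFrame({"first_name": ["  John  ", "Jane"], "age": [35, 28]})

# pandas' one-call analogue of withColumns: assign takes
# keyword arguments mapping column names to expressions
new_pdf = pdf.assign(
    first_name=pdf["first_name"].str.strip(),  # trim whitespace
    age_after_10_years=pdf["age"] + 10,        # derived column
)
print(new_pdf)
```

Like <code class="docutils literal notranslate"><span class="pre">withColumns</span></code>, <code class="docutils literal notranslate"><span class="pre">assign</span></code> returns a new DataFrame and can both overwrite existing columns and append new ones in one call.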
| 1631 | +</section> |
1523 | 1632 | </section>
|
1524 | 1633 |
|
1525 | 1634 | <script type="text/x-thebe-config">
|
@@ -1592,6 +1701,7 @@ <h2><span class="section-number">6.15.7. </span>Simplify Unit Testing of SQL Que
|
1592 | 1701 | <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-complex-sql-queries-with-pyspark-udfs">6.15.5. Simplify Complex SQL Queries with PySpark UDFs</a></li>
|
1593 | 1702 | <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#leverage-spark-udfs-for-reusable-complex-logic-in-sql-queries">6.15.6. Leverage Spark UDFs for Reusable Complex Logic in SQL Queries</a></li>
|
1594 | 1703 | <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simplify-unit-testing-of-sql-queries-with-pyspark">6.15.7. Simplify Unit Testing of SQL Queries with PySpark</a></li>
|
| 1704 | +<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li> |
1595 | 1705 | </ul>
|
1596 | 1706 | </nav></div>
|
1597 | 1707 |
|
|