
Commit 33061e0: add pyspark
Parent: 34f05b8

File tree: 4 files changed (+426, -1 lines)

Chapter5/spark.ipynb

Lines changed: 155 additions & 0 deletions
@@ -1951,6 +1951,161 @@
     "- **Familiarity**: Use spark.sql() if your team prefers SQL syntax. Use the DataFrame API if chained method calls are more intuitive for your team.\n",
     "- **Complexity of Transformations**: The DataFrame API is more flexible for complex manipulations, while SQL is more concise for simpler queries."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f7bb85d6",
+   "metadata": {},
+   "source": [
+    "### Enhance Code Modularity and Reusability with Temporary Views in PySpark"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "511c6792",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "!pip install -U 'pyspark[sql]'\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "9ab976de",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "from pyspark.sql import SparkSession\n",
+    "\n",
+    "# Create SparkSession\n",
+    "spark = SparkSession.builder.getOrCreate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2ac1a0a7",
+   "metadata": {},
+   "source": [
+    "In PySpark, temporary views let you run SQL queries against a DataFrame. A temporary view is a named reference to the DataFrame's logical plan in the session catalog: it is not materialized unless you cache it explicitly, and it lasts only for the lifetime of the SparkSession.\n",
+    "\n",
+    "To demonstrate this, let's create a PySpark DataFrame called `orders_df`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "e4cf261e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a sample DataFrame\n",
+    "data = [\n",
+    "    (1001, \"John Doe\", 500.0),\n",
+    "    (1002, \"Jane Smith\", 750.0),\n",
+    "    (1003, \"Bob Johnson\", 300.0),\n",
+    "    (1004, \"Sarah Lee\", 400.0),\n",
+    "    (1005, \"Tom Wilson\", 600.0),\n",
+    "]\n",
+    "\n",
+    "columns = [\"customer_id\", \"customer_name\", \"revenue\"]\n",
+    "orders_df = spark.createDataFrame(data, columns)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf288495",
+   "metadata": {},
+   "source": [
+    "Next, create a temporary view called `orders` from the `orders_df` DataFrame using the `createOrReplaceTempView` method."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "019a8451",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a temporary view\n",
+    "orders_df.createOrReplaceTempView(\"orders\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43a2a924",
+   "metadata": {},
+   "source": [
+    "With the temporary view created, we can perform various operations on it using SQL queries."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "f88486c5",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total Revenue:\n",
+      "+-------------+\n",
+      "|total_revenue|\n",
+      "+-------------+\n",
+      "|       2550.0|\n",
+      "+-------------+\n",
+      "\n",
+      "\n",
+      "Top 10 Customers by Total Revenue:\n",
+      "+-----------+-------------+\n",
+      "|customer_id|total_revenue|\n",
+      "+-----------+-------------+\n",
+      "|       1002|        750.0|\n",
+      "|       1005|        600.0|\n",
+      "|       1001|        500.0|\n",
+      "|       1004|        400.0|\n",
+      "|       1003|        300.0|\n",
+      "+-----------+-------------+\n",
+      "\n",
+      "\n",
+      "Number of Orders:\n",
+      "+-----------+\n",
+      "|order_count|\n",
+      "+-----------+\n",
+      "|          5|\n",
+      "+-----------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Perform operations on the temporary view\n",
+    "total_revenue = spark.sql(\"SELECT SUM(revenue) AS total_revenue FROM orders\")\n",
+    "top_customers = spark.sql(\n",
+    "    \"SELECT customer_id, SUM(revenue) AS total_revenue FROM orders GROUP BY customer_id ORDER BY total_revenue DESC LIMIT 10\"\n",
+    ")\n",
+    "order_count = spark.sql(\"SELECT COUNT(*) AS order_count FROM orders\")\n",
+    "\n",
+    "# Display the results\n",
+    "print(\"Total Revenue:\")\n",
+    "total_revenue.show()\n",
+    "\n",
+    "print(\"\\nTop 10 Customers by Total Revenue:\")\n",
+    "top_customers.show()\n",
+    "\n",
+    "print(\"\\nNumber of Orders:\")\n",
+    "order_count.show()"
+   ]
   }
  ],
  "metadata": {

docs/Chapter5/spark.html

Lines changed: 115 additions & 0 deletions
@@ -523,6 +523,7 @@ <h2> Contents </h2>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#vectorized-operations-in-pyspark-pandas-udf-vs-standard-udf">6.15.9. Vectorized Operations in PySpark: pandas_udf vs Standard UDF</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#optimizing-pyspark-queries-dataframe-api-or-sql">6.15.10. Optimizing PySpark Queries: DataFrame API or SQL?</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#enhance-code-modularity-and-reusability-with-temporary-views-in-pyspark">6.15.11. Enhance Code Modularity and Reusability with Temporary Views in PySpark</a></li>
 </ul>
 </nav>
 </div>
@@ -1847,6 +1848,119 @@ <h2><span class="section-number">6.15.10. </span>Optimizing PySpark Queries: Dat
 <li><p><strong>Complexity of Transformations</strong>: The DataFrame API is more flexible for complex manipulations, while SQL is more concise for simpler queries.</p></li>
 </ul>
 </section>
+<section id="enhance-code-modularity-and-reusability-with-temporary-views-in-pyspark">
+<h2><span class="section-number">6.15.11. </span>Enhance Code Modularity and Reusability with Temporary Views in PySpark<a class="headerlink" href="#enhance-code-modularity-and-reusability-with-temporary-views-in-pyspark" title="Permalink to this heading">#</a></h2>
+<div class="cell tag_hide-cell docutils container">
+<details class="hide above-input">
+<summary aria-label="Toggle hidden content">
+<span class="collapsed">Show code cell content</span>
+<span class="expanded">Hide code cell content</span>
+</summary>
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>-U<span class="w"> </span><span class="s1">&#39;pyspark[sql]&#39;</span>
+</pre></div>
+</div>
+</div>
+</details>
+</div>
+<div class="cell tag_hide-cell docutils container">
+<details class="hide above-input">
+<summary aria-label="Toggle hidden content">
+<span class="collapsed">Show code cell content</span>
+<span class="expanded">Hide code cell content</span>
+</summary>
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
+
+<span class="c1"># Create SparkSession</span>
+<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
+</pre></div>
+</div>
+</div>
+</details>
+</div>
+<p>In PySpark, temporary views let you run SQL queries against a DataFrame. A temporary view is a named reference to the DataFrame’s logical plan in the session catalog: it is not materialized unless you cache it explicitly, and it lasts only for the lifetime of the SparkSession.</p>
+<p>To demonstrate this, let’s create a PySpark DataFrame called <code class="docutils literal notranslate"><span class="pre">orders_df</span></code>.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Create a sample DataFrame</span>
+<span class="n">data</span> <span class="o">=</span> <span class="p">[</span>
+    <span class="p">(</span><span class="mi">1001</span><span class="p">,</span> <span class="s2">&quot;John Doe&quot;</span><span class="p">,</span> <span class="mf">500.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">1002</span><span class="p">,</span> <span class="s2">&quot;Jane Smith&quot;</span><span class="p">,</span> <span class="mf">750.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">1003</span><span class="p">,</span> <span class="s2">&quot;Bob Johnson&quot;</span><span class="p">,</span> <span class="mf">300.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">1004</span><span class="p">,</span> <span class="s2">&quot;Sarah Lee&quot;</span><span class="p">,</span> <span class="mf">400.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">1005</span><span class="p">,</span> <span class="s2">&quot;Tom Wilson&quot;</span><span class="p">,</span> <span class="mf">600.0</span><span class="p">),</span>
+<span class="p">]</span>
+
+<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;customer_id&quot;</span><span class="p">,</span> <span class="s2">&quot;customer_name&quot;</span><span class="p">,</span> <span class="s2">&quot;revenue&quot;</span><span class="p">]</span>
+<span class="n">orders_df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="p">)</span>
+</pre></div>
+</div>
+</div>
+</div>
+<p>Next, create a temporary view called <code class="docutils literal notranslate"><span class="pre">orders</span></code> from the <code class="docutils literal notranslate"><span class="pre">orders_df</span></code> DataFrame using the <code class="docutils literal notranslate"><span class="pre">createOrReplaceTempView</span></code> method.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Create a temporary view</span>
+<span class="n">orders_df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">&quot;orders&quot;</span><span class="p">)</span>
+</pre></div>
+</div>
+</div>
+</div>
+<p>With the temporary view created, we can perform various operations on it using SQL queries.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Perform operations on the temporary view</span>
+<span class="n">total_revenue</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT SUM(revenue) AS total_revenue FROM orders&quot;</span><span class="p">)</span>
+<span class="n">top_customers</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span>
+    <span class="s2">&quot;SELECT customer_id, SUM(revenue) AS total_revenue FROM orders GROUP BY customer_id ORDER BY total_revenue DESC LIMIT 10&quot;</span>
+<span class="p">)</span>
+<span class="n">order_count</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT COUNT(*) AS order_count FROM orders&quot;</span><span class="p">)</span>
+
+<span class="c1"># Display the results</span>
+<span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Total Revenue:&quot;</span><span class="p">)</span>
+<span class="n">total_revenue</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+
+<span class="nb">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="se">\n</span><span class="s2">Top 10 Customers by Total Revenue:&quot;</span><span class="p">)</span>
+<span class="n">top_customers</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+
+<span class="nb">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="se">\n</span><span class="s2">Number of Orders:&quot;</span><span class="p">)</span>
+<span class="n">order_count</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+</pre></div>
+</div>
+</div>
+<div class="cell_output docutils container">
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Total Revenue:
++-------------+
+|total_revenue|
++-------------+
+|       2550.0|
++-------------+
+
+
+Top 10 Customers by Total Revenue:
++-----------+-------------+
+|customer_id|total_revenue|
++-----------+-------------+
+|       1002|        750.0|
+|       1005|        600.0|
+|       1001|        500.0|
+|       1004|        400.0|
+|       1003|        300.0|
++-----------+-------------+
+
+
+Number of Orders:
++-----------+
+|order_count|
++-----------+
+|          5|
++-----------+
+</pre></div>
+</div>
+</div>
+</div>
+</section>
 </section>
 
 <script type="text/x-thebe-config">
@@ -1922,6 +2036,7 @@ <h2><span class="section-number">6.15.10. </span>Optimizing PySpark Queries: Dat
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#vectorized-operations-in-pyspark-pandas-udf-vs-standard-udf">6.15.9. Vectorized Operations in PySpark: pandas_udf vs Standard UDF</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#optimizing-pyspark-queries-dataframe-api-or-sql">6.15.10. Optimizing PySpark Queries: DataFrame API or SQL?</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#enhance-code-modularity-and-reusability-with-temporary-views-in-pyspark">6.15.11. Enhance Code Modularity and Reusability with Temporary Views in PySpark</a></li>
 </ul>
 </nav></div>
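
On the modularity theme in the section title, the recipe pays off once view registration is centralized so that downstream SQL depends only on stable view names. A hypothetical sketch of that pattern follows; the `register_views` helper and its keyword-argument convention are illustrative, not part of the commit.

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()


def register_views(**views: DataFrame) -> None:
    """Register each DataFrame under its keyword name as a temporary view."""
    for name, df in views.items():
        df.createOrReplaceTempView(name)


# Upstream code decides where the data comes from...
orders_df = spark.createDataFrame(
    [(1001, "John Doe", 500.0), (1002, "Jane Smith", 750.0)],
    ["customer_id", "customer_name", "revenue"],
)
register_views(orders=orders_df)

# ...while saved queries depend only on the view name and can be reused
# unchanged when the source changes (test fixture, Parquet file, JDBC table).
top_customers_sql = """
    SELECT customer_id, SUM(revenue) AS total_revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
"""
spark.sql(top_customers_sql).show()
```

Swapping the source behind `orders` then touches only the registration call, while every stored SQL string keeps working.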