Chapter5/spark.ipynb
155 additions & 0 deletions
@@ -1951,6 +1951,161 @@
 "- **Familiarity**: Use spark.sql() if your team prefers SQL syntax. Use the DataFrame API if chained method calls are more intuitive for your team.\n",
 "- **Complexity of Transformations**: The DataFrame API is more flexible for complex manipulations, while SQL is more concise for simpler queries."
 ]
+},
+{
+"cell_type": "markdown",
+"id": "f7bb85d6",
+"metadata": {},
+"source": [
+"### Enhance Code Modularity and Reusability with Temporary Views in PySpark"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "511c6792",
+"metadata": {
+"tags": [
+"hide-cell"
+]
+},
+"outputs": [],
+"source": [
+"!pip install -U 'pyspark[sql]'\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 17,
+"id": "9ab976de",
+"metadata": {
+"tags": [
+"hide-cell"
+]
+},
+"outputs": [],
+"source": [
+"from pyspark.sql import SparkSession\n",
+"\n",
+"# Create SparkSession\n",
+"spark = SparkSession.builder.getOrCreate()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "2ac1a0a7",
+"metadata": {},
+"source": [
+"In PySpark, temporary views enable SQL query operations on a DataFrame. A temporary view is a session-scoped name bound to the DataFrame, so any SQL query in the session can reference it by name instead of passing the DataFrame object around.\n",
+"\n",
+"To demonstrate this, let's create a PySpark DataFrame called `orders_df`."

docs/Chapter5/spark.html
115 additions & 0 deletions
@@ -523,6 +523,7 @@ <h2> Contents </h2>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#vectorized-operations-in-pyspark-pandas-udf-vs-standard-udf">6.15.9. Vectorized Operations in PySpark: pandas_udf vs Standard UDF</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#optimizing-pyspark-queries-dataframe-api-or-sql">6.15.10. Optimizing PySpark Queries: DataFrame API or SQL?</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#enhance-code-modularity-and-reusability-with-temporary-views-in-pyspark">6.15.11. Enhance Code Modularity and Reusability with Temporary Views in PySpark</a></li>
 </ul>
 </nav>
 </div>
@@ -1847,6 +1848,119 @@ <h2><span class="section-number">6.15.10. </span>Optimizing PySpark Queries: Dat
 <li><p><strong>Complexity of Transformations</strong>: The DataFrame API is more flexible for complex manipulations, while SQL is more concise for simpler queries.</p></li>
+<h2><span class="section-number">6.15.11. </span>Enhance Code Modularity and Reusability with Temporary Views in PySpark<a class="headerlink" href="#enhance-code-modularity-and-reusability-with-temporary-views-in-pyspark" title="Permalink to this heading">#</a></h2>
+<p>In PySpark, temporary views enable SQL query operations on a DataFrame. A temporary view is a session-scoped name bound to the DataFrame, so any SQL query in the session can reference it by name instead of passing the DataFrame object around.</p>
+<p>To demonstrate this, let’s create a PySpark DataFrame called <code class="docutils literal notranslate"><span class="pre">orders_df</span></code>.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Create a sample DataFrame</span>
+<p>Next, create a temporary view called <code class="docutils literal notranslate"><span class="pre">orders</span></code> from the <code class="docutils literal notranslate"><span class="pre">orders_df</span></code> DataFrame using the <code class="docutils literal notranslate"><span class="pre">createOrReplaceTempView</span></code> method.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Create a temporary view</span>
+<p>With the temporary view created, we can perform various operations on it using SQL queries.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Perform operations on the temporary view</span>
+<span class="n">total_revenue</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT SUM(revenue) AS total_revenue FROM orders"</span><span class="p">)</span>
+<span class="s2">"SELECT customer_id, SUM(revenue) AS total_revenue FROM orders GROUP BY customer_id ORDER BY total_revenue DESC LIMIT 10"</span>
+<span class="p">)</span>
+<span class="n">order_count</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT COUNT(*) AS order_count FROM orders"</span><span class="p">)</span>
+<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">Top 10 Customers by Total Revenue:"</span><span class="p">)</span>
+<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">Number of Orders:"</span><span class="p">)</span>
@@ -1922,6 +2036,7 @@ <h2><span class="section-number">6.15.10. </span>Optimizing PySpark Queries: Dat
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#update-multiple-columns-in-spark-3-3-and-later">6.15.8. Update Multiple Columns in Spark 3.3 and Later</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#vectorized-operations-in-pyspark-pandas-udf-vs-standard-udf">6.15.9. Vectorized Operations in PySpark: pandas_udf vs Standard UDF</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#optimizing-pyspark-queries-dataframe-api-or-sql">6.15.10. Optimizing PySpark Queries: DataFrame API or SQL?</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#enhance-code-modularity-and-reusability-with-temporary-views-in-pyspark">6.15.11. Enhance Code Modularity and Reusability with Temporary Views in PySpark</a></li>