Commit 0e74fd6

add pyspark
1 parent acc215e commit 0e74fd6

14 files changed, with 2833 additions and 26 deletions

Chapter5/natural_language_processing.ipynb

Lines changed: 802 additions & 0 deletions
Large diffs are not rendered by default.

Chapter5/spark.ipynb

Lines changed: 177 additions & 2 deletions
@@ -1,13 +1,188 @@
 {
 "cells": [
 {
-"cell_type": "markdown",
-"id": "b25228f0",
+"cell_type": "raw",
+"id": "af6530d4-d251-4240-90f6-eed4704a0a1a",
 "metadata": {},
 "source": [
 "## PySpark"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "06ae6e73-bfad-45fb-b338-048da0c0c789",
+"metadata": {},
+"source": [
+"## 3 Powerful Ways to Create PySpark DataFrames"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "66e1b5d0",
+"metadata": {},
+"outputs": [],
+"source": [
+"from pyspark.sql import SparkSession\n",
+"\n",
+"spark = SparkSession.builder.getOrCreate()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "08648f09-21cd-42d0-8b0f-be04fa7e2002",
+"metadata": {},
+"source": [
+"Here are the three powerful methods to create DataFrames in PySpark, each with its own advantages:"
+]
+},
+{
+"cell_type": "markdown",
+"id": "b35944a8-7824-4971-9cc5-cf847c5269fb",
+"metadata": {},
+"source": [
+"1. Using StructType and StructField:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 6,
+"id": "a16e73a8",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"+-------+---+\n",
+"|   name|age|\n",
+"+-------+---+\n",
+"|  Alice| 25|\n",
+"|    Bob| 30|\n",
+"|Charlie| 35|\n",
+"+-------+---+\n",
+"\n"
+]
+}
+],
+"source": [
+"from pyspark.sql.types import StructType, StructField, StringType, IntegerType\n",
+"\n",
+"\n",
+"data = [(\"Alice\", 25), (\"Bob\", 30), (\"Charlie\", 35)]\n",
+"schema = StructType(\n",
+"    [StructField(\"name\", StringType(), True), StructField(\"age\", IntegerType(), True)]\n",
+")\n",
+"\n",
+"df = spark.createDataFrame(data, schema)\n",
+"df.show()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "d6db8d65-4aa9-4f2a-bbf1-2a85e62b987a",
+"metadata": {},
+"source": [
+"Pros:\n",
+"- Explicit schema definition, giving you full control over data types\n",
+"- Helps catch data type mismatches early\n",
+"- Ideal when you need to ensure data consistency and type safety\n",
+"- Can improve performance by avoiding schema inference"
+]
+},
+{
+"cell_type": "markdown",
+"id": "9ee5ab77-dd71-4e83-bf66-f7b5704ead09",
+"metadata": {},
+"source": [
+"2. Using Row objects:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 5,
+"id": "bfca4bd7",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"+-------+---+\n",
+"|   name|age|\n",
+"+-------+---+\n",
+"|  Alice| 25|\n",
+"|    Bob| 30|\n",
+"|Charlie| 35|\n",
+"+-------+---+\n",
+"\n"
+]
+}
+],
+"source": [
+"from pyspark.sql import Row\n",
+"\n",
+"data = [Row(name=\"Alice\", age=25), Row(name=\"Bob\", age=30), Row(name=\"Charlie\", age=35)]\n",
+"df = spark.createDataFrame(data)\n",
+"df.show()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "8812e9a0-c54d-44f4-8300-5ec0bdf53061",
+"metadata": {},
+"source": [
+"Pros:\n",
+"- More Pythonic approach, leveraging named tuples\n",
+"- Good for scenarios where data structure might evolve"
+]
+},
+{
+"cell_type": "markdown",
+"id": "ef78d9a3-cd5a-44bb-a1d9-155e67c3743f",
+"metadata": {},
+"source": [
+"3. From Pandas DataFrame:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 6,
+"id": "9f8050dc",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"+-------+---+\n",
+"|   name|age|\n",
+"+-------+---+\n",
+"|  Alice| 25|\n",
+"|    Bob| 30|\n",
+"|Charlie| 35|\n",
+"+-------+---+\n",
+"\n"
+]
+}
+],
+"source": [
+"import pandas as pd\n",
+"\n",
+"pandas_df = pd.DataFrame({\"name\": [\"Alice\", \"Bob\", \"Charlie\"], \"age\": [25, 30, 35]})\n",
+"df = spark.createDataFrame(pandas_df)\n",
+"df.show()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "aaf54d83-69a5-47ec-b0c7-bf435e08fc5d",
+"metadata": {},
+"source": [
+"Pros:\n",
+"- Familiar to data scientists who frequently use Pandas"
+]
+},
 {
 "cell_type": "markdown",
 "id": "8edc16c3",

docs/Chapter1/Chapter1.html

Lines changed: 1 addition & 1 deletion
@@ -271,7 +271,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/better_pandas.html">6.12. Better Pandas</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/testing.html">6.13. Testing</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/SQL.html">6.14. SQL Libraries</a></li>
-<li class="toctree-l2"><a class="reference internal" href="../Chapter5/spark.html">6.15. PySpark</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../Chapter5/spark.html">6.15. 3 Powerful Ways to Create PySpark DataFrames</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/llm.html">6.16. Large Language Model (LLM)</a></li>
 </ul>
 </li>

docs/Chapter1/set.html

Lines changed: 1 addition & 1 deletion
@@ -271,7 +271,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/better_pandas.html">6.12. Better Pandas</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/testing.html">6.13. Testing</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/SQL.html">6.14. SQL Libraries</a></li>
-<li class="toctree-l2"><a class="reference internal" href="../Chapter5/spark.html">6.15. PySpark</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../Chapter5/spark.html">6.15. 3 Powerful Ways to Create PySpark DataFrames</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../Chapter5/llm.html">6.16. Large Language Model (LLM)</a></li>
 </ul>
 </li>

docs/Chapter5/Chapter5.html

Lines changed: 1 addition & 1 deletion
@@ -271,7 +271,7 @@
 <li class="toctree-l2"><a class="reference internal" href="better_pandas.html">6.12. Better Pandas</a></li>
 <li class="toctree-l2"><a class="reference internal" href="testing.html">6.13. Testing</a></li>
 <li class="toctree-l2"><a class="reference internal" href="SQL.html">6.14. SQL Libraries</a></li>
-<li class="toctree-l2"><a class="reference internal" href="spark.html">6.15. PySpark</a></li>
+<li class="toctree-l2"><a class="reference internal" href="spark.html">6.15. 3 Powerful Ways to Create PySpark DataFrames</a></li>
 <li class="toctree-l2"><a class="reference internal" href="llm.html">6.16. Large Language Model (LLM)</a></li>
 </ul>
 </li>
