
Commit 242a131

add spark
1 parent 812db9b commit 242a131

File tree

7 files changed (+1287, −70 lines)


Chapter1/class.ipynb

Lines changed: 202 additions & 29 deletions
Large diffs are not rendered by default.

Chapter3/data_types.ipynb

Lines changed: 201 additions & 27 deletions
Original file line number · Diff line number · Diff line change
@@ -153,57 +153,198 @@
153153
"id": "3f6afdfd",
154154
"metadata": {},
155155
"source": [
156-
"### Reduce pandas.DataFrame’s Memory"
156+
"### Smart Data Type Selection for Memory-Efficient Pandas"
157157
]
158158
},
159159
{
160160
"cell_type": "markdown",
161161
"id": "a71e2b90",
162162
"metadata": {},
163163
"source": [
164-
"If you want to reduce the memory of your pandas DataFrame, start with changing the data type of a column. If your categorical variable has low cardinality, change the data type to category like below."
164+
"To reduce the memory usage of a Pandas DataFrame, you can start by changing the data type of a column. "
165165
]
166166
},
167167
{
168168
"cell_type": "code",
169-
"execution_count": 7,
169+
"execution_count": 42,
170170
"id": "bb5a60a6",
171171
"metadata": {
172172
"ExecuteTime": {
173173
"end_time": "2021-11-18T14:28:51.041729Z",
174174
"start_time": "2021-11-18T14:28:51.013221Z"
175175
}
176176
},
177+
"outputs": [
178+
{
179+
"name": "stdout",
180+
"output_type": "stream",
181+
"text": [
182+
"<class 'pandas.core.frame.DataFrame'>\n",
183+
"RangeIndex: 150 entries, 0 to 149\n",
184+
"Data columns (total 5 columns):\n",
185+
" # Column Non-Null Count Dtype \n",
186+
"--- ------ -------------- ----- \n",
187+
" 0 sepal length (cm) 150 non-null float64\n",
188+
" 1 sepal width (cm) 150 non-null float64\n",
189+
" 2 petal length (cm) 150 non-null float64\n",
190+
" 3 petal width (cm) 150 non-null float64\n",
191+
" 4 target 150 non-null int64 \n",
192+
"dtypes: float64(4), int64(1)\n",
193+
"memory usage: 6.0 KB\n"
194+
]
195+
}
196+
],
197+
"source": [
198+
"from sklearn.datasets import load_iris\n",
199+
"import pandas as pd \n",
200+
"\n",
201+
"X, y = load_iris(as_frame=True, return_X_y=True)\n",
202+
"df = pd.concat([X, pd.DataFrame(y, columns=[\"target\"])], axis=1)\n",
203+
"df.info()"
204+
]
205+
},
206+
{
207+
"cell_type": "markdown",
208+
"id": "e7a44662-d675-46d1-b860-ce65fec1aeab",
209+
"metadata": {},
210+
"source": [
211+
"By default, Pandas uses `float64` for floating-point numbers, which can be oversized for columns with smaller value ranges. Here are some alternatives:\n",
212+
"\n",
213+
"- **float16**: Half precision; roughly 3 significant decimal digits, with values up to about ±65,504.\n",
213+
"- **float32**: Single precision; roughly 7 significant decimal digits.\n",
214+
"- **float64**: Double precision, the default; roughly 15 significant decimal digits.\n",
216+
"\n",
217+
"For example, the values in the \"sepal length (cm)\" column are small (well within `float16`'s range) and need only a few decimal digits of precision, so you can use `float16` to reduce memory usage."
218+
]
219+
},
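As a quick sanity check on those dtype limits, NumPy's `np.finfo` reports the representable range and decimal precision of each float type. A minimal sketch, assuming NumPy is installed alongside pandas:

```python
import numpy as np

# np.finfo reports the representable range and decimal precision of a float dtype
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{np.dtype(dtype).name}: max={info.max:.4g}, ~{info.precision} decimal digits")
```

This makes it easy to verify that a candidate dtype can hold a column's value range before downcasting.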
220+
{
221+
"cell_type": "code",
222+
"execution_count": 44,
223+
"id": "dbfae785-f316-4e8d-b428-81e05a8da1dc",
224+
"metadata": {},
177225
"outputs": [
178226
{
179227
"data": {
180228
"text/plain": [
181-
"Index 128\n",
182-
"sepal length (cm) 1200\n",
183-
"sepal width (cm) 1200\n",
184-
"petal length (cm) 1200\n",
185-
"petal width (cm) 1200\n",
186-
"target 1200\n",
187-
"dtype: int64"
229+
"7.9"
188230
]
189231
},
190-
"execution_count": 7,
232+
"execution_count": 44,
191233
"metadata": {},
192234
"output_type": "execute_result"
193235
}
194236
],
195237
"source": [
196-
"from sklearn.datasets import load_iris\n",
197-
"import pandas as pd \n",
238+
"df['sepal length (cm)'].max()"
239+
]
240+
},
241+
{
242+
"cell_type": "code",
243+
"execution_count": 45,
244+
"id": "a12334bd-2c33-45e8-9979-91f16c45df06",
245+
"metadata": {},
246+
"outputs": [
247+
{
248+
"data": {
249+
"text/plain": [
250+
"1332"
251+
]
252+
},
253+
"execution_count": 45,
254+
"metadata": {},
255+
"output_type": "execute_result"
256+
}
257+
],
258+
"source": [
259+
"df['sepal length (cm)'].memory_usage()"
260+
]
261+
},
262+
{
263+
"cell_type": "code",
264+
"execution_count": 46,
265+
"id": "1221e9cc-fed2-4f75-8698-9b09b89d4c0e",
266+
"metadata": {},
267+
"outputs": [
268+
{
269+
"data": {
270+
"text/plain": [
271+
"432"
272+
]
273+
},
274+
"execution_count": 46,
275+
"metadata": {},
276+
"output_type": "execute_result"
277+
}
278+
],
279+
"source": [
280+
"df[\"sepal length (cm)\"] = df[\"sepal length (cm)\"].astype(\"float16\")\n",
281+
"df['sepal length (cm)'].memory_usage()"
282+
]
283+
},
284+
{
285+
"cell_type": "markdown",
286+
"id": "fcdcaaed-a7b4-484e-8dd0-bfe766203967",
287+
"metadata": {},
288+
"source": [
289+
"Here, the memory usage of the \"sepal length (cm)\" column decreased from 1332 bytes to 432 bytes, a reduction of approximately 67.6%."
290+
]
291+
},
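One caveat not covered above: `float16` keeps only about 3 significant decimal digits, so downcast values are stored approximately rather than exactly. A minimal illustration of the rounding, assuming NumPy is available:

```python
import numpy as np

# float16 rounds to the nearest representable value (~3 significant digits)
value = np.float16(5.1)
print(value == np.float64(5.1))  # False: the half-precision value differs slightly
```

For measurement data like sepal lengths this error is usually negligible, but it matters for exact comparisons or accumulated sums.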
292+
{
293+
"cell_type": "markdown",
294+
"id": "c1d1f261-0b1a-4dd9-a3d9-fc6d742f5847",
295+
"metadata": {},
296+
"source": [
297+
"If you have a categorical variable with low cardinality, you can change its data type to `category` to reduce memory usage.\n",
198298
"\n",
199-
"X, y = load_iris(as_frame=True, return_X_y=True)\n",
200-
"df = pd.concat([X, pd.DataFrame(y, columns=[\"target\"])], axis=1)\n",
201-
"df.memory_usage()"
299+
"The \"target\" column has only 3 unique values, making it a good candidate for the category data type to save memory."
300+
]
301+
},
302+
{
303+
"cell_type": "code",
304+
"execution_count": 48,
305+
"id": "6b1769e9-61f4-4a1d-a0d2-ffc30567c722",
306+
"metadata": {},
307+
"outputs": [
308+
{
309+
"data": {
310+
"text/plain": [
311+
"3"
312+
]
313+
},
314+
"execution_count": 48,
315+
"metadata": {},
316+
"output_type": "execute_result"
317+
}
318+
],
319+
"source": [
320+
"# Count the unique values in the \"target\" column\n",

321+
"df['target'].nunique()"
322+
]
323+
},
324+
{
325+
"cell_type": "code",
326+
"execution_count": 30,
327+
"id": "d236a672-3485-4503-a7d6-849c2fc6dfed",
328+
"metadata": {},
329+
"outputs": [
330+
{
331+
"data": {
332+
"text/plain": [
333+
"1332"
334+
]
335+
},
336+
"execution_count": 30,
337+
"metadata": {},
338+
"output_type": "execute_result"
339+
}
340+
],
341+
"source": [
342+
"df['target'].memory_usage()"
202343
]
203344
},
204345
{
205346
"cell_type": "code",
206-
"execution_count": 5,
347+
"execution_count": 38,
207348
"id": "a770da2a",
208349
"metadata": {
209350
"ExecuteTime": {
@@ -215,31 +356,64 @@
215356
{
216357
"data": {
217358
"text/plain": [
218-
"Index 128\n",
219-
"sepal length (cm) 1200\n",
220-
"sepal width (cm) 1200\n",
221-
"petal length (cm) 1200\n",
222-
"petal width (cm) 1200\n",
223-
"target 282\n",
224-
"dtype: int64"
359+
"414"
225360
]
226361
},
227-
"execution_count": 5,
362+
"execution_count": 38,
228363
"metadata": {},
229364
"output_type": "execute_result"
230365
}
231366
],
232367
"source": [
233368
"df[\"target\"] = df[\"target\"].astype(\"category\")\n",
234-
"df.memory_usage()"
369+
"df['target'].memory_usage()"
235370
]
236371
},
237372
{
238373
"cell_type": "markdown",
239374
"id": "2f78876c",
240375
"metadata": {},
241376
"source": [
242-
"The memory is now is reduced to almost a fifth of what it was!"
377+
"Here, the memory usage of the \"target\" column decreased from 1332 bytes to 414 bytes, a reduction of approximately 68.9%."
378+
]
379+
},
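The savings come from how a categorical is stored: one copy of the unique values, plus a compact integer code per row instead of a full `int64`. A quick sketch, assuming pandas is installed:

```python
import pandas as pd

# A categorical keeps the unique values once, plus small integer codes per row
s = pd.Series([0, 1, 2, 0, 1] * 30).astype("category")
print(s.cat.categories.tolist())  # the 3 unique values, stored once
print(s.cat.codes.dtype)          # int8 codes replace the original int64 values
```

This is also why `category` only pays off at low cardinality: with mostly unique values, the categories table itself dominates the memory.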
380+
{
381+
"cell_type": "markdown",
382+
"id": "d416217a-75f0-4ba3-be38-65a1386fc288",
383+
"metadata": {},
384+
"source": [
385+
"Applying this reduction to the remaining columns decreases the DataFrame's memory usage from 6.0 KB to 1.6 KB, a reduction of approximately 73.3%."
386+
]
387+
},
388+
{
389+
"cell_type": "code",
390+
"execution_count": 32,
391+
"id": "95737307-6680-4dfe-a0aa-bf1629e981d8",
392+
"metadata": {},
393+
"outputs": [
394+
{
395+
"name": "stdout",
396+
"output_type": "stream",
397+
"text": [
398+
"<class 'pandas.core.frame.DataFrame'>\n",
399+
"RangeIndex: 150 entries, 0 to 149\n",
400+
"Data columns (total 5 columns):\n",
401+
" # Column Non-Null Count Dtype \n",
402+
"--- ------ -------------- ----- \n",
403+
" 0 sepal length (cm) 150 non-null float16 \n",
404+
" 1 sepal width (cm) 150 non-null float16 \n",
405+
" 2 petal length (cm) 150 non-null float16 \n",
406+
" 3 petal width (cm) 150 non-null float16 \n",
407+
" 4 target 150 non-null category\n",
408+
"dtypes: category(1), float16(4)\n",
409+
"memory usage: 1.6 KB\n"
410+
]
411+
}
412+
],
413+
"source": [
414+
"float_cols = df.select_dtypes(include=['float64']).columns\n",
415+
"df[float_cols] = df[float_cols].apply(lambda x: x.astype('float16'))\n",
416+
"df.info()"
243417
]
244418
},
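As an alternative to hard-coding the target dtype, `pd.to_numeric` with `downcast="float"` picks the smallest safe float dtype automatically (it stops at `float32`, not `float16`). A minimal sketch with hypothetical sample values:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7],
    "petal width (cm)": [0.2, 0.2, 0.2],
})
# downcast="float" converts each column to the smallest float dtype pandas supports
small = df.apply(pd.to_numeric, downcast="float")
print(small.dtypes)  # both columns become float32
```

This trades a little of the `float16` savings for the safety of not having to reason about each column's range yourself.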
245419
{

Chapter5/SQL.ipynb

Lines changed: 22 additions & 5 deletions
Original file line number · Diff line number · Diff line change
@@ -379,14 +379,13 @@
379379
]
380380
},
381381
{
382-
"attachments": {},
383382
"cell_type": "markdown",
384-
"id": "c421db69",
383+
"id": "4b9c9bed-740f-451e-8b9e-9a66c646f748",
385384
"metadata": {},
386385
"source": [
387-
"Linting helps ensure that code follows consistent style conventions, making it easier to understand and maintain. With SQLFluff, you can automatically lint your SQL code and correct most linting errors, freeing you up to focus on more important tasks.\n",
386+
"Inconsistent SQL formatting and style errors can reduce code readability and make code reviews more difficult.\n",
388387
"\n",
389-
"SQLFluff supports various SQL dialects such as ANSI, MySQL, PostgreSQL, BigQuery, Databricks, Oracle, Teradata, etc."
388+
"SQLFluff solves this problem by automatically linting and fixing SQL code formatting issues across multiple dialects, including ANSI, MySQL, PostgreSQL, BigQuery, Databricks, Oracle, and more."
390389
]
391390
},
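When SQLFluff is used across a whole project, the dialect can also be pinned in a `.sqlfluff` configuration file at the repository root, so it does not need to be passed on every invocation. A minimal sketch of such a file:

```ini
[sqlfluff]
dialect = postgres
```

With this file in place, `sqlfluff lint sqlfluff_example.sql` picks up the dialect automatically.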
392391
{
@@ -395,7 +394,7 @@
395394
"id": "4efc819a",
396395
"metadata": {},
397396
"source": [
398-
"In the code below, we use SQLFLuff to lint and fix the SQL code in the file `sqlfluff_example.sql`."
397+
"To demonstrate, let's create a sample SQL file named `sqlfluff_example.sql` with a simple `SELECT` statement."
399398
]
400399
},
401400
{
@@ -416,6 +415,8 @@
416415
"id": "b6f66b3c",
417416
"metadata": {},
418417
"source": [
418+
"Next, run the SQLFluff linter on the `sqlfluff_example.sql` file using the PostgreSQL dialect. \n",
419+
"\n",
419420
"```bash\n",
420421
"$ sqlfluff lint sqlfluff_example.sql --dialect postgres\n",
421422
"```"
@@ -460,6 +461,14 @@
460461
"!sqlfluff lint sqlfluff_example.sql --dialect postgres"
461462
]
462463
},
464+
{
465+
"cell_type": "markdown",
466+
"id": "bf29c68a-2e54-4537-8a7c-d588182a4ada",
467+
"metadata": {},
468+
"source": [
469+
"To fix the style errors and inconsistencies found by the linter, we can run the `fix` command."
470+
]
471+
},
463472
{
464473
"attachments": {},
465474
"cell_type": "markdown",
@@ -492,6 +501,14 @@
492501
"%cat sqlfluff_example.sql"
493502
]
494503
},
504+
{
505+
"cell_type": "markdown",
506+
"id": "4844c6bb-1c83-4fab-8563-475ced95ab24",
507+
"metadata": {},
508+
"source": [
509+
"Now, the SQL code is formatted and readable."
510+
]
511+
},
495512
{
496513
"attachments": {},
497514
"cell_type": "markdown",
