Commit 2112273

add delta lake vs parquet
1 parent 04b4bcf commit 2112273

15 files changed, with 1005 additions and 528 deletions

Chapter5/better_pandas.ipynb

Lines changed: 115 additions & 12 deletions
@@ -1226,7 +1226,115 @@
12261226
"id": "5bd148cd",
12271227
"metadata": {},
12281228
"source": [
1229-
"[Link to delta-rs](https://github.com/delta-io/delta-rs)."
1229+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
1230+
]
1231+
},
1232+
{
1233+
"cell_type": "markdown",
1234+
"id": "226389ac-1506-458b-b28d-49ab486b86d4",
1235+
"metadata": {},
1236+
"source": [
1237+
"### Beyond Parquet: Reliable Data Storage with Delta Lake"
1238+
]
1239+
},
1240+
{
1241+
"cell_type": "markdown",
1242+
"id": "7bca5dcd-5336-470f-9790-0a1f52cc65dd",
1243+
"metadata": {},
1244+
"source": [
1245+
"Plain Parquet files offer no transactional guarantees, so a write that fails partway can leave incomplete data files behind, with no clear way to recover after a system crash.\n",
1246+
"\n",
1247+
"Delta Lake's ACID transactions address this. Delta Lake:\n",
1248+
"- Ensures that either all of the data is written or none of it is\n",
1249+
"- Maintains a transaction log that tracks every change\n",
1250+
"- Provides time travel to earlier table versions for recovering from failures"
1251+
]
1252+
},
1253+
{
1254+
"cell_type": "markdown",
1255+
"id": "99289400-1697-4457-b98e-e8f132739fca",
1256+
"metadata": {},
1257+
"source": [
1258+
"Here's an example showing Delta Lake's reliable write operation:"
1259+
]
1260+
},
1261+
{
1262+
"cell_type": "code",
1263+
"execution_count": 9,
1264+
"id": "8a510b91-911a-4135-949b-f2dff63e852d",
1265+
"metadata": {},
1266+
"outputs": [],
1267+
"source": [
1268+
"from deltalake import write_deltalake, DeltaTable\n",
1269+
"import pandas as pd\n",
1270+
"\n",
1271+
"initial_data = pd.DataFrame({\n",
1272+
"    \"id\": [1, 2],\n",
1273+
"    \"value\": [\"a\", \"b\"]\n",
1274+
"})\n",
1275+
"\n",
1276+
"write_deltalake(\"customers\", initial_data)"
1277+
]
1278+
},
1279+
{
1280+
"cell_type": "markdown",
1281+
"id": "05b5bb11-2cda-4578-b858-e4cd94a02785",
1282+
"metadata": {},
1283+
"source": [
1284+
"If an append fails partway through, Delta Lake's transaction log ensures that the table remains in its last valid state:"
1285+
]
1286+
},
1287+
{
1288+
"cell_type": "code",
1289+
"execution_count": 10,
1290+
"id": "e7a63193-804a-48ce-a548-494f6442bbf3",
1291+
"metadata": {},
1292+
"outputs": [
1293+
{
1294+
"name": "stdout",
1295+
"output_type": "stream",
1296+
"text": [
1297+
"Write failed: System crash during append!\n",
1298+
"\n",
1299+
"Table state after failed append:\n",
1300+
" id value\n",
1301+
"0 1 a\n",
1302+
"1 2 b\n",
1303+
"\n",
1304+
"Current version: 0\n"
1305+
]
1306+
}
1307+
],
1308+
"source": [
1309+
"try:\n",
1310+
"    # Build a large batch of 1000 new rows to append\n",
1311+
"    new_data = pd.DataFrame({\n",
1312+
"        \"id\": range(3, 1003),  # 1000 new rows\n",
1313+
"        \"value\": [\"error\"] * 1000\n",
1314+
"    })\n",
1315+
"    \n",
1316+
"    # Simulate a system crash: the exception is raised before the append below runs\n",
1317+
"    raise Exception(\"System crash during append!\")\n",
1318+
"    write_deltalake(\"customers\", new_data, mode=\"append\")\n",
1319+
"    \n",
1320+
"except Exception as e:\n",
1321+
"    print(f\"Write failed: {e}\")\n",
1322+
"    \n",
1323+
"    # Check table state - still contains only initial data\n",
1324+
"    dt = DeltaTable(\"customers\")\n",
1325+
"    print(\"\\nTable state after failed append:\")\n",
1326+
"    print(dt.to_pandas())\n",
1327+
"    \n",
1328+
"    # Verify that no new version was committed\n",
1329+
"    print(f\"\\nCurrent version: {dt.version()}\")"
1330+
]
1331+
},
1332+
{
1333+
"cell_type": "markdown",
1334+
"id": "824fa5ce-ddca-4677-8b66-be590b527969",
1335+
"metadata": {},
1336+
"source": [
1337+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
12301338
]
12311339
},
12321340
{
@@ -1449,7 +1557,7 @@
14491557
"id": "540812a4",
14501558
"metadata": {},
14511559
"source": [
1452-
"[Link to delta-rs](https://github.com/delta-io/delta-rs)."
1560+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
14531561
]
14541562
},
14551563
{
@@ -1815,7 +1923,7 @@
18151923
"id": "c65f16ea",
18161924
"metadata": {},
18171925
"source": [
1818-
"[Link to delta-rs](https://github.com/delta-io/delta-rs)."
1926+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
18191927
]
18201928
},
18211929
{
@@ -1929,7 +2037,7 @@
19292037
"id": "bd328f2c",
19302038
"metadata": {},
19312039
"source": [
1932-
"[Link to delta-rs](https://github.com/delta-io/delta-rs)."
2040+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
19332041
]
19342042
},
19352043
{
@@ -2109,7 +2217,7 @@
21092217
"id": "029c0b9d",
21102218
"metadata": {},
21112219
"source": [
2112-
"[Link to delta-rs](https://github.com/delta-io/delta-rs)."
2220+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
21132221
]
21142222
},
21152223
{
@@ -2258,7 +2366,7 @@
22582366
"id": "78c06861",
22592367
"metadata": {},
22602368
"source": [
2261-
"[Link to delta-rs](https://github.com/delta-io/delta-rs)."
2369+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
22622370
]
22632371
},
22642372
{
@@ -4075,7 +4183,7 @@
40754183
"source": [
40764184
"[Link to polars](https://github.com/pola-rs/polars)\n",
40774185
"\n",
4078-
"[Link to delta-rs](https://github.com/delta-io/delta-rs)."
4186+
"[Link to Delta Lake](https://github.com/delta-io/delta-rs)."
40794187
]
40804188
},
40814189
{
@@ -4710,11 +4818,6 @@
47104818
"toc_position": {},
47114819
"toc_section_display": true,
47124820
"toc_window_display": false
4713-
},
4714-
"vscode": {
4715-
"interpreter": {
4716-
"hash": "484329849bb907480cd798e750759bc6f1d66c93f9e78e7055aa0a2c2de6b47b"
4717-
}
47184821
}
47194822
},
47204823
"nbformat": 4,

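The all-or-nothing behavior described in the Delta Lake section of `better_pandas.ipynb` can be illustrated without the `deltalake` package. The sketch below is a toy model built on the standard library only, and it is not Delta Lake's actual implementation: the `TinyLogTable` class, its `_log` directory, and the `fail_before_commit` flag are hypothetical names chosen for illustration. It shows the general idea that data files become visible to readers only after a log entry naming them is committed, so a crash before the commit leaves the table at its previous version.

```python
import json
import os
import tempfile
from pathlib import Path


class TinyLogTable:
    """Toy table: data files are visible only once a log entry names them."""

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_log"  # hypothetical log location
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def version(self):
        """Return the latest committed version, or -1 for an empty table."""
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def append(self, rows, fail_before_commit=False):
        """Write a data file, then publish it with a single log entry."""
        next_version = self.version() + 1
        data_file = self.root / f"part-{next_version:05d}.json"
        data_file.write_text(json.dumps(rows))
        if fail_before_commit:
            # Simulated crash: the data file exists but was never committed.
            raise RuntimeError("System crash during append!")
        # Commit step: write the entry to a temp file, then rename it in place.
        # This toy version does not protect against concurrent writers.
        tmp = self.log_dir / f"{next_version:05d}.json.tmp"
        tmp.write_text(json.dumps({"add": data_file.name}))
        os.replace(tmp, self.log_dir / f"{next_version:05d}.json")

    def read(self):
        """Return rows from committed data files only, in version order."""
        rows = []
        for entry in sorted(self.log_dir.glob("*.json")):
            name = json.loads(entry.read_text())["add"]
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


def demo():
    """Commit two rows, fail a second append, and report the table state."""
    with tempfile.TemporaryDirectory() as tmp:
        table = TinyLogTable(tmp)
        table.append([{"id": 1, "value": "a"}, {"id": 2, "value": "b"}])
        try:
            table.append([{"id": 3, "value": "error"}], fail_before_commit=True)
        except RuntimeError as exc:
            print(f"Write failed: {exc}")
        return table.read(), table.version()
```

Calling `demo()` returns only the two initially committed rows and version 0, mirroring the notebook output in the diff. The orphaned data file from the failed append is simply never read, because no log entry refers to it. Real Delta Lake also deals with concurrent writers, schema enforcement, and checkpoints, all of which this sketch ignores.
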
Chapter5/testing.ipynb

Lines changed: 8 additions & 115 deletions
@@ -257,24 +257,7 @@
257257
},
258258
{
259259
"data": {
260-
"application/javascript": [
261-
"\n",
262-
" setTimeout(function() {\n",
263-
" var nbb_cell_id = 42;\n",
264-
" var nbb_unformatted_code = \"!pytest pytest_benchmark_example.py \";\n",
265-
" var nbb_formatted_code = \"!pytest pytest_benchmark_example.py\";\n",
266-
" var nbb_cells = Jupyter.notebook.get_cells();\n",
267-
" for (var i = 0; i < nbb_cells.length; ++i) {\n",
268-
" if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n",
269-
" if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n",
270-
" nbb_cells[i].set_text(nbb_formatted_code);\n",
271-
" }\n",
272-
" break;\n",
273-
" }\n",
274-
" }\n",
275-
" }, 500);\n",
276-
" "
277-
],
260+
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 42;\n var nbb_unformatted_code = \"!pytest pytest_benchmark_example.py \";\n var nbb_formatted_code = \"!pytest pytest_benchmark_example.py\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
278261
"text/plain": [
279262
"<IPython.core.display.Javascript object>"
280263
]
@@ -3597,24 +3580,7 @@
35973580
},
35983581
{
35993582
"data": {
3600-
"application/javascript": [
3601-
"\n",
3602-
" setTimeout(function() {\n",
3603-
" var nbb_cell_id = 25;\n",
3604-
" var nbb_unformatted_code = \"experience1 = {\\\"machine learning\\\": 2, \\\"python\\\": 3}\\nexperience2 = {\\\"ml\\\": 2, \\\"python\\\": 3}\\n\\nDeepDiff(\\n experience1,\\n experience2,\\n exclude_paths={\\\"root['ml']\\\", \\\"root['machine learning']\\\"},\\n)\";\n",
3605-
" var nbb_formatted_code = \"experience1 = {\\\"machine learning\\\": 2, \\\"python\\\": 3}\\nexperience2 = {\\\"ml\\\": 2, \\\"python\\\": 3}\\n\\nDeepDiff(\\n experience1,\\n experience2,\\n exclude_paths={\\\"root['ml']\\\", \\\"root['machine learning']\\\"},\\n)\";\n",
3606-
" var nbb_cells = Jupyter.notebook.get_cells();\n",
3607-
" for (var i = 0; i < nbb_cells.length; ++i) {\n",
3608-
" if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n",
3609-
" if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n",
3610-
" nbb_cells[i].set_text(nbb_formatted_code);\n",
3611-
" }\n",
3612-
" break;\n",
3613-
" }\n",
3614-
" }\n",
3615-
" }, 500);\n",
3616-
" "
3617-
],
3583+
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 25;\n var nbb_unformatted_code = \"experience1 = {\\\"machine learning\\\": 2, \\\"python\\\": 3}\\nexperience2 = {\\\"ml\\\": 2, \\\"python\\\": 3}\\n\\nDeepDiff(\\n experience1,\\n experience2,\\n exclude_paths={\\\"root['ml']\\\", \\\"root['machine learning']\\\"},\\n)\";\n var nbb_formatted_code = \"experience1 = {\\\"machine learning\\\": 2, \\\"python\\\": 3}\\nexperience2 = {\\\"ml\\\": 2, \\\"python\\\": 3}\\n\\nDeepDiff(\\n experience1,\\n experience2,\\n exclude_paths={\\\"root['ml']\\\", \\\"root['machine learning']\\\"},\\n)\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
36183584
"text/plain": [
36193585
"<IPython.core.display.Javascript object>"
36203586
]
@@ -3666,24 +3632,7 @@
36663632
},
36673633
{
36683634
"data": {
3669-
"application/javascript": [
3670-
"\n",
3671-
" setTimeout(function() {\n",
3672-
" var nbb_cell_id = 34;\n",
3673-
" var nbb_unformatted_code = \"num1 = 0.258\\nnum2 = 0.259\\n\\nDeepDiff(num1, num2, significant_digits=2)\";\n",
3674-
" var nbb_formatted_code = \"num1 = 0.258\\nnum2 = 0.259\\n\\nDeepDiff(num1, num2, significant_digits=2)\";\n",
3675-
" var nbb_cells = Jupyter.notebook.get_cells();\n",
3676-
" for (var i = 0; i < nbb_cells.length; ++i) {\n",
3677-
" if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n",
3678-
" if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n",
3679-
" nbb_cells[i].set_text(nbb_formatted_code);\n",
3680-
" }\n",
3681-
" break;\n",
3682-
" }\n",
3683-
" }\n",
3684-
" }, 500);\n",
3685-
" "
3686-
],
3635+
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 34;\n var nbb_unformatted_code = \"num1 = 0.258\\nnum2 = 0.259\\n\\nDeepDiff(num1, num2, significant_digits=2)\";\n var nbb_formatted_code = \"num1 = 0.258\\nnum2 = 0.259\\n\\nDeepDiff(num1, num2, significant_digits=2)\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
36873636
"text/plain": [
36883637
"<IPython.core.display.Javascript object>"
36893638
]
@@ -3997,24 +3946,7 @@
39973946
"outputs": [
39983947
{
39993948
"data": {
4000-
"application/javascript": [
4001-
"\n",
4002-
" setTimeout(function() {\n",
4003-
" var nbb_cell_id = 18;\n",
4004-
" var nbb_unformatted_code = \"from deepchecks.checks.integrity.new_category import CategoryMismatchTrainTest\\nfrom deepchecks.base import Dataset\\nimport pandas as pd\";\n",
4005-
" var nbb_formatted_code = \"from deepchecks.checks.integrity.new_category import CategoryMismatchTrainTest\\nfrom deepchecks.base import Dataset\\nimport pandas as pd\";\n",
4006-
" var nbb_cells = Jupyter.notebook.get_cells();\n",
4007-
" for (var i = 0; i < nbb_cells.length; ++i) {\n",
4008-
" if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n",
4009-
" if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n",
4010-
" nbb_cells[i].set_text(nbb_formatted_code);\n",
4011-
" }\n",
4012-
" break;\n",
4013-
" }\n",
4014-
" }\n",
4015-
" }, 500);\n",
4016-
" "
4017-
],
3949+
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 18;\n var nbb_unformatted_code = \"from deepchecks.checks.integrity.new_category import CategoryMismatchTrainTest\\nfrom deepchecks.base import Dataset\\nimport pandas as pd\";\n var nbb_formatted_code = \"from deepchecks.checks.integrity.new_category import CategoryMismatchTrainTest\\nfrom deepchecks.base import Dataset\\nimport pandas as pd\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
40183950
"text/plain": [
40193951
"<IPython.core.display.Javascript object>"
40203952
]
@@ -4042,24 +3974,7 @@
40423974
"outputs": [
40433975
{
40443976
"data": {
4045-
"application/javascript": [
4046-
"\n",
4047-
" setTimeout(function() {\n",
4048-
" var nbb_cell_id = 19;\n",
4049-
" var nbb_unformatted_code = \"train = pd.DataFrame({'col1': ['a', 'b', 'c']})\\ntest = pd.DataFrame({'col1': ['c', 'd', 'e']})\\n\\ntrain_ds = Dataset(train, cat_features=['col1'])\\ntest_ds = Dataset(test, cat_features=['col1'])\";\n",
4050-
" var nbb_formatted_code = \"train = pd.DataFrame({\\\"col1\\\": [\\\"a\\\", \\\"b\\\", \\\"c\\\"]})\\ntest = pd.DataFrame({\\\"col1\\\": [\\\"c\\\", \\\"d\\\", \\\"e\\\"]})\\n\\ntrain_ds = Dataset(train, cat_features=[\\\"col1\\\"])\\ntest_ds = Dataset(test, cat_features=[\\\"col1\\\"])\";\n",
4051-
" var nbb_cells = Jupyter.notebook.get_cells();\n",
4052-
" for (var i = 0; i < nbb_cells.length; ++i) {\n",
4053-
" if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n",
4054-
" if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n",
4055-
" nbb_cells[i].set_text(nbb_formatted_code);\n",
4056-
" }\n",
4057-
" break;\n",
4058-
" }\n",
4059-
" }\n",
4060-
" }, 500);\n",
4061-
" "
4062-
],
3977+
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 19;\n var nbb_unformatted_code = \"train = pd.DataFrame({'col1': ['a', 'b', 'c']})\\ntest = pd.DataFrame({'col1': ['c', 'd', 'e']})\\n\\ntrain_ds = Dataset(train, cat_features=['col1'])\\ntest_ds = Dataset(test, cat_features=['col1'])\";\n var nbb_formatted_code = \"train = pd.DataFrame({\\\"col1\\\": [\\\"a\\\", \\\"b\\\", \\\"c\\\"]})\\ntest = pd.DataFrame({\\\"col1\\\": [\\\"c\\\", \\\"d\\\", \\\"e\\\"]})\\n\\ntrain_ds = Dataset(train, cat_features=[\\\"col1\\\"])\\ntest_ds = Dataset(test, cat_features=[\\\"col1\\\"])\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
40633978
"text/plain": [
40643979
"<IPython.core.display.Javascript object>"
40653980
]
@@ -4143,24 +4058,7 @@
41434058
},
41444059
{
41454060
"data": {
4146-
"application/javascript": [
4147-
"\n",
4148-
" setTimeout(function() {\n",
4149-
" var nbb_cell_id = 22;\n",
4150-
" var nbb_unformatted_code = \"CategoryMismatchTrainTest().run(train_ds, test_ds)\";\n",
4151-
" var nbb_formatted_code = \"CategoryMismatchTrainTest().run(train_ds, test_ds)\";\n",
4152-
" var nbb_cells = Jupyter.notebook.get_cells();\n",
4153-
" for (var i = 0; i < nbb_cells.length; ++i) {\n",
4154-
" if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n",
4155-
" if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n",
4156-
" nbb_cells[i].set_text(nbb_formatted_code);\n",
4157-
" }\n",
4158-
" break;\n",
4159-
" }\n",
4160-
" }\n",
4161-
" }, 500);\n",
4162-
" "
4163-
],
4061+
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 22;\n var nbb_unformatted_code = \"CategoryMismatchTrainTest().run(train_ds, test_ds)\";\n var nbb_formatted_code = \"CategoryMismatchTrainTest().run(train_ds, test_ds)\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
41644062
"text/plain": [
41654063
"<IPython.core.display.Javascript object>"
41664064
]
@@ -4723,7 +4621,7 @@
47234621
"celltoolbar": "Tags",
47244622
"hide_input": false,
47254623
"kernelspec": {
4726-
"display_name": "Python 3 (ipykernel)",
4624+
"display_name": "Python 3",
47274625
"language": "python",
47284626
"name": "python3"
47294627
},
@@ -4737,7 +4635,7 @@
47374635
"name": "python",
47384636
"nbconvert_exporter": "python",
47394637
"pygments_lexer": "ipython3",
4740-
"version": "3.11.6"
4638+
"version": "3.11.2"
47414639
},
47424640
"toc": {
47434641
"base_numbering": 1,
@@ -4751,11 +4649,6 @@
47514649
"toc_position": {},
47524650
"toc_section_display": true,
47534651
"toc_window_display": false
4754-
},
4755-
"vscode": {
4756-
"interpreter": {
4757-
"hash": "c3bc044b9863ed6dec4c55e7ad5af27f030f7d27aed3f39d7a4886a926c4e2c1"
4758-
}
47594652
}
47604653
},
47614654
"nbformat": 4,
