Commit ec541ac

add polars vs pandas
1 parent 1210a51 commit ec541ac

4 files changed

+243
-17
lines changed

Chapter5/better_pandas.ipynb

Lines changed: 90 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -2982,12 +2982,100 @@
29822982
},
29832983
{
29842984
"cell_type": "markdown",
2985-
"id": "711c5d5a",
2985+
"id": "9c8a143e",
2986+
"metadata": {},
2987+
"source": [
2988+
"### Pandas vs Polars: Harnessing Parallelism for Faster Data Processing"
2989+
]
2990+
},
2991+
{
2992+
"cell_type": "code",
2993+
"execution_count": null,
2994+
"id": "bccc50c3",
2995+
"metadata": {
2996+
"tags": [
2997+
"hide-cell"
2998+
]
2999+
},
3000+
"outputs": [],
3001+
"source": [
3002+
"!pip install polars"
3003+
]
3004+
},
3005+
{
3006+
"cell_type": "markdown",
3007+
"id": "35b7b8c6",
3008+
"metadata": {},
3009+
"source": [
3010+
"Pandas runs most operations on a single thread, so it typically uses only one CPU core. To parallelize a Pandas workload, you need an additional library such as Dask."
3011+
]
3012+
},
3013+
{
3014+
"cell_type": "code",
3015+
"execution_count": 1,
3016+
"id": "7ff191eb",
3017+
"metadata": {},
3018+
"outputs": [],
3019+
"source": [
3020+
"import pandas as pd\n",
3021+
"import multiprocessing as mp\n",
3022+
"import dask.dataframe as dd\n",
3023+
"\n",
3024+
"\n",
3025+
"df = pd.DataFrame({\"A\": range(1_000_000), \"B\": range(1_000_000)})\n",
3026+
"\n",
3027+
"# Perform the groupby and sum operation in parallel \n",
3028+
"ddf = dd.from_pandas(df, npartitions=mp.cpu_count())\n",
3029+
"result = ddf.groupby(\"A\").sum().compute()"
3030+
]
3031+
},
3032+
{
3033+
"cell_type": "markdown",
3034+
"id": "d1bd6806",
3035+
"metadata": {},
3036+
"source": [
3037+
"Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration."
3038+
]
3039+
},
3040+
{
3041+
"cell_type": "code",
3042+
"execution_count": 3,
3043+
"id": "b26f5d69",
3044+
"metadata": {},
3045+
"outputs": [],
3046+
"source": [
3047+
"import polars as pl\n",
3048+
"\n",
3049+
"df = pl.DataFrame({\"A\": range(1_000_000), \"B\": range(1_000_000)})\n",
3050+
"\n",
3051+
"# Perform the groupby and sum operation in parallel \n",
3052+
"result = df.group_by(\"A\").sum()"
3053+
]
3054+
},
3055+
{
3056+
"cell_type": "markdown",
3057+
"id": "b050b3c8",
3058+
"metadata": {},
3059+
"source": [
3060+
"[Link to polars](https://bit.ly/3v9dmCT)."
3061+
]
3062+
},
3063+
{
3064+
"cell_type": "markdown",
3065+
"id": "a819b1e3",
29863066
"metadata": {},
29873067
"source": [
29883068
"### Simple and Expressive Data Transformation with Polars"
29893069
]
29903070
},
3071+
{
3072+
"cell_type": "markdown",
3073+
"id": "6c469e37",
3074+
"metadata": {},
3075+
"source": [
3076+
"Extract features and select only relevant features for each time series."
3077+
]
3078+
},
29913079
{
29923080
"cell_type": "code",
29933081
"execution_count": null,
@@ -3087,7 +3175,7 @@
30873175
{
30883176
"cell_type": "code",
30893177
"execution_count": 25,
3090-
"id": "e19f72fc",
3178+
"id": "6243b002",
30913179
"metadata": {},
30923180
"outputs": [
30933181
{

docs/Chapter5/better_pandas.html

Lines changed: 62 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -528,10 +528,11 @@ <h2> Contents </h2>
528528
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-blazing-fast-dataframe-library">6.12.13. Polars: Blazing Fast DataFrame Library</a></li>
529529
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-speed-up-data-processing-12x-with-lazy-execution">6.12.14. Polars: Speed Up Data Processing 12x with Lazy Execution</a></li>
530530
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-vs-pandas-for-csv-loading-and-filtering">6.12.15. Polars vs. Pandas for CSV Loading and Filtering</a></li>
531-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simple-and-expressive-data-transformation-with-polars">6.12.16. Simple and Expressive Data Transformation with Polars</a></li>
532-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#harness-polars-and-delta-lake-for-blazing-fast-performance">6.12.17. Harness Polars and Delta Lake for Blazing Fast Performance</a></li>
533-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#parallel-execution-of-multiple-files-with-polars">6.12.18. Parallel Execution of Multiple Files with Polars</a></li>
534-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-streaming-mode-a-solution-for-large-data-sets">6.12.19. Polars’ Streaming Mode: A Solution for Large Data Sets</a></li>
531+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#pandas-vs-polars-harnessing-parallelism-for-faster-data-processing">6.12.16. Pandas vs Polars: Harnessing Parallelism for Faster Data Processing</a></li>
532+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simple-and-expressive-data-transformation-with-polars">6.12.17. Simple and Expressive Data Transformation with Polars</a></li>
533+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#harness-polars-and-delta-lake-for-blazing-fast-performance">6.12.18. Harness Polars and Delta Lake for Blazing Fast Performance</a></li>
534+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#parallel-execution-of-multiple-files-with-polars">6.12.19. Parallel Execution of Multiple Files with Polars</a></li>
535+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-streaming-mode-a-solution-for-large-data-sets">6.12.20. Polars’ Streaming Mode: A Solution for Large Data Sets</a></li>
535536
</ul>
536537
</nav>
537538
</div>
@@ -2497,8 +2498,56 @@ <h2><span class="section-number">6.12.15. </span>Polars vs. Pandas for CSV Loadi
24972498
</div>
24982499
</div>
24992500
</section>
2501+
<section id="pandas-vs-polars-harnessing-parallelism-for-faster-data-processing">
2502+
<h2><span class="section-number">6.12.16. </span>Pandas vs Polars: Harnessing Parallelism for Faster Data Processing<a class="headerlink" href="#pandas-vs-polars-harnessing-parallelism-for-faster-data-processing" title="Permalink to this heading">#</a></h2>
2503+
<div class="cell tag_hide-cell docutils container">
2504+
<details class="hide above-input">
2505+
<summary aria-label="Toggle hidden content">
2506+
<span class="collapsed">Show code cell content</span>
2507+
<span class="expanded">Hide code cell content</span>
2508+
</summary>
2509+
<div class="cell_input docutils container">
2510+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>polars
2511+
</pre></div>
2512+
</div>
2513+
</div>
2514+
</details>
2515+
</div>
2516+
<p>Pandas runs most operations on a single thread, so it typically uses only one CPU core. To parallelize a Pandas workload, you need an additional library such as Dask.</p>
2517+
<div class="cell docutils container">
2518+
<div class="cell_input docutils container">
2519+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
2520+
<span class="kn">import</span> <span class="nn">multiprocessing</span> <span class="k">as</span> <span class="nn">mp</span>
2521+
<span class="kn">import</span> <span class="nn">dask.dataframe</span> <span class="k">as</span> <span class="nn">dd</span>
2522+
2523+
2524+
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s2">&quot;A&quot;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1_000_000</span><span class="p">),</span> <span class="s2">&quot;B&quot;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1_000_000</span><span class="p">)})</span>
2525+
2526+
<span class="c1"># Perform the groupby and sum operation in parallel </span>
2527+
<span class="n">ddf</span> <span class="o">=</span> <span class="n">dd</span><span class="o">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">npartitions</span><span class="o">=</span><span class="n">mp</span><span class="o">.</span><span class="n">cpu_count</span><span class="p">())</span>
2528+
<span class="n">result</span> <span class="o">=</span> <span class="n">ddf</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s2">&quot;A&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">compute</span><span class="p">()</span>
2529+
</pre></div>
2530+
</div>
2531+
</div>
2532+
</div>
2533+
<p>Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration.</p>
2534+
<div class="cell docutils container">
2535+
<div class="cell_input docutils container">
2536+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span>
2537+
2538+
<span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s2">&quot;A&quot;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1_000_000</span><span class="p">),</span> <span class="s2">&quot;B&quot;</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1_000_000</span><span class="p">)})</span>
2539+
2540+
<span class="c1"># Perform the groupby and sum operation in parallel </span>
2541+
<span class="n">result</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&quot;A&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
2542+
</pre></div>
2543+
</div>
2544+
</div>
2545+
</div>
2546+
<p><a class="reference external" href="https://bit.ly/3v9dmCT">Link to polars</a>.</p>
2547+
</section>
25002548
<section id="simple-and-expressive-data-transformation-with-polars">
2501-
<h2><span class="section-number">6.12.16. </span>Simple and Expressive Data Transformation with Polars<a class="headerlink" href="#simple-and-expressive-data-transformation-with-polars" title="Permalink to this heading">#</a></h2>
2549+
<h2><span class="section-number">6.12.17. </span>Simple and Expressive Data Transformation with Polars<a class="headerlink" href="#simple-and-expressive-data-transformation-with-polars" title="Permalink to this heading">#</a></h2>
2550+
<p>Extract features and select only relevant features for each time series.</p>
25022551
<div class="cell tag_hide-cell docutils container">
25032552
<details class="hide above-input">
25042553
<summary aria-label="Toggle hidden content">
@@ -2592,7 +2641,7 @@ <h2><span class="section-number">6.12.16. </span>Simple and Expressive Data Tran
25922641
</div>
25932642
</section>
25942643
<section id="harness-polars-and-delta-lake-for-blazing-fast-performance">
2595-
<h2><span class="section-number">6.12.17. </span>Harness Polars and Delta Lake for Blazing Fast Performance<a class="headerlink" href="#harness-polars-and-delta-lake-for-blazing-fast-performance" title="Permalink to this heading">#</a></h2>
2644+
<h2><span class="section-number">6.12.18. </span>Harness Polars and Delta Lake for Blazing Fast Performance<a class="headerlink" href="#harness-polars-and-delta-lake-for-blazing-fast-performance" title="Permalink to this heading">#</a></h2>
25962645
<div class="cell tag_hide-cell docutils container">
25972646
<details class="hide above-input">
25982647
<summary aria-label="Toggle hidden content">
@@ -2813,7 +2862,7 @@ <h2><span class="section-number">6.12.17. </span>Harness Polars and Delta Lake f
28132862
<p><a class="reference external" href="https://github.com/delta-io/delta-rs">Link to delta-rs</a>.</p>
28142863
</section>
28152864
<section id="parallel-execution-of-multiple-files-with-polars">
2816-
<h2><span class="section-number">6.12.18. </span>Parallel Execution of Multiple Files with Polars<a class="headerlink" href="#parallel-execution-of-multiple-files-with-polars" title="Permalink to this heading">#</a></h2>
2865+
<h2><span class="section-number">6.12.19. </span>Parallel Execution of Multiple Files with Polars<a class="headerlink" href="#parallel-execution-of-multiple-files-with-polars" title="Permalink to this heading">#</a></h2>
28172866
<div class="cell tag_hide-cell docutils container">
28182867
<details class="hide above-input">
28192868
<summary aria-label="Toggle hidden content">
@@ -2884,7 +2933,7 @@ <h2><span class="section-number">6.12.18. </span>Parallel Execution of Multiple
28842933
<p><a class="reference external" href="https://github.com/pola-rs/polars">Link to polars</a></p>
28852934
</section>
28862935
<section id="polars-streaming-mode-a-solution-for-large-data-sets">
2887-
<h2><span class="section-number">6.12.19. </span>Polars’ Streaming Mode: A Solution for Large Data Sets<a class="headerlink" href="#polars-streaming-mode-a-solution-for-large-data-sets" title="Permalink to this heading">#</a></h2>
2936+
<h2><span class="section-number">6.12.20. </span>Polars’ Streaming Mode: A Solution for Large Data Sets<a class="headerlink" href="#polars-streaming-mode-a-solution-for-large-data-sets" title="Permalink to this heading">#</a></h2>
28882937
<div class="cell tag_hide-cell docutils container">
28892938
<details class="hide above-input">
28902939
<summary aria-label="Toggle hidden content">
@@ -2996,10 +3045,11 @@ <h2><span class="section-number">6.12.19. </span>Polars’ Streaming Mode: A Sol
29963045
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-blazing-fast-dataframe-library">6.12.13. Polars: Blazing Fast DataFrame Library</a></li>
29973046
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-speed-up-data-processing-12x-with-lazy-execution">6.12.14. Polars: Speed Up Data Processing 12x with Lazy Execution</a></li>
29983047
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-vs-pandas-for-csv-loading-and-filtering">6.12.15. Polars vs. Pandas for CSV Loading and Filtering</a></li>
2999-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simple-and-expressive-data-transformation-with-polars">6.12.16. Simple and Expressive Data Transformation with Polars</a></li>
3000-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#harness-polars-and-delta-lake-for-blazing-fast-performance">6.12.17. Harness Polars and Delta Lake for Blazing Fast Performance</a></li>
3001-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#parallel-execution-of-multiple-files-with-polars">6.12.18. Parallel Execution of Multiple Files with Polars</a></li>
3002-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-streaming-mode-a-solution-for-large-data-sets">6.12.19. Polars’ Streaming Mode: A Solution for Large Data Sets</a></li>
3048+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#pandas-vs-polars-harnessing-parallelism-for-faster-data-processing">6.12.16. Pandas vs Polars: Harnessing Parallelism for Faster Data Processing</a></li>
3049+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#simple-and-expressive-data-transformation-with-polars">6.12.17. Simple and Expressive Data Transformation with Polars</a></li>
3050+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#harness-polars-and-delta-lake-for-blazing-fast-performance">6.12.18. Harness Polars and Delta Lake for Blazing Fast Performance</a></li>
3051+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#parallel-execution-of-multiple-files-with-polars">6.12.19. Parallel Execution of Multiple Files with Polars</a></li>
3052+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#polars-streaming-mode-a-solution-for-large-data-sets">6.12.20. Polars’ Streaming Mode: A Solution for Large Data Sets</a></li>
30033053
</ul>
30043054
</nav></div>
30053055

docs/_sources/Chapter5/better_pandas.ipynb

Lines changed: 90 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2982,12 +2982,100 @@
29822982
},
29832983
{
29842984
"cell_type": "markdown",
2985-
"id": "711c5d5a",
2985+
"id": "9c8a143e",
2986+
"metadata": {},
2987+
"source": [
2988+
"### Pandas vs Polars: Harnessing Parallelism for Faster Data Processing"
2989+
]
2990+
},
2991+
{
2992+
"cell_type": "code",
2993+
"execution_count": null,
2994+
"id": "bccc50c3",
2995+
"metadata": {
2996+
"tags": [
2997+
"hide-cell"
2998+
]
2999+
},
3000+
"outputs": [],
3001+
"source": [
3002+
"!pip install polars"
3003+
]
3004+
},
3005+
{
3006+
"cell_type": "markdown",
3007+
"id": "35b7b8c6",
3008+
"metadata": {},
3009+
"source": [
3010+
"Pandas runs most operations on a single thread, so it typically uses only one CPU core. To parallelize a Pandas workload, you need an additional library such as Dask."
3011+
]
3012+
},
3013+
{
3014+
"cell_type": "code",
3015+
"execution_count": 1,
3016+
"id": "7ff191eb",
3017+
"metadata": {},
3018+
"outputs": [],
3019+
"source": [
3020+
"import pandas as pd\n",
3021+
"import multiprocessing as mp\n",
3022+
"import dask.dataframe as dd\n",
3023+
"\n",
3024+
"\n",
3025+
"df = pd.DataFrame({\"A\": range(1_000_000), \"B\": range(1_000_000)})\n",
3026+
"\n",
3027+
"# Perform the groupby and sum operation in parallel \n",
3028+
"ddf = dd.from_pandas(df, npartitions=mp.cpu_count())\n",
3029+
"result = ddf.groupby(\"A\").sum().compute()"
3030+
]
3031+
},
3032+
{
3033+
"cell_type": "markdown",
3034+
"id": "d1bd6806",
3035+
"metadata": {},
3036+
"source": [
3037+
"Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration."
3038+
]
3039+
},
3040+
{
3041+
"cell_type": "code",
3042+
"execution_count": 3,
3043+
"id": "b26f5d69",
3044+
"metadata": {},
3045+
"outputs": [],
3046+
"source": [
3047+
"import polars as pl\n",
3048+
"\n",
3049+
"df = pl.DataFrame({\"A\": range(1_000_000), \"B\": range(1_000_000)})\n",
3050+
"\n",
3051+
"# Perform the groupby and sum operation in parallel \n",
3052+
"result = df.group_by(\"A\").sum()"
3053+
]
3054+
},
3055+
{
3056+
"cell_type": "markdown",
3057+
"id": "b050b3c8",
3058+
"metadata": {},
3059+
"source": [
3060+
"[Link to polars](https://bit.ly/3v9dmCT)."
3061+
]
3062+
},
3063+
{
3064+
"cell_type": "markdown",
3065+
"id": "a819b1e3",
29863066
"metadata": {},
29873067
"source": [
29883068
"### Simple and Expressive Data Transformation with Polars"
29893069
]
29903070
},
3071+
{
3072+
"cell_type": "markdown",
3073+
"id": "6c469e37",
3074+
"metadata": {},
3075+
"source": [
3076+
"Extract features and select only relevant features for each time series."
3077+
]
3078+
},
29913079
{
29923080
"cell_type": "code",
29933081
"execution_count": null,
@@ -3087,7 +3175,7 @@
30873175
{
30883176
"cell_type": "code",
30893177
"execution_count": 25,
3090-
"id": "e19f72fc",
3178+
"id": "6243b002",
30913179
"metadata": {},
30923180
"outputs": [
30933181
{

docs/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
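
Note on the Dask cell added in this commit: it splits the frame into one partition per CPU core and aggregates them in parallel. The block below is a rough, standard-library-only sketch of that partition, aggregate, combine pattern. It is not Dask's or Polars' actual implementation. The function names `partial_group_sum` and `parallel_group_sum` are hypothetical, and the rows are plain `(A, B)` tuples rather than a DataFrame.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import os


def partial_group_sum(rows):
    """Sum column B per key A within a single partition."""
    totals = Counter()
    for a, b in rows:
        totals[a] += b
    return totals


def parallel_group_sum(rows, npartitions=None):
    """Split rows into partitions, aggregate each one, then merge the partial sums."""
    npartitions = npartitions or os.cpu_count() or 1
    size = max(1, -(-len(rows) // npartitions))  # ceiling division
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]

    # Each worker aggregates one partition independently.
    with ThreadPoolExecutor(max_workers=npartitions) as pool:
        partials = list(pool.map(partial_group_sum, partitions))

    # Combine step: Counter.update adds values for keys seen in several partitions.
    combined = Counter()
    for part in partials:
        combined.update(part)
    return dict(combined)


# Hypothetical toy data: key A cycles through 0, 1, 2; B is the row index.
rows = [(i % 3, i) for i in range(12)]
result = parallel_group_sum(rows, npartitions=4)
print(result)
```

Threads are used here only to keep the sketch self-contained. For pure-Python, CPU-bound work the GIL generally prevents a real speedup, which is one reason a dedicated library (or an engine such as Polars that parallelizes outside the Python interpreter) is the practical choice.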
