Commit 3da19ba

add FlashText
1 parent 153dec7 commit 3da19ba

6 files changed, 119 additions and 18 deletions

.gitignore

Lines changed: 2 additions & 2 deletions
@@ -17,5 +17,5 @@ venv
 .DS_Store
 data
 *_scripts/*
-get_file_size.sh
-*.csv
+*.csv
+generate_url.py

docs/Chapter5/natural_language_processing.html

Lines changed: 42 additions & 0 deletions
@@ -535,6 +535,7 @@ <h2> Contents </h2>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#ekphrasis-text-processing-tool-for-social-media-text">6.6.20. ekphrasis: Text Processing Tool For Social Media Text</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#chroma-the-lightning-fast-solution-to-text-embeddings-and-querying">6.6.21. Chroma: The Lightning-Fast Solution to Text Embeddings and Querying</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#galatic-clean-and-analyze-massive-text-datasets">6.6.22. Galatic: Clean and Analyze Massive Text Datasets</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#efficient-keyword-extraction-and-replacement-with-flashtext">6.6.23. Efficient Keyword Extraction and Replacement with FlashText</a></li>
 </ul>
 </nav>
 </div>
@@ -2368,6 +2369,46 @@ <h2><span class="section-number">6.6.22. </span>Galatic: Clean and Analyze Massi
 </div>
 <p><a class="reference external" href="https://github.com/taylorai/galactic">Link to Galatic</a>.</p>
 </section>
+<section id="efficient-keyword-extraction-and-replacement-with-flashtext">
+<h2><span class="section-number">6.6.23. </span>Efficient Keyword Extraction and Replacement with FlashText<a class="headerlink" href="#efficient-keyword-extraction-and-replacement-with-flashtext" title="Permalink to this heading">#</a></h2>
+<div class="cell tag_hide-cell docutils container">
+<details class="hide above-input">
+<summary aria-label="Toggle hidden content">
+<span class="collapsed">Show code cell content</span>
+<span class="expanded">Hide code cell content</span>
+</summary>
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>flashtext
+</pre></div>
+</div>
+</div>
+</details>
+</div>
+<p>If you want to perform fast keyword extraction and replacement in text, use FlashText.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">flashtext</span> <span class="kn">import</span> <span class="n">KeywordProcessor</span>
+
+<span class="n">keyword_processor</span> <span class="o">=</span> <span class="n">KeywordProcessor</span><span class="p">()</span>
+
+<span class="c1"># Adding keywords with replacements</span>
+<span class="n">keyword_processor</span><span class="o">.</span><span class="n">add_keyword</span><span class="p">(</span><span class="n">keyword</span><span class="o">=</span><span class="s2">&quot;Python&quot;</span><span class="p">)</span>
+<span class="n">keyword_processor</span><span class="o">.</span><span class="n">add_keyword</span><span class="p">(</span><span class="n">keyword</span><span class="o">=</span><span class="s2">&quot;DS&quot;</span><span class="p">,</span> <span class="n">clean_name</span><span class="o">=</span><span class="s2">&quot;data science&quot;</span><span class="p">)</span>
+
+<span class="c1"># Replacing keywords in text</span>
+<span class="n">new_sentence</span> <span class="o">=</span> <span class="n">keyword_processor</span><span class="o">.</span><span class="n">replace_keywords</span><span class="p">(</span><span class="s2">&quot;PYTHON is essential for DS.&quot;</span><span class="p">)</span>
+<span class="n">new_sentence</span>
+</pre></div>
+</div>
+</div>
+<div class="cell_output docutils container">
+<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>&#39;Python is essential for data science.&#39;
+</pre></div>
+</div>
+</div>
+</div>
+<p><a class="reference external" href="https://bit.ly/4bQ1eqt">Link to FlashText</a>.</p>
+</section>
 </section>
 
 <script type="text/x-thebe-config">
@@ -2455,6 +2496,7 @@ <h2><span class="section-number">6.6.22. </span>Galatic: Clean and Analyze Massi
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#ekphrasis-text-processing-tool-for-social-media-text">6.6.20. ekphrasis: Text Processing Tool For Social Media Text</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#chroma-the-lightning-fast-solution-to-text-embeddings-and-querying">6.6.21. Chroma: The Lightning-Fast Solution to Text Embeddings and Querying</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#galatic-clean-and-analyze-massive-text-datasets">6.6.22. Galatic: Clean and Analyze Massive Text Datasets</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#efficient-keyword-extraction-and-replacement-with-flashtext">6.6.23. Efficient Keyword Extraction and Replacement with FlashText</a></li>
 </ul>
 </nav></div>
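
Note on the FlashText cell added in this file: the input "PYTHON" comes back as "Python" because KeywordProcessor matches case-insensitively by default, and a keyword added without a clean_name maps to itself. The section title also mentions extraction, which the cell does not show; FlashText provides an extract_keywords method for that. The sketch below is a standard-library stand-in that reproduces the observable behavior of both operations with a regular expression. It is not FlashText's own algorithm, and the helper names (build_matcher, replace_keywords, extract_keywords) are hypothetical, chosen only for illustration.

```python
import re


def build_matcher(keywords):
    """Build a case-insensitive matcher from {keyword: clean_name} (hypothetical helper)."""
    # Lower-case the keys so lookups ignore the case of the matched text.
    lookup = {keyword.lower(): clean for keyword, clean in keywords.items()}
    # Try longer keywords first so a long phrase wins over one of its prefixes.
    alternatives = sorted((re.escape(k) for k in lookup), key=len, reverse=True)
    # The lookarounds keep matches on whole words only.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)", re.IGNORECASE
    )
    return pattern, lookup


def replace_keywords(text, pattern, lookup):
    """Replace every matched keyword with its clean name."""
    return pattern.sub(lambda m: lookup.get(m.group(0).lower(), m.group(0)), text)


def extract_keywords(text, pattern, lookup):
    """Return the clean names of all keywords found, in order of appearance."""
    return [lookup.get(m.group(0).lower(), m.group(0)) for m in pattern.finditer(text)]


# Same keywords as the notebook cell: "Python" maps to itself, "DS" to "data science".
pattern, lookup = build_matcher({"Python": "Python", "DS": "data science"})
sentence = "PYTHON is essential for DS."

replaced = replace_keywords(sentence, pattern, lookup)
found = extract_keywords(sentence, pattern, lookup)
print(replaced)  # Python is essential for data science.
print(found)     # ['Python', 'data science']
```

A regular expression like this is adequate for a handful of keywords; the stand-in only mirrors the input and output of the cell and says nothing about how FlashText performs on large keyword lists.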

docs/Chapter5/spark.html

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1656,7 +1656,7 @@ <h2><span class="section-number">6.15.9. </span>Vectorized Operations in PySpark
16561656
</div>
16571657
</div>
16581658
<p>Standard UDF functions process data row-by-row, resulting in Python function call overhead.</p>
1659-
<p>In contrast, pandas_udf utilizes Pandas’ vectorized operations to process entire columns in a single operation, significantly improving performance.</p>
1659+
<p>In contrast, pandas_udf uses Pandas’ vectorized operations to process entire columns in a single operation, significantly improving performance.</p>
16601660
<div class="cell docutils container">
16611661
<div class="cell_input docutils container">
16621662
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Sample DataFrame</span>
@@ -1698,9 +1698,6 @@ <h2><span class="section-number">6.15.9. </span>Vectorized Operations in PySpark
16981698
</div>
16991699
</div>
17001700
<div class="cell_output docutils container">
1701-
<div class="output stderr highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>
1702-
</pre></div>
1703-
</div>
17041701
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+----+----+
17051702
|val1|val2|
17061703
+----+----+
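
Note on the reworded paragraph in this file: its point is that a row-by-row UDF pays Python call overhead once per row, while a batch-oriented function is called once for many values. The sketch below is a Spark-free, plain-Python illustration of that difference in call counts only. It does not use PySpark or Pandas, the function names and data are made up, and the list comprehension stands in for what Pandas would do with vectorized operations.

```python
def add_one_per_row(value, stats):
    """Called once for every value, like a row-at-a-time UDF."""
    stats["calls"] += 1
    return value + 1


def add_one_per_batch(values, stats):
    """Called once for a whole batch of values, like a batch-oriented UDF."""
    stats["calls"] += 1
    return [v + 1 for v in values]


column = list(range(1000))  # made-up data for illustration
row_stats = {"calls": 0}
batch_stats = {"calls": 0}

row_result = [add_one_per_row(v, row_stats) for v in column]
batch_result = add_one_per_batch(column, batch_stats)

# Both approaches compute the same values; only the number of calls differs.
print(row_stats["calls"], batch_stats["calls"])  # 1000 1
```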

docs/_sources/Chapter5/natural_language_processing.ipynb

Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24329,6 +24329,75 @@
2432924329
"source": [
2433024330
"[Link to Galatic](https://github.com/taylorai/galactic)."
2433124331
]
24332+
},
24333+
{
24334+
"cell_type": "markdown",
24335+
"id": "b0da6a5a",
24336+
"metadata": {},
24337+
"source": [
24338+
"### Efficient Keyword Extraction and Replacement with FlashText"
24339+
]
24340+
},
24341+
{
24342+
"cell_type": "code",
24343+
"execution_count": null,
24344+
"id": "6ee867c1",
24345+
"metadata": {
24346+
"tags": [
24347+
"hide-cell"
24348+
]
24349+
},
24350+
"outputs": [],
24351+
"source": [
24352+
"!pip install flashtext"
24353+
]
24354+
},
24355+
{
24356+
"cell_type": "markdown",
24357+
"id": "611bb3c5",
24358+
"metadata": {},
24359+
"source": [
24360+
"If you want to perform fast keyword extraction and replacement in text, use FlashText. "
24361+
]
24362+
},
24363+
{
24364+
"cell_type": "code",
24365+
"execution_count": 6,
24366+
"id": "a52f3e89",
24367+
"metadata": {},
24368+
"outputs": [
24369+
{
24370+
"data": {
24371+
"text/plain": [
24372+
"'Python is essential for data science.'"
24373+
]
24374+
},
24375+
"execution_count": 6,
24376+
"metadata": {},
24377+
"output_type": "execute_result"
24378+
}
24379+
],
24380+
"source": [
24381+
"from flashtext import KeywordProcessor\n",
24382+
"\n",
24383+
"keyword_processor = KeywordProcessor()\n",
24384+
"\n",
24385+
"# Adding keywords with replacements\n",
24386+
"keyword_processor.add_keyword(keyword=\"Python\")\n",
24387+
"keyword_processor.add_keyword(keyword=\"DS\", clean_name=\"data science\")\n",
24388+
"\n",
24389+
"# Replacing keywords in text\n",
24390+
"new_sentence = keyword_processor.replace_keywords(\"PYTHON is essential for DS.\")\n",
24391+
"new_sentence"
24392+
]
24393+
},
24394+
{
24395+
"cell_type": "markdown",
24396+
"id": "0b85c2a7",
24397+
"metadata": {},
24398+
"source": [
24399+
"[Link to FlashText](https://bit.ly/4bQ1eqt)."
24400+
]
2433224401
}
2433324402
],
2433424403
"metadata": {

docs/_sources/Chapter5/spark.ipynb

Lines changed: 4 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1655,12 +1655,12 @@
16551655
"source": [
16561656
"Standard UDF functions process data row-by-row, resulting in Python function call overhead. \n",
16571657
"\n",
1658-
"In contrast, pandas_udf utilizes Pandas' vectorized operations to process entire columns in a single operation, significantly improving performance."
1658+
"In contrast, pandas_udf uses Pandas' vectorized operations to process entire columns in a single operation, significantly improving performance."
16591659
]
16601660
},
16611661
{
16621662
"cell_type": "code",
1663-
"execution_count": 3,
1663+
"execution_count": 2,
16641664
"id": "a4633f44",
16651665
"metadata": {},
16661666
"outputs": [
@@ -1697,17 +1697,10 @@
16971697
},
16981698
{
16991699
"cell_type": "code",
1700-
"execution_count": 4,
1700+
"execution_count": 3,
17011701
"id": "fcf0cdf9",
17021702
"metadata": {},
17031703
"outputs": [
1704-
{
1705-
"name": "stderr",
1706-
"output_type": "stream",
1707-
"text": [
1708-
" \r"
1709-
]
1710-
},
17111704
{
17121705
"name": "stdout",
17131706
"output_type": "stream",
@@ -1738,7 +1731,7 @@
17381731
},
17391732
{
17401733
"cell_type": "code",
1741-
"execution_count": 8,
1734+
"execution_count": 4,
17421735
"id": "e1ec8b2b",
17431736
"metadata": {},
17441737
"outputs": [

docs/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
