Commit 153dec7

add FlashText
1 parent aa99f96 commit 153dec7

2 files changed

+73
-11
lines changed

Chapter5/natural_language_processing.ipynb

Lines changed: 69 additions & 0 deletions
@@ -24329,6 +24329,75 @@
  "source": [
  "[Link to Galatic](https://github.com/taylorai/galactic)."
  ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b0da6a5a",
+ "metadata": {},
+ "source": [
+ "### Efficient Keyword Extraction and Replacement with FlashText"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6ee867c1",
+ "metadata": {
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "!pip install flashtext"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "611bb3c5",
+ "metadata": {},
+ "source": [
+ "If you want to perform fast keyword extraction and replacement in text, use FlashText. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "a52f3e89",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Python is essential for data science.'"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from flashtext import KeywordProcessor\n",
+ "\n",
+ "keyword_processor = KeywordProcessor()\n",
+ "\n",
+ "# Adding keywords with replacements\n",
+ "keyword_processor.add_keyword(keyword=\"Python\")\n",
+ "keyword_processor.add_keyword(keyword=\"DS\", clean_name=\"data science\")\n",
+ "\n",
+ "# Replacing keywords in text\n",
+ "new_sentence = keyword_processor.replace_keywords(\"PYTHON is essential for DS.\")\n",
+ "new_sentence"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0b85c2a7",
+ "metadata": {},
+ "source": [
+ "[Link to FlashText](https://bit.ly/4bQ1eqt)."
+ ]
  }
  ],
  "metadata": {
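
Note on the added FlashText cell: the output shows case-insensitive, whole-word replacement ("PYTHON" becomes "Python", "DS" becomes "data science"). As a rough illustration of how a trie-based replacer of this kind can behave, here is a minimal pure-Python sketch. It is not FlashText's implementation; `build_trie`, `replace_keywords`, `WORD_CHARS`, and `_END` are hypothetical names introduced only for this example, and a word boundary is assumed to be any character outside ASCII letters, digits, and underscore.

```python
import string

# Assumption: characters that count as "inside a word" for boundary checks.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
# Hypothetical marker key; it is longer than one character, so it can
# never collide with the single-character keys used for trie edges.
_END = "_end_"


def build_trie(keywords):
    """Build a lowercase character trie from {keyword: replacement}."""
    trie = {}
    for keyword, clean_name in keywords.items():
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[_END] = clean_name
    return trie


def replace_keywords(text, trie):
    """Replace whole-word, case-insensitive keyword matches in one pass."""
    out = []
    i, n = 0, len(text)
    while i < n:
        at_word_start = i == 0 or text[i - 1] not in WORD_CHARS
        match_end, match_name = None, None
        if at_word_start:
            node, j = trie, i
            # Walk the trie as far as the text allows, remembering the
            # longest keyword that ends on a word boundary.
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                if _END in node and (j == n or text[j] not in WORD_CHARS):
                    match_end, match_name = j, node[_END]
        if match_end is not None:
            out.append(match_name)
            i = match_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


trie = build_trie({"Python": "Python", "DS": "data science"})
result = replace_keywords("PYTHON is essential for DS.", trie)
print(result)
```

With the same two keywords as the notebook cell, this sketch produces the string shown in the cell's output, and it leaves a longer word such as "DSL" untouched because the match must end on a word boundary.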

Chapter5/spark.ipynb

Lines changed: 4 additions & 11 deletions
@@ -1655,12 +1655,12 @@
  "source": [
  "Standard UDF functions process data row-by-row, resulting in Python function call overhead. \n",
  "\n",
- "In contrast, pandas_udf utilizes Pandas' vectorized operations to process entire columns in a single operation, significantly improving performance."
+ "In contrast, pandas_udf uses Pandas' vectorized operations to process entire columns in a single operation, significantly improving performance."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 2,
  "id": "a4633f44",
  "metadata": {},
  "outputs": [
@@ -1697,17 +1697,10 @@
  },
  {
  "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 3,
  "id": "fcf0cdf9",
  "metadata": {},
  "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " \r"
- ]
- },
  {
  "name": "stdout",
  "output_type": "stream",
@@ -1738,7 +1731,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 8,
+ "execution_count": 4,
  "id": "e1ec8b2b",
  "metadata": {},
  "outputs": [
