add FlashText
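
The notebook cell added in this commit uses FlashText's `replace_keywords`. As background, the general idea of matching many keywords in one pass over the text can be sketched in plain Python with a character trie. This is a simplified illustration under stated assumptions, not FlashText's actual implementation: `build_trie`, `replace_keywords`, `WORD_CHARS`, and the `END` marker are hypothetical names introduced here, and word boundaries are assumed to be any character that is not an ASCII letter, digit, or underscore.

```python
import string

# Assumption for this sketch: these characters form words; anything else is a boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
END = "_end_"  # hypothetical marker key that stores the replacement at a trie node


def build_trie(replacements):
    """Build a lowercase character trie from a {keyword: replacement} dict."""
    root = {}
    for keyword, clean_name in replacements.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = clean_name
    return root


def replace_keywords(text, trie):
    """Replace whole-word, case-insensitive keyword matches, preferring the longest."""
    out = []
    i, n = 0, len(text)
    while i < n:
        # Only start a match at the beginning of a word.
        at_word_start = i == 0 or text[i - 1] not in WORD_CHARS
        match_end, match_value = None, None
        if at_word_start:
            node, j = trie, i
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                # Accept only if the keyword ends on a word boundary.
                if END in node and (j == n or text[j] not in WORD_CHARS):
                    match_end, match_value = j, node[END]
        if match_end is not None:
            out.append(match_value)
            i = match_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


trie = build_trie({"Python": "Python", "DS": "data science"})
print(replace_keywords("PYTHON is essential for DS.", trie))
# Python is essential for data science.
```

Because each position in the text follows at most one path through the trie, the work per character does not grow with the number of registered keywords, which is the property that makes this approach attractive for large keyword lists.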

khuyentran1401 · khuyentran1401 · commit 153dec7d4e42 · 2024-06-09T20:28:32.000-05:00
diff --git a/Chapter5/natural_language_processing.ipynb b/Chapter5/natural_language_processing.ipynb
@@ -24329,6 +24329,75 @@
    "source": [
     "[Link to Galactic](https://github.com/taylorai/galactic)."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b0da6a5a",
+   "metadata": {},
+   "source": [
+    "### Efficient Keyword Extraction and Replacement with FlashText"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6ee867c1",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "!pip install flashtext"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "611bb3c5",
+   "metadata": {},
+   "source": [
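+    "To extract or replace many keywords in text quickly, use FlashText. It matches all registered keywords in a single pass over the text, so its speed depends little on how many keywords you add."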
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "a52f3e89",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Python is essential for data science.'"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from flashtext import KeywordProcessor\n",
+    "\n",
+    "keyword_processor = KeywordProcessor()\n",
+    "\n",
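+    "# Register keywords; matching is case-insensitive by default\n",
+    "# Without clean_name, a match is replaced with the keyword itself\n",
+    "keyword_processor.add_keyword(keyword=\"Python\")\n",
+    "# With clean_name, a match is replaced with clean_name\n",
+    "keyword_processor.add_keyword(keyword=\"DS\", clean_name=\"data science\")\n",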
+    "\n",
+    "# Replacing keywords in text\n",
+    "new_sentence = keyword_processor.replace_keywords(\"PYTHON is essential for DS.\")\n",
+    "new_sentence"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0b85c2a7",
+   "metadata": {},
+   "source": [
+    "[Link to FlashText](https://bit.ly/4bQ1eqt)."
+   ]
   }
  ],
  "metadata": {
diff --git a/Chapter5/spark.ipynb b/Chapter5/spark.ipynb
@@ -1655,12 +1655,12 @@
    "source": [
     "Standard UDF functions process data row-by-row, resulting in Python function call overhead. \n",
     "\n",
-    "In contrast, pandas_udf utilizes Pandas' vectorized operations to process entire columns in a single operation, significantly improving performance."
+    "In contrast, pandas_udf uses Pandas' vectorized operations to process entire columns in a single operation, significantly improving performance."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 2,
    "id": "a4633f44",
    "metadata": {},
    "outputs": [
@@ -1697,17 +1697,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
    "id": "fcf0cdf9",
    "metadata": {},
    "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "                                                                                \r"
-     ]
-    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -1738,7 +1731,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 4,
    "id": "e1ec8b2b",
    "metadata": {},
    "outputs": [