CodeCutTech
diff --git a/‎Chapter5/machine_learning.ipynb
Lines changed: 257 additions & 1 deletion b/‎Chapter5/machine_learning.ipynb
diff --git a/‎Chapter5/mlflow_sklearn_load_data.py
Lines changed: 9 additions & 0 deletions b/‎Chapter5/mlflow_sklearn_load_data.py
diff --git a/‎Chapter5/mlflow_sklearn_log_data.py
Lines changed: 17 additions & 0 deletions b/‎Chapter5/mlflow_sklearn_log_data.py
@@ -2570,6 +2570,262 @@
    "source": [
     "[Link to AutoGluon](https://bit.ly/45ljoOd)."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d8376b0e",
+   "metadata": {},
+   "source": [
+    "### Model Logging Made Easy: MLflow vs. Pickle"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here is why logging a model with MLflow has advantages over saving it with pickle:\n",
+    "\n",
+    "1. Managing Library Versions:\n",
+    "- Problem: Different models may require different versions of the same library, which can lead to conflicts. Manually tracking and setting up the correct environment for each model is time-consuming and error-prone.\n",
+    "- Solution: MLflow records the model's dependencies automatically (in `requirements.txt`, `conda.yaml`, and `python_env.yaml`), so others can recreate the environment needed to run the model.\n",
+    "\n",
+    "2. Documenting Inputs and Outputs:\n",
+    "- Problem: Often, the expected inputs and outputs of a model are not well-documented, making it difficult for others to use the model correctly.\n",
+    "- Solution: MLflow stores a signature (a schema for inputs and outputs) with the model, so anyone using it can see what data to provide and what to expect in return.\n",
+    "\n",
+    "To demonstrate the advantages of MLflow, let’s implement a simple logistic regression model and log it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "644b25c0",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Saving data to runs:/f8b0fc900aa14cf0ade8d0165c5a9f11/model\n"
+     ]
+    }
+   ],
+   "source": [
+    "import mlflow\n",
+    "from mlflow.models import infer_signature\n",
+    "import numpy as np\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "\n",
+    "with mlflow.start_run():\n",
+    "    X = np.array([-2, -1, 0, 1, 2, 1]).reshape(-1, 1)\n",
+    "    y = np.array([0, 0, 1, 1, 1, 0])\n",
+    "    lr = LogisticRegression()\n",
+    "    lr.fit(X, y)\n",
+    "    signature = infer_signature(X, lr.predict(X))\n",
+    "\n",
+    "    model_info = mlflow.sklearn.log_model(\n",
+    "        sk_model=lr, artifact_path=\"model\", signature=signature\n",
+    "    )\n",
+    "\n",
+    "    print(f\"Saving data to {model_info.model_uri}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28c9ff92",
+   "metadata": {},
+   "source": [
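+    "The output shows the URI where the model was logged. To use the model later, load it with that `model_uri`. The run ID in the URI is specific to each run (the cell below uses an ID from a different run than the output above), so substitute the ID printed on your machine:"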
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "f88b0415",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import mlflow\n",
+    "import numpy as np\n",
+    "\n",
+    "model_uri = \"runs:/1e20d72afccf450faa3b8a9806a97e83/model\"\n",
+    "sklearn_pyfunc = mlflow.pyfunc.load_model(model_uri=model_uri)\n",
+    "\n",
+    "data = np.array([-4, 1, 0, 10, -2, 1]).reshape(-1, 1)\n",
+    "\n",
+    "predictions = sklearn_pyfunc.predict(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58ebe221",
+   "metadata": {},
+   "source": [
+    "Let's inspect the artifacts saved with the model:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "8acda9d6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/Users/khuyentran/book/Efficient_Python_tricks_and_tools_for_data_scientists/Chapter5/mlruns/0/1e20d72afccf450faa3b8a9806a97e83/artifacts/model\n",
+      "MLmodel           model.pkl         requirements.txt\n",
+      "conda.yaml        python_env.yaml\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd mlruns/0/1e20d72afccf450faa3b8a9806a97e83/artifacts/model\n",
+    "%ls"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ced58a8d",
+   "metadata": {},
+   "source": [
+    "The `MLmodel` file describes the model: the flavors it can be loaded with, the environment files that list its dependencies, the library versions used, and the input/output signature:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "dc30e383",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "artifact_path: model\n",
+      "flavors:\n",
+      "  python_function:\n",
+      "    env:\n",
+      "      conda: conda.yaml\n",
+      "      virtualenv: python_env.yaml\n",
+      "    loader_module: mlflow.sklearn\n",
+      "    model_path: model.pkl\n",
+      "    predict_fn: predict\n",
+      "    python_version: 3.11.6\n",
+      "  sklearn:\n",
+      "    code: null\n",
+      "    pickled_model: model.pkl\n",
+      "    serialization_format: cloudpickle\n",
+      "    sklearn_version: 1.4.1.post1\n",
+      "mlflow_version: 2.15.0\n",
+      "model_size_bytes: 722\n",
+      "model_uuid: e7487bc3c4ab417c965144efcecaca2f\n",
+      "run_id: 1e20d72afccf450faa3b8a9806a97e83\n",
+      "signature:\n",
+      "  inputs: '[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int64\", \"shape\": [-1, 1]}}]'\n",
+      "  outputs: '[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int64\", \"shape\": [-1]}}]'\n",
+      "  params: null\n",
+      "utc_time_created: '2024-08-02 20:58:16.516963'\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cat MLmodel"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "01ef7c12",
+   "metadata": {},
+   "source": [
+    "The `conda.yaml`, `python_env.yaml`, and `requirements.txt` files pin the environment dependencies so that the model can be run in a consistent setup:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "1dce0181",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "channels:\n",
+      "- conda-forge\n",
+      "dependencies:\n",
+      "- python=3.11.6\n",
+      "- pip<=24.2\n",
+      "- pip:\n",
+      "  - mlflow==2.15.0\n",
+      "  - cloudpickle==2.2.1\n",
+      "  - numpy==1.23.5\n",
+      "  - psutil==5.9.6\n",
+      "  - scikit-learn==1.4.1.post1\n",
+      "  - scipy==1.11.3\n",
+      "name: mlflow-env\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cat conda.yaml"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "16c2d3bc",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "python: 3.11.6\n",
+      "build_dependencies:\n",
+      "- pip==24.2\n",
+      "- setuptools\n",
+      "- wheel==0.40.0\n",
+      "dependencies:\n",
+      "- -r requirements.txt\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cat python_env.yaml"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "b16b2916",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "mlflow==2.15.0\n",
+      "cloudpickle==2.2.1\n",
+      "numpy==1.23.5\n",
+      "psutil==5.9.6\n",
+      "scikit-learn==1.4.1.post1\n",
+      "scipy==1.11.3"
+     ]
+    }
+   ],
+   "source": [
+    "%cat requirements.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00f7bbf0",
+   "metadata": {},
+   "source": [
+    "[Learn more about MLflow Models](https://bit.ly/46y6gpF)."
+   ]
   }
  ],
  "metadata": {
@@ -2590,7 +2846,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.4"
+   "version": "3.11.6"
   },
   "toc": {
    "base_numbering": 1,
 
@@ -0,0 +1,9 @@
+import mlflow
+import numpy as np
+
+model_uri = "runs:/1e20d72afccf450faa3b8a9806a97e83/model"
+sklearn_pyfunc = mlflow.pyfunc.load_model(model_uri=model_uri)
+
+data = np.array([-4, 1, 0, 10, -2, 1]).reshape(-1, 1)
+
+predictions = sklearn_pyfunc.predict(data)
@@ -0,0 +1,17 @@
+import mlflow
+from mlflow.models import infer_signature
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+with mlflow.start_run():
+    X = np.array([-2, -1, 0, 1, 2, 1]).reshape(-1, 1)
+    y = np.array([0, 0, 1, 1, 1, 0])
+    lr = LogisticRegression()
+    lr.fit(X, y)
+    signature = infer_signature(X, lr.predict(X))
+
+    model_info = mlflow.sklearn.log_model(
+        sk_model=lr, artifact_path="model", signature=signature
+    )
+
+    print(f"Saving data to {model_info.model_uri}")
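
The signature recorded in the `MLmodel` output of the notebook can be read with nothing more than the standard library. The sketch below copies the two signature strings from that output and checks candidate array shapes against them, treating `-1` as "any size". The helpers `tensor_shapes` and `shape_matches` are hypothetical names introduced for illustration; this is one way to interpret the schema by hand, not how MLflow itself enforces it.

```python
import json

# Signature strings copied from the MLmodel output shown in the notebook.
INPUTS = '[{"type": "tensor", "tensor-spec": {"dtype": "int64", "shape": [-1, 1]}}]'
OUTPUTS = '[{"type": "tensor", "tensor-spec": {"dtype": "int64", "shape": [-1]}}]'


def tensor_shapes(signature_json):
    """Return the shapes declared in a tensor-based signature string."""
    return [tuple(col["tensor-spec"]["shape"]) for col in json.loads(signature_json)]


def shape_matches(declared, actual):
    """Hypothetical helper: a -1 in the declared shape matches any size."""
    if len(declared) != len(actual):
        return False
    return all(d == -1 or d == a for d, a in zip(declared, actual))


input_shape = tensor_shapes(INPUTS)[0]    # (-1, 1): any number of rows, one feature
output_shape = tensor_shapes(OUTPUTS)[0]  # (-1,): one prediction per row

print(shape_matches(input_shape, (6, 1)))  # True: six rows, one feature column
print(shape_matches(input_shape, (6,)))    # False: flat array lacks the feature axis
```

This is why both scripts call `.reshape(-1, 1)` on their data: the logged input schema expects a two-dimensional array with a single feature column.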