|
2570 | 2570 | "source": [
|
2571 | 2571 | "[Link to AutoGluon](https://bit.ly/45ljoOd)."
|
2572 | 2572 | ]
|
| 2573 | + }, |
| 2574 | + { |
| 2575 | + "cell_type": "markdown", |
| 2576 | + "id": "d8376b0e", |
| 2577 | + "metadata": {}, |
| 2578 | + "source": [ |
| 2579 | + "### Model Logging Made Easy: MLflow vs. Pickle" |
| 2580 | + ] |
| 2581 | + }, |
| 2582 | + { |
| 2583 | + "cell_type": "markdown", |
| 2584 | + "metadata": {}, |
| 2585 | + "source": [ |
| 2586 | + "Here is why logging a model with MLflow is superior to saving it with pickle:\n", |
| 2587 | + "\n", |
| 2588 | + "1. Managing Library Versions:\n", |
| 2589 | + "- Problem: Different models may require different versions of the same library, which can lead to conflicts. Manually tracking and setting up the correct environment for each model is time-consuming and error-prone.\n", |
| 2590 | + "- Solution: MLflow automatically records the model's dependencies in `requirements.txt`, `conda.yaml`, and `python_env.yaml`, so anyone can recreate the environment needed to run the model.\n", |
| 2591 | + "\n", |
| 2592 | + "2. Documenting Inputs and Outputs:\n", |
| 2593 | + "- Problem: Often, the expected inputs and outputs of a model are not well-documented, making it difficult for others to use the model correctly.\n", |
| 2594 | + "- Solution: MLflow stores a model signature that defines the schema of the inputs and outputs, so anyone using the model knows what data to provide and what to expect in return.\n", |
| 2595 | + "\n", |
| 2596 | + "To demonstrate the advantages of MLflow, let’s implement a simple logistic regression model and log it." |
| 2597 | + ] |
| 2598 | + }, |
| 2599 | + { |
| 2600 | + "cell_type": "code", |
| 2601 | + "execution_count": 20, |
| 2602 | + "id": "644b25c0", |
| 2603 | + "metadata": {}, |
| 2604 | + "outputs": [ |
| 2605 | + { |
| 2606 | + "name": "stdout", |
| 2607 | + "output_type": "stream", |
| 2608 | + "text": [ |
| 2609 | + "Saving data to runs:/f8b0fc900aa14cf0ade8d0165c5a9f11/model\n" |
| 2610 | + ] |
| 2611 | + } |
| 2612 | + ], |
| 2613 | + "source": [ |
| 2614 | + "import mlflow\n", |
| 2615 | + "from mlflow.models import infer_signature\n", |
| 2616 | + "import numpy as np\n", |
| 2617 | + "from sklearn.linear_model import LogisticRegression\n", |
| 2618 | + "\n", |
| 2619 | + "with mlflow.start_run():\n", |
| 2620 | + " X = np.array([-2, -1, 0, 1, 2, 1]).reshape(-1, 1)\n", |
| 2621 | + " y = np.array([0, 0, 1, 1, 1, 0])\n", |
| 2622 | + " lr = LogisticRegression()\n", |
| 2623 | + " lr.fit(X, y)\n", |
| 2624 | + " signature = infer_signature(X, lr.predict(X))\n", |
| 2625 | + "\n", |
| 2626 | + " model_info = mlflow.sklearn.log_model(\n", |
| 2627 | + " sk_model=lr, artifact_path=\"model\", signature=signature\n", |
| 2628 | + " )\n", |
| 2629 | + "\n", |
| 2630 | + " print(f\"Saving data to {model_info.model_uri}\")" |
| 2631 | + ] |
| 2632 | + }, |
| 2633 | + { |
| 2634 | + "cell_type": "markdown", |
| 2635 | + "id": "28c9ff92", |
| 2636 | + "metadata": {}, |
| 2637 | + "source": [ |
| 2638 | + "The output indicates where the model has been saved. To use the logged model later, you can load it with the `model_uri`:" |
| 2639 | + ] |
| 2640 | + }, |
| 2641 | + { |
| 2642 | + "cell_type": "code", |
| 2643 | + "execution_count": 21, |
| 2644 | + "id": "f88b0415", |
| 2645 | + "metadata": {}, |
| 2646 | + "outputs": [], |
| 2647 | + "source": [ |
| 2648 | + "import mlflow\n", |
| 2649 | + "import numpy as np\n", |
| 2650 | + "\n", |
| 2651 | + "model_uri = \"runs:/1e20d72afccf450faa3b8a9806a97e83/model\"  # replace with the URI from your own run\n", |
| 2652 | + "sklearn_pyfunc = mlflow.pyfunc.load_model(model_uri=model_uri)\n", |
| 2653 | + "\n", |
| 2654 | + "data = np.array([-4, 1, 0, 10, -2, 1]).reshape(-1, 1)\n", |
| 2655 | + "\n", |
| 2656 | + "predictions = sklearn_pyfunc.predict(data)" |
| 2657 | + ] |
| 2658 | + }, |
| 2659 | + { |
| 2660 | + "cell_type": "markdown", |
| 2661 | + "id": "58ebe221", |
| 2662 | + "metadata": {}, |
| 2663 | + "source": [ |
| 2664 | + "Let's inspect the artifacts saved with the model:" |
| 2665 | + ] |
| 2666 | + }, |
| 2667 | + { |
| 2668 | + "cell_type": "code", |
| 2669 | + "execution_count": 22, |
| 2670 | + "id": "8acda9d6", |
| 2671 | + "metadata": {}, |
| 2672 | + "outputs": [ |
| 2673 | + { |
| 2674 | + "name": "stdout", |
| 2675 | + "output_type": "stream", |
| 2676 | + "text": [ |
| 2677 | + "/Users/khuyentran/book/Efficient_Python_tricks_and_tools_for_data_scientists/Chapter5/mlruns/0/1e20d72afccf450faa3b8a9806a97e83/artifacts/model\n", |
| 2678 | + "MLmodel model.pkl requirements.txt\n", |
| 2679 | + "conda.yaml python_env.yaml\n" |
| 2680 | + ] |
| 2681 | + } |
| 2682 | + ], |
| 2683 | + "source": [ |
| 2684 | + "%cd mlruns/0/1e20d72afccf450faa3b8a9806a97e83/artifacts/model\n", |
| 2685 | + "%ls" |
| 2686 | + ] |
| 2687 | + }, |
| 2688 | + { |
| 2689 | + "cell_type": "markdown", |
| 2690 | + "id": "ced58a8d", |
| 2691 | + "metadata": {}, |
| 2692 | + "source": [ |
| 2693 | + "The `MLmodel` file provides essential information about the model, including dependencies and input/output specifications:" |
| 2694 | + ] |
| 2695 | + }, |
| 2696 | + { |
| 2697 | + "cell_type": "code", |
| 2698 | + "execution_count": 23, |
| 2699 | + "id": "dc30e383", |
| 2700 | + "metadata": {}, |
| 2701 | + "outputs": [ |
| 2702 | + { |
| 2703 | + "name": "stdout", |
| 2704 | + "output_type": "stream", |
| 2705 | + "text": [ |
| 2706 | + "artifact_path: model\n", |
| 2707 | + "flavors:\n", |
| 2708 | + " python_function:\n", |
| 2709 | + " env:\n", |
| 2710 | + " conda: conda.yaml\n", |
| 2711 | + " virtualenv: python_env.yaml\n", |
| 2712 | + " loader_module: mlflow.sklearn\n", |
| 2713 | + " model_path: model.pkl\n", |
| 2714 | + " predict_fn: predict\n", |
| 2715 | + " python_version: 3.11.6\n", |
| 2716 | + " sklearn:\n", |
| 2717 | + " code: null\n", |
| 2718 | + " pickled_model: model.pkl\n", |
| 2719 | + " serialization_format: cloudpickle\n", |
| 2720 | + " sklearn_version: 1.4.1.post1\n", |
| 2721 | + "mlflow_version: 2.15.0\n", |
| 2722 | + "model_size_bytes: 722\n", |
| 2723 | + "model_uuid: e7487bc3c4ab417c965144efcecaca2f\n", |
| 2724 | + "run_id: 1e20d72afccf450faa3b8a9806a97e83\n", |
| 2725 | + "signature:\n", |
| 2726 | + " inputs: '[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int64\", \"shape\": [-1, 1]}}]'\n", |
| 2727 | + " outputs: '[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int64\", \"shape\": [-1]}}]'\n", |
| 2728 | + " params: null\n", |
| 2729 | + "utc_time_created: '2024-08-02 20:58:16.516963'\n" |
| 2730 | + ] |
| 2731 | + } |
| 2732 | + ], |
| 2733 | + "source": [ |
| 2734 | + "%cat MLmodel" |
| 2735 | + ] |
| 2736 | + }, |
| 2737 | + { |
| 2738 | + "cell_type": "markdown", |
| 2739 | + "id": "01ef7c12", |
| 2740 | + "metadata": {}, |
| 2741 | + "source": [ |
| 2742 | + "The `conda.yaml`, `python_env.yaml`, and `requirements.txt` files list the environment dependencies, so the model can run in a consistent setup:" |
| 2743 | + ] |
| 2744 | + }, |
| 2745 | + { |
| 2746 | + "cell_type": "code", |
| 2747 | + "execution_count": 24, |
| 2748 | + "id": "1dce0181", |
| 2749 | + "metadata": {}, |
| 2750 | + "outputs": [ |
| 2751 | + { |
| 2752 | + "name": "stdout", |
| 2753 | + "output_type": "stream", |
| 2754 | + "text": [ |
| 2755 | + "channels:\n", |
| 2756 | + "- conda-forge\n", |
| 2757 | + "dependencies:\n", |
| 2758 | + "- python=3.11.6\n", |
| 2759 | + "- pip<=24.2\n", |
| 2760 | + "- pip:\n", |
| 2761 | + " - mlflow==2.15.0\n", |
| 2762 | + " - cloudpickle==2.2.1\n", |
| 2763 | + " - numpy==1.23.5\n", |
| 2764 | + " - psutil==5.9.6\n", |
| 2765 | + " - scikit-learn==1.4.1.post1\n", |
| 2766 | + " - scipy==1.11.3\n", |
| 2767 | + "name: mlflow-env\n" |
| 2768 | + ] |
| 2769 | + } |
| 2770 | + ], |
| 2771 | + "source": [ |
| 2772 | + "%cat conda.yaml" |
| 2773 | + ] |
| 2774 | + }, |
| 2775 | + { |
| 2776 | + "cell_type": "code", |
| 2777 | + "execution_count": 25, |
| 2778 | + "id": "16c2d3bc", |
| 2779 | + "metadata": {}, |
| 2780 | + "outputs": [ |
| 2781 | + { |
| 2782 | + "name": "stdout", |
| 2783 | + "output_type": "stream", |
| 2784 | + "text": [ |
| 2785 | + "python: 3.11.6\n", |
| 2786 | + "build_dependencies:\n", |
| 2787 | + "- pip==24.2\n", |
| 2788 | + "- setuptools\n", |
| 2789 | + "- wheel==0.40.0\n", |
| 2790 | + "dependencies:\n", |
| 2791 | + "- -r requirements.txt\n" |
| 2792 | + ] |
| 2793 | + } |
| 2794 | + ], |
| 2795 | + "source": [ |
| 2796 | + "%cat python_env.yaml\n" |
| 2797 | + ] |
| 2798 | + }, |
| 2799 | + { |
| 2800 | + "cell_type": "code", |
| 2801 | + "execution_count": 26, |
| 2802 | + "id": "b16b2916", |
| 2803 | + "metadata": {}, |
| 2804 | + "outputs": [ |
| 2805 | + { |
| 2806 | + "name": "stdout", |
| 2807 | + "output_type": "stream", |
| 2808 | + "text": [ |
| 2809 | + "mlflow==2.15.0\n", |
| 2810 | + "cloudpickle==2.2.1\n", |
| 2811 | + "numpy==1.23.5\n", |
| 2812 | + "psutil==5.9.6\n", |
| 2813 | + "scikit-learn==1.4.1.post1\n", |
| 2814 | + "scipy==1.11.3" |
| 2815 | + ] |
| 2816 | + } |
| 2817 | + ], |
| 2818 | + "source": [ |
| 2819 | + "%cat requirements.txt" |
| 2820 | + ] |
| 2821 | + }, |
| 2822 | + { |
| 2823 | + "cell_type": "markdown", |
| 2824 | + "id": "00f7bbf0", |
| 2825 | + "metadata": {}, |
| 2826 | + "source": [ |
| 2827 | + "[Learn more about MLflow Models](https://bit.ly/46y6gpF)." |
| 2828 | + ] |
2573 | 2829 | }
|
2574 | 2830 | ],
|
2575 | 2831 | "metadata": {
|
|
2590 | 2846 | "name": "python",
|
2591 | 2847 | "nbconvert_exporter": "python",
|
2592 | 2848 | "pygments_lexer": "ipython3",
|
2593 |
| - "version": "3.11.4" |
| 2849 | + "version": "3.11.6" |
2594 | 2850 | },
|
2595 | 2851 | "toc": {
|
2596 | 2852 | "base_numbering": 1,
|
|
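For contrast with the MLflow cells above, here is a minimal sketch of the bookkeeping you would have to do by hand when saving a model with pickle alone. It is an illustration under stated assumptions, not part of the notebook: the `save_with_pickle` helper, the `metadata.json` sidecar file, and the parameter values are all hypothetical, and a plain dictionary stands in for a trained model.

```python
import json
import pickle
import platform
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model: a dictionary of parameters.
model_params = {"coef": [0.8], "intercept": -0.1}


def save_with_pickle(model, directory, input_spec, output_spec):
    """Pickle the model and write a hand-maintained metadata sidecar.

    pickle stores only the object itself, so the Python version and the
    input/output description have to be recorded separately by hand.
    """
    directory.mkdir(parents=True, exist_ok=True)
    model_path = directory / "model.pkl"
    with model_path.open("wb") as f:
        pickle.dump(model, f)

    metadata = {
        "python_version": platform.python_version(),
        "inputs": input_spec,
        "outputs": output_spec,
    }
    (directory / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return model_path


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model_path = save_with_pickle(
        model_params,
        model_dir,
        input_spec="int64 array of shape (-1, 1)",
        output_spec="int64 array of shape (-1,)",
    )

    # Loading the pickle returns the object, but nothing about its environment.
    with model_path.open("rb") as f:
        restored = pickle.load(f)

    saved_files = sorted(p.name for p in model_dir.iterdir())
    metadata = json.loads((model_dir / "metadata.json").read_text())
```

Everything in `metadata.json` here is written manually and can drift out of date, whereas the `MLmodel`, `conda.yaml`, `python_env.yaml`, and `requirements.txt` files shown in the notebook are generated when the model is logged.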