---
title: Inference overview
summary: This article describes inference workloads.
authors:
  - Jason Novich
date: 2024-Mar-29
---

## What is Inference

Machine learning (ML) inference is the process of running live data points through a machine-learning algorithm to calculate an output.

With Inference, you take a trained *Model* and deploy it into a production environment. The deployment must align with the organization's production standards, such as average and 95th-percentile response time, as well as up-time.

## Inference and GPUs

The inference process is a subset of the original training algorithm, run on a single datum (for example, one sentence or one image) or a small batch. As such, GPU memory requirements are typically smaller than those of a full-blown training process.

Given that, inference lends itself nicely to the use of Run:ai Fractions. You can, for example, run 4 instances of an inference server on a single GPU, each using a fourth of the GPU memory, as in the sketch below.
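
As a rough sketch of the idea, the manifest below runs 4 replicas of an inference server, each limited to a quarter of a GPU via a fractional-GPU pod annotation. The workload name, image, port, `gpu-fraction` annotation, and `runai-scheduler` scheduler name are illustrative assumptions, not the exact Run:ai interface; in practice you would submit the workload through the Run:ai UI, CLI, or API.

```yaml
# Illustrative only: 4 inference replicas, each requesting 0.25 of a GPU.
# The annotation and scheduler names are assumptions; check your Run:ai
# version's documentation for the exact fractional-GPU mechanism.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-inference                # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: demo-inference
  template:
    metadata:
      labels:
        app: demo-inference
      annotations:
        gpu-fraction: "0.25"          # assumed Run:ai fractional-GPU annotation
    spec:
      schedulerName: runai-scheduler  # assumed Run:ai scheduler name
      containers:
        - name: server
          image: example.com/inference-server:latest   # placeholder image
          ports:
            - containerPort: 8080
```

Note that the fraction only bounds each replica's share of GPU memory; whether all four replicas are packed onto the same physical GPU depends on the scheduler's placement decisions.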

## Inference @Run:ai

Run:ai provides Inference services on an equal footing with any other available Workload type.

* Inference is considered a high-priority workload as it is customer-facing. Running an Inference workload (within the Project's quota) will preempt any Run:ai Workload marked as *Training*.

* Inference workloads receive priority over *Train* and *Build* workloads during scheduling.

* Inference is implemented as a Kubernetes *Deployment* object with a defined number of replicas. The replicas are load-balanced by Kubernetes, so adding more replicas improves the overall throughput of the system.

* Multiple replicas appear in Run:ai as a single Inference workload. The workload appears in all Run:ai dashboards and views, as well as in the command-line interface.

* Inference workloads can be submitted via the Run:ai [user interface](../admin-ui-setup/deployments.md) as well as the [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is the end-point to which clients connect (see the sketch after this list).
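
To illustrate the *Service* end-point mentioned in the last bullet, here is a minimal sketch of the kind of object that gets created, assuming the hypothetical `demo-inference` Deployment from the earlier example. Run:ai creates the real Service for you when the workload is spawned; the name and ports below are placeholders.

```yaml
# Illustrative only: Run:ai creates an equivalent Service automatically when
# an Inference workload is submitted; this shows what such an end-point looks like.
apiVersion: v1
kind: Service
metadata:
  name: demo-inference        # hypothetical name
spec:
  selector:
    app: demo-inference       # matches the Deployment's pod labels
  ports:
    - port: 80                # port clients connect to
      targetPort: 8080        # container port of the inference server
```

Clients inside the cluster can then reach the model at `http://demo-inference`, with Kubernetes load-balancing requests across the replicas.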

## Autoscaling

To meet its SLA, an Inference workload is typically configured with *autoscaling*. Autoscaling is the ability to add more computing power (Kubernetes pods) when the load increases and to shrink the allocated resources when the system is idle.

There are a number of ways to trigger autoscaling. Run:ai supports the following:

| Metric      | Units           | Run:ai name |
|-------------|-----------------|-------------|
| Throughput  | requests/second | throughput  |
| Concurrency |                 | concurrency |

The minimum and maximum number of replicas can be configured as part of the autoscaling configuration.

Autoscaling also supports a scale-to-zero policy with the *Throughput* and *Concurrency* metrics, meaning that, given enough time under the target threshold, the number of replicas is scaled down to 0.
This has the benefit of conserving resources, at the risk of a delay caused by "cold starting" the model when traffic resumes. A schematic example follows.
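
As a purely schematic example of such a configuration, the sketch below uses Knative Serving's autoscaling annotations, which express the same concepts: a metric, a per-replica target, minimum and maximum replicas, and scale to zero when the minimum is 0. Whether your Run:ai version runs on Knative underneath, and the exact knobs it exposes, are assumptions here; in practice these settings are configured through the Run:ai UI or API.

```yaml
# Schematic example using Knative Serving annotations; not the Run:ai API itself.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: demo-inference                                  # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"   # or "rps" for throughput
        autoscaling.knative.dev/target: "10"            # target value per replica
        autoscaling.knative.dev/min-scale: "0"          # 0 enables scale to zero
        autoscaling.knative.dev/max-scale: "8"          # upper bound on replicas
    spec:
      containers:
        - image: example.com/inference-server:latest    # placeholder image
```

With `min-scale: "0"`, replicas are removed after a sustained period without traffic, which is the source of the "cold start" delay noted above.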

## See Also