
Commit 1fb9bc8

Merge pull request #731 from jasonnovichRunAI/v2.17-RUN-16796-Omit-unsupported-knative-metrics
RUN-16796 remove unsupported metrics in autoscaling
2 parents 653defa + ccc5e06 commit 1fb9bc8

File tree

3 files changed (+52, -51 lines)

docs/Researcher/Walkthroughs/quickstart-inference.md

Lines changed: 13 additions & 16 deletions
@@ -2,44 +2,43 @@
 
 ## Introduction
 
-Machine learning (ML) inference is the process of running live data points into a machine-learning algorithm to calculate an output.
+Machine learning (ML) inference is the process of running live data points into a machine-learning algorithm to calculate an output.
 
-With Inference, you are taking a trained _Model_ and deploying it into a production environment. The deployment must align with the organization's production standards such as average and 95% response time as well as up-time.
+With Inference, you are taking a trained *Model* and deploying it into a production environment. The deployment must align with the organization's production standards such as average and 95% response time as well as up-time.
 
-## Prerequisites
+## Prerequisites
 
 To complete this Quickstart you must have:
 
-* Run:ai software installed on your Kubernetes cluster. See: [Installing Run:ai on a Kubernetes Cluster](../../admin/runai-setup/installation-types.md). There are additional prerequisites for running inference. See [cluster installation prerequisites](../../admin/runai-setup/cluster-setup/cluster-prerequisites.md#inference) for more information.
+* Run:ai software installed on your Kubernetes cluster. See: [Installing Run:ai on a Kubernetes Cluster](../../admin/runai-setup/installation-types.md). There are additional prerequisites for running inference. See [cluster installation prerequisites](../../admin/runai-setup/cluster-setup/cluster-prerequisites.md#inference) for more information.
 * Run:ai CLI installed on your machine. See: [Installing the Run:ai Command-Line Interface](../../admin/researcher-setup/cli-install.md)
-* You must have _ML Engineer_ access rights. See [Adding, Updating and Deleting Users](../../admin/admin-ui-setup/admin-ui-users.md) for more information.
+* You must have *ML Engineer* access rights. See [Adding, Updating and Deleting Users](../../admin/admin-ui-setup/admin-ui-users.md) for more information.
 
 ## Step by Step Walkthrough
 
 ### Setup
 
-* Login to the Projects area of the Run:ai user interface.
-* Add a Project named "team-a".
-* Allocate 2 GPUs to the Project.
+* Login to the Projects area of the Run:ai user interface.
+* Add a Project named "team-a".
+* Allocate 2 GPUs to the Project.
 
-### Run an Inference Workload
+### Run an Inference Workload
 
-* In the Run:ai user interface go to `Deployments`. If you do not see the `Deployments` section you may not have the required access control, or the inference module is disabled.
+* In the Run:ai user interface go to `Deployments`. If you do not see the `Deployments` section you may not have the required access control, or the inference module is disabled.
 * Select `New Deployment` on the top right.
 * Select `team-a` as a project and add an arbitrary name. Use the image `gcr.io/run-ai-demo/example-triton-server`.
 * Under `Resources` add 0.5 GPUs.
-* Under `Auto Scaling` select a minimum of 1, a maximum of 2. Use the `concurrency` autoscaling threshold method. Add a threshold of 3.
+* Under `Autoscaling` select a minimum of 1, a maximum of 2. Use the `concurrency` autoscaling threshold method. Add a threshold of 3.
 * Add a `Container port` of `8000`.
 
-
 This would start an inference workload for team-a with an allocation of a single GPU. Follow up on the Job's progress using the [Deployment list](../../admin/admin-ui-setup/deployments.md) in the user interface or by running `runai list jobs`
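Outside the Run:ai tooling, you can also watch the workload's pods directly with `kubectl`. A minimal sketch, assuming the Project namespace follows the `runai-<project>` naming used later in this walkthrough:

```bash
# Watch the inference pods for project team-a start and scale.
# Assumes the Run:ai project namespace is runai-team-a.
kubectl get pods -n runai-team-a -w
```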

 ### Query the Inference Server
 
 The specific inference server we just created is accepting queries over port 8000. You can use the Run:ai Triton demo client to send requests to the server:
 
 * Find an IP address by running `kubectl get svc -n runai-team-a`. Use the `inference1-00001-private` Cluster IP.
-* Replace `<IP>` below and run:
+* Replace `<IP>` below and run:
 
 ```
 runai submit inference-client -i gcr.io/run-ai-demo/example-triton-client \
@@ -52,11 +51,10 @@ The specific inference server we just created is accepting queries over port 800
 runai logs inference-client
 ```
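Besides the demo client, a plain HTTP readiness probe is a quick way to confirm the server is answering. A hedged sketch, assuming `curl` is available where you run it and `<IP>` is the Cluster IP found above:

```bash
# Triton serves a readiness endpoint on its HTTP port (8000); an HTTP 200
# response means the server is up. Replace <IP> with the Cluster IP.
curl -s -o /dev/null -w "%{http_code}\n" http://<IP>:8000/v2/health/ready
```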

-
 ### View status on the Run:ai User Interface
 
 * Open the Run:ai user interface.
-* Under _Deployments_ you can view the new Workload. When clicking the workload, note the utilization graphs go up.
+* Under *Deployments* you can view the new Workload. When clicking the workload, note the utilization graphs go up.
 
 ### Stop Workload
 
@@ -66,4 +64,3 @@ Use the user interface to delete the workload.
 
 * You can also create Inference deployments via API. For more information see [Submitting Workloads via YAML](../../developer/cluster-api/submit-yaml.md).
 * See [Deployment](../../admin/admin-ui-setup/deployments.md) user interface.
-

docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md

Lines changed: 17 additions & 12 deletions
@@ -1,4 +1,11 @@
-Below are the prerequisites of a cluster installed with Run:ai.
+---
+title: Prerequisites in a nutshell
+summary: This article outlines the required prerequisites for a Run:ai installation.
+authors:
+- Jason Novich
+- Yaron Goldberg
+date: 2024-Apr-8
+---
 
 ## Prerequisites in a Nutshell

@@ -63,7 +70,6 @@ For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release Hi
 
 #### Pod Security Admission
 
-
 Run:ai version 2.15 and above supports `restricted` policy for [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/){target=_blank} (PSA) on OpenShift only. Other Kubernetes distributions are only supported with `Privileged` policy.
 
 For Run:ai on OpenShift to run with PSA `restricted` policy:
@@ -75,8 +81,9 @@ For Run:ai on OpenShift to run with PSA `restricted` policy:
 pod-security.kubernetes.io/enforce=privileged
 pod-security.kubernetes.io/warn=privileged
 ```
+
 2. The workloads submitted through Run:ai should comply with the restrictions of PSA `restricted` policy, which are dropping all Linux capabilities and setting `runAsNonRoot` to `true`. This can be done and enforced using [Policies](../../workloads/policies/policies.md).
-
+
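For illustration, the namespace labels shown above can be applied in a single command. A sketch only; the namespace name `runai` is an assumption, so substitute the namespace your Run:ai components run in:

```bash
# Apply the privileged Pod Security Admission labels shown above
# (the "runai" namespace name is assumed).
kubectl label namespace runai \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged \
  --overwrite
```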
 ### NVIDIA
 
 Run:ai has been certified on **NVIDIA GPU Operator** 22.9 to 23.9. Older versions (1.10 and 1.11) have a documented [NVIDIA issue](https://github.com/NVIDIA/gpu-feature-discovery/issues/26){target=_blank}.
@@ -123,7 +130,7 @@ Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-nati
 
 === "RKE2"
 * Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2){target=blank} to install the NVIDIA GPU Operator.
-* Make sure to specify the `CONTAINERD_CONFIG` option exactly with the value specified in the document `/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl` even though the file may not exist in your system.
+* Make sure to specify the `CONTAINERD_CONFIG` option exactly with the value specified in the document `/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl` even though the file may not exist in your system.
 
 <!--
 === "RKE2"
@@ -270,22 +277,20 @@ kubectl patch configmap/config-features \
 
 #### Inference Autoscaling
 
-Run:ai allows to autoscale a deployment according to various metrics:
+Run:ai allows to autoscale a deployment using the following metrics:
 
-1. GPU Utilization (%)
-2. CPU Utilization (%)
-3. Latency (milliseconds)
-4. Throughput (requests/second)
-5. Concurrency
-6. Any custom metric
+1. Throughput (requests/second)
+2. Concurrency
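For context, the `config-features` ConfigMap patched above belongs to Knative Serving, and the two supported metrics correspond to Knative Pod Autoscaler targets. A hedged sketch of the equivalent revision-template annotations; the service name and namespace are placeholders, and in practice Run:ai sets these values from the autoscaling settings you choose:

```bash
# Set a concurrency target of 3 with 1-2 replicas on a Knative Service
# ("my-inference" and "runai-team-a" are placeholders).
kubectl patch ksvc my-inference -n runai-team-a --type merge -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "autoscaling.knative.dev/metric": "concurrency",
    "autoscaling.knative.dev/target": "3",
    "autoscaling.knative.dev/min-scale": "1",
    "autoscaling.knative.dev/max-scale": "2"
  }}}}}'
```

Throughput-based scaling uses Knative's `rps` metric in the same annotation.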

+<!--
 Additional installation may be needed for some of the metrics as follows:
 
 * Using *Throughput* or *Concurrency* does not require any additional installation.
 * Any other metric will require installing the [HPA Autoscaler](https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#install-optional-serving-extensions){target=_blank}.
 * Using *GPU Utilization_, *Latency* or *Custom metric* will **also** require the Prometheus adapter. The Prometheus adapter is part of the Run:ai installer and can be added by setting the `prometheus-adapter.enabled` flag to `true`. See [Customizing the Run:ai installation](./customize-cluster-install.md) for further information.
 
 If you wish to use an *existing* Prometheus adapter installation, you will need to configure it manually with the Run:ai Prometheus rules, specified in the Run:ai chart values under `prometheus-adapter.rules` field. For further information please contact Run:ai customer support.
+-->
 
 #### Accessing Inference from outside the Cluster

@@ -304,7 +309,7 @@ However, for the URL to be accessible outside the cluster you must configure you
 -H 'Host: <host-name>'
 ```
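As an illustration of the pattern above, an externally routed request targets the cluster's ingress address while passing the workload's host name in the `Host` header. A sketch only; both placeholders and the readiness path are assumptions:

```bash
# Reach the inference endpoint through the cluster ingress from outside the
# cluster; <external-ip> and <host-name> are placeholders.
curl -s -H 'Host: <host-name>' http://<external-ip>/v2/health/ready
```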

-# Hardware Requirements
+## Hardware Requirements
 
 (see picture below)

docs/admin/workloads/inference-overview.md

Lines changed: 22 additions & 23 deletions
@@ -1,53 +1,52 @@
-
 ---
-title: Inference Overview
+title: Inference overview
+summary: This article describes inference workloads.
+authors:
+- Jason Novich
+date: 2024-Mar-29
 ---
+
 ## What is Inference
 
-Machine learning (ML) inference is the process of running live data points into a machine-learning algorithm to calculate an output.
+Machine learning (ML) inference is the process of running live data points into a machine-learning algorithm to calculate an output.
 
-With Inference, you are taking a trained _Model_ and deploying it into a production environment. The deployment must align with the organization's production standards such as average and 95% response time as well as up-time.
+With Inference, you are taking a trained *Model* and deploying it into a production environment. The deployment must align with the organization's production standards such as average and 95% response time as well as up-time.
 
 ## Inference and GPUs
-
-The Inference process is a subset of the original Training algorithm on a single datum (e.g. one sentence or one image), or a small batch. As such, GPU memory requirements are typically smaller than a full-blown Training process.
 
-Given that, Inference lends itself nicely to the usage of Run:ai Fractions. You can, for example, run 4 instances of an Inference server on a single GPU, each employing a fourth of the memory.
+The inference process is a subset of the original training algorithm on a single datum (for example, one sentence or one image), or a small batch. As such, GPU memory requirements are typically smaller than a full-blown Training process.
+
+Given that, Inference lends itself nicely to the usage of Run:ai Fractions. You can, for example, run 4 instances of an Inference server on a single GPU, each employing a fourth of the memory.
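To make the fractions point concrete, a quarter-GPU request with the Run:ai CLI might look like the sketch below; the job name is a placeholder and the fractional `-g` value is only an example:

```bash
# Request 0.25 of a GPU for a single workload via the Run:ai CLI
# ("triton-quarter" is a placeholder job name).
runai submit triton-quarter -i gcr.io/run-ai-demo/example-triton-server -g 0.25
```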

 ## Inference @Run:ai
 
-Run:ai provides Inference services as an equal part together with the other two Workload types: _Train_ and _Build_.
+Run:ai provides Inference services as an equal part together with any other Workload type that is available.
 
-* Inference is considered a high-priority workload as it is customer-facing. Running an Inference workload (within the Project's quota) will preempt any Run:ai Workload marked as _Training_.
+* Inference is considered a high-priority workload as it is customer-facing. Running an Inference workload (within the Project's quota) will preempt any Run:ai Workload marked as *Training*.
 
-* Inference workloads will receive priority over _Train_ and _Build_ workloads during scheduling.
+* Inference workloads will receive priority over *Train* and *Build* workloads during scheduling.
 
-* Inference is implemented as a Kubernetes _Deployment_ object with a defined number of replicas. The replicas are load-balanced by Kubernetes so adding more replicas will improve the overall throughput of the system.
+* Inference is implemented as a Kubernetes *Deployment* object with a defined number of replicas. The replicas are load-balanced by Kubernetes so adding more replicas will improve the overall throughput of the system.
 
 * Multiple replicas will appear in Run:ai as a single Inference workload. The workload will appear in all Run:ai dashboards and views as well as the Command-line interface.
 
-* Inference workloads can be submitted via Run:ai [user interface](../admin-ui-setup/deployments.md) as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes _Service_. The service is an end-point to which clients can connect.
+* Inference workloads can be submitted via Run:ai [user interface](../admin-ui-setup/deployments.md) as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
 
-## Auto Scaling
+## Autoscaling
 
-To withstand SLA, Inference workloads are typically set with _auto scaling_. Auto-scaling is the ability to add more computing power (Kubernetes pods) when the load increases and shrink allocated resources when the system is idle.
+To withstand SLA, Inference workloads are typically set with *autoscaling*. Autoscaling is the ability to add more computing power (Kubernetes pods) when the load increases and shrink allocated resources when the system is idle.
 
-There are a number of ways to trigger auto-scaling. Run:ai supports the following:
+There are a number of ways to trigger autoscaling. Run:ai supports the following:
 
 | Metric | Units | Run:ai name |
 |-----------------|--------------|-----------------|
-| GPU Utilization | % | gpu-utilization |
-| CPU Utilization | % | cpu-utilization |
-| Latency | milliseconds | latency |
 | Throughput | requests/second | throughput |
-| Concurrency | | concurrency |
-| Custom metric | | custom |
+| Concurrency | | concurrency |
 
 The Minimum and Maximum number of replicas can be configured as part of the autoscaling configuration.

-Auto Scaling also supports a scale to zero policy with _Throughput_ and _Concurrency_ metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
-This has the benefit of conserving resources at the risk of a delay from "cold starting" the model when traffic resumes.
-
+Autoscaling also supports a scale to zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
+This has the benefit of conserving resources at the risk of a delay from "cold starting" the model when traffic resumes.
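Scale to zero maps to Knative Serving's autoscaler settings. A hedged sketch for checking the cluster-wide switch, assuming Knative Serving sits in its default `knative-serving` namespace:

```bash
# Print Knative's cluster-wide scale-to-zero switch; empty output means the
# default (enabled) is in effect. A per-revision minimum of 0 replicas is
# what lets a specific workload scale down to zero.
kubectl get configmap config-autoscaler -n knative-serving \
  -o jsonpath='{.data.enable-scale-to-zero}'
```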

 ## See Also
