
Commit 8d09bc7

Merge pull request #735 from jasonnovichRunAI/v2.17-RUN-15154-inference-workloads
V2.17-RUN-15154-inference-workloads
2 parents 6bf8771 + 3b51f48 · commit 8d09bc7

File tree

3 files changed: +108 -57 lines changed


.github/workflows/publish-docs.yaml

Lines changed: 2 additions & 2 deletions
@@ -33,13 +33,13 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: checkout latest
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
         with:
           ref: ${{ inputs.version }}
           fetch-depth: 0

       - name: setup python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: '3.9'
           cache: 'pip' # caching pip dependencies
Lines changed: 15 additions & 14 deletions
@@ -1,6 +1,6 @@
 ---
 title: Inference overview
-summary: This article describes inference workloads.
+summary: This article summarizes machine learning inference workloads.
 authors:
   - Jason Novich
 date: 2024-Mar-29
@@ -10,31 +10,31 @@ date: 2024-Mar-29
 
 Machine learning (ML) inference is the process of running live data points into a machine-learning algorithm to calculate an output.
 
-With Inference, you are taking a trained *Model* and deploying it into a production environment. The deployment must align with the organization's production standards such as average and 95% response time as well as up-time.
+With *Inference* workloads, you are taking a trained *Model* and deploying it into a production environment. The deployment must align with the organization's production standards such as average and 95% response time as well as up-time.
 
 ## Inference and GPUs
 
-The inference process is a subset of the original training algorithm on a single datum (for example, one sentence or one image), or a small batch. As such, GPU memory requirements are typically smaller than a full-blown Training process.
+The *Inference* process is a subset of the original Training algorithm on a single datum (e.g. one sentence or one image), or a small batch. As such, GPU memory requirements are typically smaller than a full-blown Training process.
 
-Given that, Inference lends itself nicely to the usage of Run:ai Fractions. You can, for example, run 4 instances of an Inference server on a single GPU, each employing a fourth of the memory.
+Given that, *Inference* lends itself nicely to the usage of Run:ai Fractions. You can, for example, run 4 instances of an *Inference* server on a single GPU, each employing a fourth of the memory.
 
 ## Inference @Run:ai
 
-Run:ai provides Inference services as an equal part together with any other Workload type that is available.
+Run:ai provides *Inference* services as an equal part together with the other two Workload types: *Train* and *Build*.
 
-* Inference is considered a high-priority workload as it is customer-facing. Running an Inference workload (within the Project's quota) will preempt any Run:ai Workload marked as *Training*.
+* *Inference* is considered a high-priority workload as it is customer-facing. Running an *Inference* workload (within the Project's quota) will preempt any Run:ai Workload marked as *Training*.
 
-* Inference workloads will receive priority over *Train* and *Build* workloads during scheduling.
+* *Inference* workloads will receive priority over *Train* and *Build* workloads during scheduling.
 
-* Inference is implemented as a Kubernetes *Deployment* object with a defined number of replicas. The replicas are load-balanced by Kubernetes so adding more replicas will improve the overall throughput of the system.
+* *Inference* is implemented as a Kubernetes *Deployment* object with a defined number of replicas. The replicas are load-balanced by Kubernetes so adding more replicas will improve the overall throughput of the system.
 
-* Multiple replicas will appear in Run:ai as a single Inference workload. The workload will appear in all Run:ai dashboards and views as well as the Command-line interface.
+* Multiple replicas will appear in Run:ai as a single *Inference* workload. The workload will appear in all Run:ai dashboards and views as well as the Command-line interface.
 
 * Inference workloads can be submitted via Run:ai [user interface](../admin-ui-setup/deployments.md) as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
 
 ## Autoscaling
 
-To withstand SLA, Inference workloads are typically set with *autoscaling*. Autoscaling is the ability to add more computing power (Kubernetes pods) when the load increases and shrink allocated resources when the system is idle.
+To withstand SLA, *Inference* workloads are typically set with *auto scaling*. Auto-scaling is the ability to add more computing power (Kubernetes pods) when the load increases and shrink allocated resources when the system is idle.
 
 There are a number of ways to trigger autoscaling. Run:ai supports the following:
 
@@ -46,11 +46,12 @@ There are a number of ways to trigger autoscaling. Run:ai supports the following
 The Minimum and Maximum number of replicas can be configured as part of the autoscaling configuration.
 
 Autoscaling also supports a scale to zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
+
 This has the benefit of conserving resources at the risk of a delay from "cold starting" the model when traffic resumes.
 
 ## See Also
 
-* To set up Inference, see [Cluster installation prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#inference).
-* For running Inference see [Inference quick-start](../../Researcher/Walkthroughs/quickstart-inference.md).
-* To run Inference from the user interface see [Deployments](../admin-ui-setup/deployments.md).
-* To run Inference using API see [Workload overview](../../developer/cluster-api/workload-overview-dev.md).
+* To set up *Inference*, see [Cluster installation prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#inference).
+* For running *Inference* see [Inference quick-start](../../Researcher/Walkthroughs/quickstart-inference.md).
+* To run *Inference* from the user interface see [Deployments](../admin-ui-setup/deployments.md).
+* To run *Inference* using API see [Workload overview](../../developer/cluster-api/workload-overview-dev.md).
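Editor's note (not part of this commit): the autoscaling hunk above mentions minimum/maximum replicas, *Throughput* and *Concurrency* metrics, and a scale-to-zero policy, but does not show how such a policy is declared. Purely as an illustration of the concept, one common way to express it in Kubernetes is with Knative Serving autoscaling annotations; the service name, image, and target values below are hypothetical, and this is not necessarily how Run:ai implements its autoscaler.

```yaml
# Hypothetical sketch using Knative Serving annotations to illustrate the
# concepts in the doc: a per-replica concurrency target, replica bounds,
# and scale-to-zero when the workload sits below the target long enough.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: inference-server            # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"  # scale on in-flight requests
        autoscaling.knative.dev/target: "10"           # desired concurrency per replica
        autoscaling.knative.dev/min-scale: "0"         # allow scale to zero when idle
        autoscaling.knative.dev/max-scale: "8"         # upper bound on replicas
    spec:
      containers:
        - image: example.com/models/my-model:latest    # placeholder image
```

With `min-scale: "0"`, the first request after an idle period pays the "cold start" delay the doc mentions, in exchange for freeing resources while traffic is absent.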
