---
title: Inference overview
summary: This article describes inference workloads.
authors:
  - Jason Novich
date: 2024-Mar-29
---

## What is Inference

Machine learning (ML) inference is the process of running live data points through a machine-learning algorithm to calculate an output.

With Inference, you take a trained *Model* and deploy it into a production environment. The deployment must align with the organization's production standards, such as average and 95th-percentile response time, as well as up-time.

## Inference and GPUs

The inference process is a subset of the original training algorithm, run on a single datum (for example, one sentence or one image) or a small batch. As such, GPU memory requirements are typically smaller than those of a full-blown training process.

Given that, inference lends itself nicely to the use of Run:ai Fractions. You can, for example, run 4 instances of an inference server on a single GPU, each using a fourth of the GPU memory, as in the sketch below.
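
As a rough sketch of the idea, the manifest below runs 4 replicas of an inference server, each limited to a quarter of a GPU via a fractional-GPU pod annotation. The workload name, image, port, `gpu-fraction` annotation, and `runai-scheduler` scheduler name are illustrative assumptions, not the exact Run:ai interface; in practice you would submit the workload through the Run:ai UI, CLI, or API.

```yaml
# Illustrative only: 4 inference replicas, each requesting 0.25 of a GPU.
# The annotation and scheduler names are assumptions; check your Run:ai
# version's documentation for the exact fractional-GPU mechanism.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-inference                # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: demo-inference
  template:
    metadata:
      labels:
        app: demo-inference
      annotations:
        gpu-fraction: "0.25"          # assumed Run:ai fractional-GPU annotation
    spec:
      schedulerName: runai-scheduler  # assumed Run:ai scheduler name
      containers:
        - name: server
          image: example.com/inference-server:latest   # placeholder image
          ports:
            - containerPort: 8080
```

Note that the fraction only bounds each replica's share of GPU memory; whether all four replicas are packed onto the same physical GPU depends on the scheduler's placement decisions.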

## Inference @Run:ai

Run:ai provides Inference services on an equal footing with any other available Workload type.

* Inference is considered a high-priority workload as it is customer-facing. Running an Inference workload (within the Project's quota) will preempt any Run:ai Workload marked as *Training*.

* Inference workloads receive priority over *Train* and *Build* workloads during scheduling.

* Inference is implemented as a Kubernetes *Deployment* object with a defined number of replicas. The replicas are load-balanced by Kubernetes, so adding more replicas improves the overall throughput of the system.

* Multiple replicas appear in Run:ai as a single Inference workload. The workload appears in all Run:ai dashboards and views, as well as in the command-line interface.

* Inference workloads can be submitted via the Run:ai [user interface](../admin-ui-setup/deployments.md) as well as the [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is the end-point to which clients connect (see the sketch after this list).
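
To illustrate the *Service* end-point mentioned in the last bullet, here is a minimal sketch of the kind of object that gets created, assuming the hypothetical `demo-inference` Deployment from the earlier example. Run:ai creates the real Service for you when the workload is spawned; the name and ports below are placeholders.

```yaml
# Illustrative only: Run:ai creates an equivalent Service automatically when
# an Inference workload is submitted; this shows what such an end-point looks like.
apiVersion: v1
kind: Service
metadata:
  name: demo-inference        # hypothetical name
spec:
  selector:
    app: demo-inference       # matches the Deployment's pod labels
  ports:
    - port: 80                # port clients connect to
      targetPort: 8080        # container port of the inference server
```

Clients inside the cluster can then reach the model at `http://demo-inference`, with Kubernetes load-balancing requests across the replicas.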

## Autoscaling

To meet its SLA, an Inference workload is typically configured with *autoscaling*. Autoscaling is the ability to add more computing power (Kubernetes pods) when the load increases and to shrink the allocated resources when the system is idle.

There are a number of ways to trigger autoscaling. Run:ai supports the following:

| Metric      | Units           | Run:ai name |
|-------------|-----------------|-------------|
| Throughput  | requests/second | throughput  |
| Concurrency |                 | concurrency |

The minimum and maximum number of replicas can be configured as part of the autoscaling configuration.

Autoscaling also supports a scale-to-zero policy with the *Throughput* and *Concurrency* metrics, meaning that, given enough time under the target threshold, the number of replicas is scaled down to 0.
This has the benefit of conserving resources, at the risk of a delay caused by "cold starting" the model when traffic resumes. A schematic example follows.
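
As a purely schematic example of such a configuration, the sketch below uses Knative Serving's autoscaling annotations, which express the same concepts: a metric, a per-replica target, minimum and maximum replicas, and scale to zero when the minimum is 0. Whether your Run:ai version runs on Knative underneath, and the exact knobs it exposes, are assumptions here; in practice these settings are configured through the Run:ai UI or API.

```yaml
# Schematic example using Knative Serving annotations; not the Run:ai API itself.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: demo-inference                                  # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"   # or "rps" for throughput
        autoscaling.knative.dev/target: "10"            # target value per replica
        autoscaling.knative.dev/min-scale: "0"          # 0 enables scale to zero
        autoscaling.knative.dev/max-scale: "8"          # upper bound on replicas
    spec:
      containers:
        - image: example.com/inference-server:latest    # placeholder image
```

With `min-scale: "0"`, replicas are removed after a sustained period without traffic, which is the source of the "cold start" delay noted above.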

## See Also