Commit 6525dec

Merge branch 'v2.17' into RUN-17809-Developer-pages-corrections
2 parents add8eca + 19c3997 commit 6525dec

14 files changed: +313 −59 lines changed

docs/Researcher/scheduling/dynamic-gpu-fractions.md

Lines changed: 23 additions & 12 deletions
@@ -7,54 +7,52 @@ date: 2023-10-31
---

## Introduction

- Many AI workloads as researchers' notebooks are using GPU resources intermittently. This means that these resources are not used all the time, but only when needed for running AI applications, or debugging a model in development. Other workloads such as Inference, might be using GPU resources at lower utilization rate than requested, and may suddenly ask for higher guaranteed resources at peak utilization times.
+ Many AI workloads are using GPU resources intermittently and sometimes these resources are not used at all. These AI workloads need these resources when they are running AI applications, or debugging a model in development. Other workloads such as Inference, might be using GPU resources at a lower utilization rate than requested, and may suddenly ask for higher guaranteed resources at peak utilization times.

This pattern of resource request vs. actual resource utilization causes lower utilization of GPUs. This mainly happens if there are many workloads requesting resources to match their peak demand, even though the majority of the time they operate far below that peak.

- Run:ai has introduced *Dynamic GPU fractions* in v2.15 in order to cope with resource request vs. actual resource utilization to enable users to optimize GPU resources.
+ Run:ai has introduced *Dynamic GPU fractions* in v2.15 to cope with resource request vs. actual resource utilization which enables users to optimize GPU resource usage.

*Dynamic GPU fractions* is part of Run:ai's core capabilities to enable workloads to optimize the use of GPU resources. This works by providing the ability to specify and consume GPU memory and compute resources dynamically by leveraging Kubernetes *Request and Limit notations*.

- *Dynamic GPU fractions* allows a workload to request a guaranteed fraction of GPU memory or GPU compute resource (similar to a Kubernetes request), and at the same time also request the ability to grow beyond that guaranteed request up to a specific limit (similar to a Kubernetes limit), if the resources are available.
+ *Dynamic GPU fractions* allow a workload to request a guaranteed fraction of GPU memory or GPU compute resource (similar to a Kubernetes request), and at the same time also request the ability to grow beyond that guaranteed request up to a specific limit (similar to a Kubernetes limit), if the resources are available.

- For example, with *Dynamic GPU Fractions*, a user can specify a workload with a GPU fraction Request of 0.25 GPU, and add the parameter `gpu-fraction-limit` of up to 0.80 GPU. The cluster/node-pool scheduler schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy memory between 0 to 0.80 of the GPU memory (based on the parameter `gpu-fraction-limit`), where only 0.25 of the GPU memory is actually guaranteed to that workload. The rest of the memory (from 0.25 to 0.8) is “loaned” to the workload, as long as it is not needed by other workloads.
+ For example, with *Dynamic GPU Fractions*, a user can specify a workload with a GPU fraction Request of 0.25 GPU, and add the parameter `gpu-fraction-limit` of up to 0.80 GPU. The cluster/node-pool scheduler schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy memory between 0 to 0.80 of the GPU memory (based on the parameter `gpu-fraction-limit`), where only 0.25 of the GPU memory is guaranteed to that workload. The rest of the memory (from 0.25 to 0.8) is “loaned” to the workload, as long as it is not needed by other workloads.
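
To make the example above concrete, a hedged sketch of such a request/limit pair follows. The `gpu-fraction` and `gpu-fraction-limit` names are taken from this page; their placement as pod annotations, the placeholder image, and the omission of other Run:ai-specific fields are assumptions for illustration only.

```YAML
# Hedged sketch, not a verified reference example.
apiVersion: v1
kind: Pod
metadata:
  name: dynamic-fraction-example      # placeholder name
  annotations:
    gpu-fraction: "0.25"              # guaranteed share, similar to a Kubernetes request
    gpu-fraction-limit: "0.80"        # burstable ceiling, similar to a Kubernetes limit
spec:
  containers:
    - name: main                      # the GPU-consuming container
      image: python:3.11              # placeholder image
      command: ["sleep", "infinity"]
```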

Run:ai automatically manages the state changes between `request` and `Limit` as well as the reverse (when the balance needs to be "returned"), updating the metrics and workloads' states and graphs.

- ## Setting Fractional GPU Memory Limits
+ ## Setting Fractional GPU Memory Limit

With the fractional GPU memory limit, users can submit workloads using GPU fraction `Request` and `Limit`.

You can either:

1. Use a GPU Fraction parameter (use the `gpu-fraction` annotation)

- or
+ or

2. Use an absolute GPU Memory parameter (`gpu-memory` annotation)

- When set a GPU memory limit either as GPU fraction, or GPU memory size, the `Limit` must be equal or greater than the GPU fraction memory request.
+ When setting a GPU memory limit either as GPU fraction, or GPU memory size, the `Limit` must be equal or greater than the GPU fraction memory request.

- Both GPU fraction, and GPU memory are translated into the actual requested memory size of the Request (guaranteed) and the Limit (burstable).
+ Both GPU fraction and GPU memory are translated into the actual requested memory size of the Request (guaranteed resources) and the Limit (burstable resources).

To guarantee fair quality of service between different workloads using the same GPU, Run:ai developed an extendable GPU `OOMKiller` (Out Of Memory Killer) component that guarantees the quality of service using Kubernetes semantics for resources Request and Limit.

The `OOMKiller` capability requires adding `CAP_KILL` capabilities to the *Dynamic GPU fraction* and to the Run:ai core scheduling module (toolkit daemon). This capability is disabled by default.

- To enable *Dynamic GPU Fraction* it, edit the `runaiconfig` file and set:
+ To change the state of *Dynamic GPU Fraction* in the cluster, edit the `runaiconfig` file and set:

```YAML
spec:
  global:
    core:
      dynamicFraction:
-        enabled: true
+        enabled: true # Boolean field default is true.
```

To set the gpu memory limit per workload, add the `RUNAI_GPU_MEMORY_LIMIT` environment variable to the first container in the pod. This is the GPU consuming container.

- <!-- TODO: add example yaml). -->
-
To use the `RUNAI_GPU_MEMORY_LIMIT` environment variable:

1. Submit a workload yaml directly, and set the `RUNAI_GPU_MEMORY_LIMIT` environment variable.
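
For instance, a minimal sketch of step 1 might look like the YAML below. Only the `RUNAI_GPU_MEMORY_LIMIT` variable name and its placement on the first (GPU-consuming) container come from this page; the pod shape, the placeholder image, and the example value are assumptions, and a real submission would also need the usual Run:ai project and scheduling fields.

```YAML
# Hedged sketch, not a verified reference example.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-memory-limit-example      # placeholder name
spec:
  containers:
    - name: main                      # first container = the GPU-consuming one
      image: python:3.11              # placeholder image
      command: ["sleep", "infinity"]
      env:
        - name: RUNAI_GPU_MEMORY_LIMIT
          value: "4G"                 # example burstable GPU memory limit
```
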
@@ -69,3 +67,16 @@ The supported values depend on the label used. You can use them in either the UI
| --- | --- |
| `gpu-fraction` | A fraction value (for example: 0.25, 0.75). |
| `gpu-memory` | A Kubernetes resource quantity which **must** be larger than the `gpu-memory` request. For example, 500000000, 2500M, 4G. **NOTE**: The `gpu-memory` label values are always in MB, unlike the env variable. |
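
As a sketch of the absolute-memory alternative: the `gpu-memory` key and the example value come from the table above, while its use as a pod annotation (the page refers to it both as an annotation and as a label) and the surrounding fields are assumptions for illustration only.

```YAML
# Hedged sketch, not a verified reference example.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-memory-example            # placeholder name
  annotations:
    gpu-memory: "2500M"               # absolute GPU memory instead of a fraction
spec:
  containers:
    - name: main                      # the GPU-consuming container
      image: python:3.11              # placeholder image
      command: ["sleep", "infinity"]
```
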
+
+ ## Compute Resources UI with Dynamic Fractions support
+
+ To enable the UI elements for Dynamic Fractions, press *Settings*, *General*, then open the *Resources* pane and toggle *GPU Resource Optimization*. This enables all the UI features related to *GPU Resource Optimization* for the whole tenant. There are other per-cluster or per-node-pool configurations that should be configured in order to use the capabilities of ‘GPU Resource Optimization’. See the documentation for each of these features.
+ Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the [Metrics](../../admin/workloads/submitting-workloads.md#workloads-table) pane for each workload.
+
+ ![GPU Limit](img/GPU-resource-limit-enabled.png)
+
+ !!! Note
+     To use Dynamic Fractions, *GPU devices per pod* must be equal to 1. If more than 1 GPU device is used per pod, or if a MIG profile is selected, Dynamic Fractions cannot be used for that Compute Resource (and any related pods).
+
+ !!! Note
+     When setting a workload with Dynamic Fractions (for example, when using it with GPU Request or GPU memory Limits), you effectively make the workload burstable. This means it can use memory that is not guaranteed for that workload and is susceptible to an ‘OOM Kill’ signal if the actual owner of that memory requires it back. This applies to non-preemptive workloads as well. For that reason, it is recommended that you use Dynamic Fractions with Interactive workloads running Notebooks. Notebook pods are not evicted when their GPU process is OOM-Killed. This behavior is the same as standard Kubernetes burstable CPU workloads.

(4 image files changed: 15.2 KB, 14.3 KB, 18.2 KB, 17.1 KB)

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
---
title: Optimize performance with Node Level Scheduler
summary: This article describes the Node Level Scheduler, which deals with how node scheduling is optimized.
authors:
  - Hagay Sharon
  - Jason Novich
date: 2024-Apr-4
---

The Node Level Scheduler optimizes the performance of your pods and maximizes the utilization of GPUs by making optimal local decisions on GPU allocation to your pods. While the Cluster Scheduler chooses the specific node for a pod, it has no visibility into the internal state of that node’s GPUs. The Node Level Scheduler, by contrast, is aware of the local GPU states and makes optimal local decisions that optimize both GPU utilization and the performance of the pods running on the node’s GPUs.

Node Level Scheduler applies to all workload types, but it best optimizes the performance of burstable workloads, giving them more GPU memory than requested, up to the limit specified. Be aware that burstable workloads are always susceptible to an OOM Kill signal if the owner of the excess memory requires it back. This means that using the Node Level Scheduler with Inference or Training workloads may cause pod preemption. Interactive workloads that are using notebooks behave differently, since the OOM Kill signal causes the notebook's GPU process to exit but not the notebook itself. This keeps the Interactive pod running and retrying to attach a GPU again. This makes Interactive workloads with notebooks a great use case for burstable workloads and the Node Level Scheduler.

## Interactive Notebooks Use Case

Consider the following example of a node with 2 GPUs and 2 interactive pods that are submitted and want GPU resources.

![Unallocated GPU nodes](img/gpu-node-1.png)

The Scheduler instructs the node to put the two pods on a single GPU, bin packing a single GPU and leaving the other free for a workload that might want a full GPU or more than half a GPU. However, that would mean GPU#2 is idle while the two notebooks can only use up to half a GPU, even if they temporarily need more.

![Single allocated GPU node](img/gpu-node-2.png)

However, with Node Level Scheduler enabled, the local decision will be to spread those two pods across two GPUs, maximizing both pods’ performance and the GPUs’ utilization by letting each pod burst out up to the full GPU memory and GPU compute resources.

![Two allocated GPU nodes](img/gpu-node-3.png)

The Cluster Scheduler still sees a node with one fully free GPU.
When a 3rd pod is scheduled, and it requires a full GPU (or more than 0.5 GPU), the scheduler will send it to that node, and the Node Level Scheduler will move one of the Interactive workloads to run with the other pod on GPU#1, as was the Cluster Scheduler’s initial plan.

![Node Level Scheduler locally optimized GPU nodes](img/gpu-node-4.png)

This is an example of one scenario that shows how the Node Level Scheduler locally optimizes and maximizes GPU utilization and pods’ performance.

## How to configure Node Level Scheduler

Node Level Scheduler can be enabled per Node-Pool, giving the Administrator the option to decide which Node-Pools will be used with this new feature.

To use Node Level Scheduler, the Administrator should follow these steps:

1. Enable Node Level Scheduler at the cluster level (per cluster). Edit the `runaiconfig` file and set:

    ```YAML
    spec:
      global:
        core:
          nodeScheduler:
            enabled: true
    ```

    The Administrator can also use this patch command to perform the change:

    ```bash
    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"nodeScheduler":{"enabled": true}}}}}'
    ```

2. To enable ‘GPU Resource Optimization’ on your tenant, go to your tenant’s UI and press *Tools & Settings*, *General*, then open the *Resources* pane and toggle *GPU Resource Optimization* on.

3. To enable ‘Node Level Scheduler’ on any of the Node-Pools you want to use this feature with, go to the tenant’s UI ‘Node Pools’ tab (under ‘Nodes’), and either create a new Node-Pool or edit an existing Node-Pool. In the Node-Pool’s form, under the ‘Resource Utilization Optimization’ tab, change the ‘Number of workloads on each GPU’ to any value other than ‘Not Enforced’ (for example, 2, 3, 4, 5).

The Node Level Scheduler is now ready to be used on that Node-Pool.

docs/Researcher/scheduling/the-runai-scheduler.md

Lines changed: 2 additions & 1 deletion
@@ -59,7 +59,7 @@ Every new workload is associated with a Project. The Project contains a deserved

**Node pools are enabled**

- Every new workload is associated with a Project. The Project contains a deserved GPU quota that is the sum off all node pools GPU quotas. During scheduling:
+ Every new workload is associated with a Project. The Project contains a deserved GPU quota that is the sum of all node pools GPU quotas. During scheduling:

* If the newly required resources, together with currently used resources, end up within the overall Project's quota and the requested node pool(s) quota, then the workload is ready to be scheduled as part of the guaranteed quota.
* If the newly required resources together with currently used resources end up above the Project's quota or the requested node pool(s) quota, the workload will only be scheduled if there are 'spare' GPU resources within the same node pool but not part of this Project. There are nuances in this flow that are meant to ensure that a Project does not end up with an over-quota made entirely of interactive workloads. For additional details see below.
@@ -100,6 +100,7 @@ The Run:ai scheduler wakes up periodically to perform allocation tasks on pendin
* The scheduler then recalculates the next 'deprived' Project and continues with the same flow until it finishes attempting to schedule all workloads

### Node Pools
+
A *Node Pool* is a set of nodes grouped by an Administrator into a distinct group of resources from which resources can be allocated to Projects and Departments.
By default, any node pool created in the system is automatically associated with all Projects and Departments using zero quota resource (GPUs, CPUs, Memory) allocation. This allows any Project and Department to use any node pool with Over-Quota (for Preemptible workloads), thus maximizing the system resource utilization.