**docs/Researcher/scheduling/dynamic-gpu-fractions.md** (+23 −12)

## Introduction
Many AI workloads use GPU resources intermittently, and sometimes these resources are not used at all. Such workloads need GPU resources only while running AI applications or debugging a model in development. Other workloads, such as inference, might use GPU resources at a lower utilization rate than requested, yet may suddenly require higher guaranteed resources at peak utilization times.
This gap between requested resources and actual resource utilization leads to lower GPU utilization. It mainly happens when many workloads request resources to match their peak demand, even though the majority of the time they operate far below that peak.
Run:ai introduced *Dynamic GPU fractions* in v2.15 to cope with this gap between resource requests and actual resource utilization, enabling users to optimize GPU resource usage.
*Dynamic GPU fractions* is part of Run:ai's core capabilities to enable workloads to optimize the use of GPU resources. This works by providing the ability to specify and consume GPU memory and compute resources dynamically by leveraging Kubernetes *Request and Limit notations*.
*Dynamic GPU fractions* allow a workload to request a guaranteed fraction of GPU memory or GPU compute resources (similar to a Kubernetes request), and at the same time request the ability to grow beyond that guaranteed request up to a specific limit (similar to a Kubernetes limit), if the resources are available.
For example, with *Dynamic GPU Fractions*, a user can specify a workload with a GPU fraction Request of 0.25 GPU, and add the parameter `gpu-fraction-limit` of up to 0.80 GPU. The cluster/node-pool scheduler schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy memory between 0 and 0.80 of the GPU memory (based on the parameter `gpu-fraction-limit`), where only 0.25 of the GPU memory is guaranteed to that workload. The rest of the memory (from 0.25 to 0.80) is “loaned” to the workload, as long as it is not needed by other workloads.
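As a rough illustration of the example above, the sketch below shows how such a workload might be described. This is only a sketch: the parameter names `gpu-fraction` and `gpu-fraction-limit` are taken from the paragraph above, but their placement as pod annotations, the `runai-scheduler` scheduler name, and all other names and values are assumptions that may differ in your Run:ai version.

```YAML
# Sketch only: annotation placement and all names/values are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: dynamic-fraction-example    # hypothetical workload name
  annotations:
    gpu-fraction: "0.25"            # guaranteed fraction of the GPU (like a Kubernetes request)
    gpu-fraction-limit: "0.80"      # burstable ceiling (like a Kubernetes limit)
spec:
  schedulerName: runai-scheduler    # assumed scheduler name for Run:ai-managed pods
  containers:
    - name: main
      image: python:3.11
      command: ["python", "train.py"]   # hypothetical GPU-consuming workload
```

With these values the workload is guaranteed 0.25 of the GPU memory and may temporarily borrow up to 0.80, as long as the borrowed portion is not needed by other workloads.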
Run:ai automatically manages the state changes between `Request` and `Limit`, as well as the reverse (when the balance needs to be "returned"), updating the metrics and the workloads' states and graphs.
## Setting Fractional GPU Memory Limit
With the fractional GPU memory limit, users can submit workloads using GPU fraction `Request` and `Limit`.
You can either:
1. Use a GPU Fraction parameter (use the `gpu-fraction` annotation)
or
2. Use an absolute GPU Memory parameter (`gpu-memory` annotation)
When setting a GPU memory limit, either as a GPU fraction or as a GPU memory size, the `Limit` must be equal to or greater than the GPU fraction memory request.
Both GPU fraction and GPU memory are translated into the actual requested memory size of the Request (guaranteed resources) and the Limit (burstable resources). For example, on a GPU with 40 GB of memory, a 0.25 fraction Request translates to 10 GB of guaranteed GPU memory, and a 0.80 fraction Limit to a 32 GB burstable ceiling.
To guarantee fair quality of service between different workloads using the same GPU, Run:ai developed an extendable GPU `OOMKiller` (Out Of Memory Killer) component that enforces quality of service using Kubernetes semantics for resource Requests and Limits.
The `OOMKiller` capability requires adding `CAP_KILL` capabilities to the *Dynamic GPU fraction* and to the Run:ai core scheduling module (toolkit daemon). This capability is disabled by default.
To change the state of *Dynamic GPU Fraction* in the cluster, edit the `runaiconfig` file and set:
```YAML
spec:
  global:
    core:
      dynamicFraction:
        enabled: true # Boolean field, default is true.
```
To set the GPU memory limit per workload, add the `RUNAI_GPU_MEMORY_LIMIT` environment variable to the first container in the pod. This is the GPU-consuming container.
To use the `RUNAI_GPU_MEMORY_LIMIT` environment variable:
1. Submit a workload yaml directly, and set the `RUNAI_GPU_MEMORY_LIMIT` environment variable.
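   A minimal sketch of such a workload YAML is shown below, assuming the variable is set on the first (GPU-consuming) container as described above. The names, image, and values are illustrative only.

```YAML
# Sketch only: names and values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-memory-limit-example    # hypothetical workload name
  annotations:
    gpu-fraction: "0.25"            # guaranteed GPU memory fraction (request)
spec:
  schedulerName: runai-scheduler    # assumed scheduler name for Run:ai-managed pods
  containers:
    - name: main                    # the first container is the GPU-consuming container
      image: python:3.11
      env:
        - name: RUNAI_GPU_MEMORY_LIMIT
          value: "4G"               # burstable GPU memory ceiling for this workload
```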
The supported values depend on the label used.

| Label | Value |
| --- | --- |
| `gpu-fraction` | A fraction value (for example: 0.25, 0.75). |
| `gpu-memory` | A Kubernetes resource quantity which **must** be larger than the requested `gpu-memory`. For example, 500000000, 2500M, 4G. **NOTE**: The `gpu-memory` label values are always in MB, unlike the environment variable. |
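Similarly, a hedged sketch of the absolute-memory form uses the `gpu-memory` label from the table above as the guaranteed request. Its placement as a pod annotation, like everything else in this snippet, is an assumption to be verified against your Run:ai version.

```YAML
# Sketch only: label placement and values are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-memory-request-example   # hypothetical workload name
  annotations:
    gpu-memory: "2500M"              # absolute guaranteed GPU memory request (see value note above)
spec:
  schedulerName: runai-scheduler     # assumed scheduler name for Run:ai-managed pods
  containers:
    - name: main
      image: python:3.11
```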
## Compute Resources UI with Dynamic Fractions support
To enable the UI elements for Dynamic Fractions, press *Settings*, *General*, then open the *Resources* pane and toggle *GPU Resource Optimization*. This enables all the UI features related to *GPU Resource Optimization* for the whole tenant. There are additional per-cluster and per-node-pool configurations that must be set in order to use the capabilities of *GPU Resource Optimization*; see the documentation for each of these features.
Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the [Metrics](../../admin/workloads/submitting-workloads.md#workloads-table) pane for each workload.

!!! Note
    To use Dynamic Fractions, *GPU devices per pod* must be equal to 1. If more than 1 GPU device is used per pod, or if a MIG profile is selected, Dynamic Fractions cannot be used for that Compute Resource (and any related pods).
!!! Note
    When setting a workload with Dynamic Fractions (for example, when using it with a GPU Request and a GPU memory Limit), you effectively make the workload burstable. This means it can use memory that is not guaranteed for that workload and is susceptible to an 'OOM Kill' signal if the actual owner of that memory requires it back. This applies to non-preemptive workloads as well. For that reason, it is recommended that you use Dynamic Fractions with Interactive workloads running Notebooks. Notebook pods are not evicted when their GPU process is OOM-Killed. This behavior is the same as standard Kubernetes burstable CPU workloads.

---
title: Optimize performance with Node Level Scheduler
summary: This article describes the Node Level Scheduler and how node scheduling is optimized.
authors:
- Hagay Sharon
- Jason Novich
date: 2024-Apr-4
---
The Node Level Scheduler optimizes the performance of your pods and maximizes GPU utilization by making optimal local decisions on GPU allocation to your pods. While the Cluster Scheduler chooses the specific node for a pod, it has no visibility into the internal state of that node's GPUs. The Node Level Scheduler is aware of the local GPU state and makes optimal local decisions, optimizing both GPU utilization and the performance of the pods running on the node's GPUs.
The Node Level Scheduler applies to all workload types, but it best optimizes the performance of burstable workloads, giving them more GPU memory than requested, up to the specified limit. Be aware that burstable workloads are always susceptible to an OOM Kill signal if the owner of the excess memory requires it back. This means that using the Node Level Scheduler with Inference or Training workloads may cause pod preemption. Interactive workloads that use notebooks behave differently, since the OOM Kill signal causes the notebook's GPU process to exit but not the notebook itself. This keeps the Interactive pod running and retrying to attach a GPU. This makes Interactive workloads with notebooks a great use case for burstable workloads and the Node Level Scheduler.
## Interactive Notebooks Use Case
Consider the following example of a node with 2 GPUs and 2 interactive pods that are submitted and want GPU resources.

The Scheduler instructs the node to put the two pods on a single GPU, bin-packing one GPU and leaving the other free for a workload that might want a full GPU or more than half a GPU. However, that would mean GPU#2 is idle while the two notebooks can only use up to half a GPU, even if they temporarily need more.

However, with the Node Level Scheduler enabled, the local decision will be to spread those two pods across the two GPUs, maximizing both pods' performance and the GPUs' utilization by letting each burst up to the full GPU memory and compute resources.

The Cluster Scheduler still sees a node with one fully free GPU.
When a third pod that requires a full GPU (or more than 0.5 GPU) is scheduled, the scheduler sends it to that node, and the Node Level Scheduler moves one of the Interactive workloads to run alongside the other pod on GPU#1, as was the Cluster Scheduler's initial plan.
2. To enable 'GPU Resource Optimization' for your tenant, go to your tenant's UI and press *Tools & Settings*, *General*, then open the *Resources* pane and toggle *Resource Optimization* to on.
3. To enable the 'Node Level Scheduler' on any of the Node Pools where you want to use this feature, go to the tenant's UI 'Node Pools' tab (under 'Nodes'), and either create a new Node-Pool or edit an existing Node-Pool. In the Node-Pool's form, under the 'Resource Utilization Optimization' tab, change 'Number of workloads on each GPU' to any value other than 'Not Enforced' (for example, 2, 3, 4, or 5).
The Node Level Scheduler is now ready to be used on that Node-Pool.
**docs/Researcher/scheduling/the-runai-scheduler.md** (+2 −1)

**Node pools are enabled**
Every new workload is associated with a Project. The Project contains a deserved GPU quota that is the sum of all node pools' GPU quotas. During scheduling:
* If the newly required resources, together with currently used resources, end up within the overall Project's quota and the requested node pool(s) quota, then the workload is ready to be scheduled as part of the guaranteed quota.
* If the newly required resources together with currently used resources end up above the Project's quota or the requested node pool(s) quota, the workload will only be scheduled if there are 'spare' GPU resources within the same node pool but not part of this Project. There are nuances in this flow that are meant to ensure that a Project does not end up with an over-quota made entirely of interactive workloads. For additional details see below.
The Run:ai scheduler wakes up periodically to perform allocation tasks on pending workloads:
* The scheduler then recalculates the next 'deprived' Project and continues with the same flow until it finishes attempting to schedule all workloads.
### Node Pools
A *Node Pool* is a set of nodes grouped by an Administrator into a distinct pool of resources that can be allocated to Projects and Departments.
By default, any node pool created in the system is automatically associated with all Projects and Departments with a zero-quota resource allocation (GPUs, CPUs, Memory). This allows any Project and Department to use any node pool with Over-Quota (for Preemptible workloads), thus maximizing system resource utilization.