Commit 95d96f6

RUN-12616 add Hagay's changes
1 parent 262a30e commit 95d96f6

1 file changed

+17
-12
lines changed

docs/Researcher/scheduling/gpu-memory-swap.md

Lines changed: 17 additions & 12 deletions
@@ -13,37 +13,37 @@ To ensure efficient and effective usage of an organization’s resources, Run:ai
1313

1414
Run:ai’s *GPU memory swap* feature helps administrators and AI practitioners to further increase the utilization of existing GPU hardware by improving GPU sharing between AI initiatives and stakeholders. This is done by expanding the GPU physical memory to the CPU memory, which is typically an order of magnitude larger than that of the GPU.
1515

16-
Expending the GPU physical memory, helps the Run:ai system to put more workloads on the same GPU physical hardware, and to provide a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide, as long as each single workload requires no more than the size of the GPU physical memory.
16+
Expanding the GPU physical memory helps the Run:ai system to put more workloads on the same GPU physical hardware, and to provide smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.
1717

1818
## Benefits of GPU memory swap
1919

2020
There are several use cases where GPU memory swap can benefit and improve the user experience and the system's overall utilization:
2121

2222
### Sharing a GPU between multiple interactive workloads (notebooks)
2323

24-
AI practitioners use notebooks to develop and test new AI models and to improve existing AI models. While developing or testing an AI model, notebooks use GPU resources intermittently, but the required resources of the GPU’s are pre-allocated by the notebook and cannot be used by other workloads after one notebook has already reserved them. To overcome this inefficiency, Run:ai introduced *Dynamic Fractions* and *Node Level Scheduler*.
24+
AI practitioners use notebooks to develop and test new AI models and to improve existing AI models. While developing or testing an AI model, notebooks use GPU resources intermittently, yet the required GPU resources are pre-allocated by the notebook and cannot be used by other workloads after one notebook has already reserved them. To overcome this inefficiency, Run:ai introduced *Dynamic Fractions* and *Node Level Scheduler*.
2525

2626
When one or more workloads require more than their requested GPU resources, there’s a high probability not all workloads can run on a single GPU because the total memory required is larger than the physical size of the GPU memory.
2727

2828
With *GPU memory swap*, several workloads can run on the same GPU, even if the sum of their used memory is larger than the size of the physical GPU memory. *GPU memory swap* can swap workloads in and out interchangeably, allowing multiple workloads to each use the full amount of GPU memory. The most common scenario is for one workload to run on the GPU (for example, an interactive notebook), while other notebooks are either idle or using the CPU to develop new code (while not using the GPU). From a user experience point of view, the swap in and out is a smooth process since the notebooks do not notice that they are being swapped in and out of the GPU memory. On rare occasions, when multiple notebooks need to access the GPU simultaneously, slower workload execution may be experienced.
2929

30-
The assumption is that notebooks only use the GPU intermittently, therefore with high probability, only one workload (for example, an interactive notebook), will use the GPU at a time. The more notebooks the system puts on a single GPU, the higher the chances are that there will be more than one notebook requiring the GPU resources at the same time. Admins have a significant role here in fine tuning the amount of notebooks running on the same GPU, based on specific use patterns and required SLAs.
30+
Customers’ experience indicates that notebooks use the GPU only intermittently, so with high probability only one workload (for example, an interactive notebook) will use the GPU at a time. The more notebooks the system puts on a single GPU, the higher the chance that more than one notebook will require the GPU resources at the same time. Admins have a significant role here in fine-tuning the number of notebooks running on the same GPU, based on specific use patterns and required SLAs. Using *Node Level Scheduler* reduces GPU access contention between different interactive notebooks running on the same node.
3131

32-
### Sharing a GPU between "frontend" interactive workloads and "background" training workloads
32+
### Sharing a GPU between inference/interactive workloads and training workloads
3333

34-
A single GPU can be shared between an interactive frontend workload (for example, a Jupyter notebook, image recognition services or an LLM service), and a backend training process that is not time sensitive or delay sensitive. At times when the inference/interactive workload uses the GPU, both training and inference/interactive workloads share the GPU resources, each running part of the time swapped-in to the GPU memory, and swapped-out into the CPU memory the rest of the time.
34+
A single GPU can be shared between an interactive or inference workload (for example, a Jupyter notebook, image recognition services, or an LLM service), and a training workload that is not time sensitive or delay sensitive. At times when the inference/interactive workload uses the GPU, both training and inference/interactive workloads share the GPU resources, each running part of the time swapped-in to the GPU memory, and swapped-out into the CPU memory the rest of the time.
3535

3636
Whenever the inference/interactive workload stops using the GPU, the swap mechanism swaps out the inference/interactive workload GPU data to the CPU memory. In terms of Kubernetes, the pod is still alive and running using the CPU. This allows the training workload to run faster when the inference/interactive workload is not using the GPU, and slower when it is, thus sharing the same resource between multiple workloads, fully utilizing the GPU at all times, and maintaining uninterrupted service for both workloads.
3737

3838
### Serving inference warm models with GPU memory swap
3939

40-
Running multiple inference models is a demanding task and need you will need to ensure that your SLA is met. You need to provide high performance and low latency, while maximizing GPU utilization. This becomes even more challenging when the exact model usage patterns are unpredictable. You must plan for the agility of inference services and strive to keep models on standby in a ready state rather than a idle state.
40+
Running multiple inference models is a demanding task and you will need to ensure that your SLA is met. You need to provide high performance and low latency, while maximizing GPU utilization. This becomes even more challenging when the exact model usage patterns are unpredictable. You must plan for the agility of inference services and strive to keep models on standby in a ready state rather than an idle state.
4141

42-
Run:ai’s *GPU memory swap* feature enables you to load multiple models to a single GPU, where each can use up to the full amount GPU memory. Using a load balancer, the administrator can control to which server each inference request is sent. Then the GPU can be loaded with multiple models, where the model in use is loaded into the GPU memory and the rest of the models are swapped-out to the CPU memory. The swapped models are stored as ready models to be loaded when required. *GPU memory swap* always maintains the context of the workload on the GPU so it can easily and quickly switch between models, unlike idle models that must be loaded completely from scratch.
42+
Run:ai’s *GPU memory swap* feature enables you to load multiple models to a single GPU, where each can use up to the full amount of GPU memory. Using an application load balancer, the administrator can control to which server each inference request is sent. Then the GPU can be loaded with multiple models, where the model in use is loaded into the GPU memory and the rest of the models are swapped-out to the CPU memory. The swapped models are stored as ready models to be loaded when required. *GPU memory swap* always maintains the context of the workload (model) on the GPU so it can easily and quickly switch between models, unlike industry-standard model servers that load models completely from scratch into the GPU when required.
4343

4444
## Configuring memory swap
4545

46-
**Perquisites**—before configuring the *GPU Memory Swap* the admin must configure the *Dynamic Fractions* feature, and optionally configure the *Node Level Scheduler* feature. Both these configurations are designed to maximize performance within a single node.
46+
**Prerequisites**: before configuring *GPU memory swap*, the administrator must configure the *Dynamic Fractions* feature, and optionally configure the *Node Level Scheduler* feature. The first enables you to make your workloads burstable, and both features will maximize your workloads’ performance and GPU utilization within a single node.
4747

4848
To enable *GPU memory swap* in a Run:ai cluster, the administrator must update the `runaiconfig` file with the following parameters:
4949

@@ -67,9 +67,9 @@ kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec"
6767

6868
To make a workload swappable, a number of conditions must be met:
6969

70-
1. The workload MUST use Dynamic Fractions. This means the workload’s memory request is less than a full GPU, but it may add a GPU memory limit to allow the workload to effectively use the full GPU memory. If regular fractions are used instead of Dynamic Fractions is NOT used but regular fraction is (for that workload), the swap logic assumes this workload prefers NOT to be swapped-out and therefore, all other workloads on the same GPU are NOT swapped either.
70+
1. The workload MUST use Dynamic Fractions. This means the workload’s memory request is less than a full GPU, but it may add a GPU memory limit to allow the workload to effectively use the full GPU memory.
7171

72-
2. The administrator must label each node that they want to provide GPU memory swap with a `run.ai/swap-enabled=true` this enables the feature on that node. Enabling the feature creates a local swap file in the CPU to serve the swapped memory from all GPUs on that node. The administrator sets the size of the CPU swap file as a value in the `runaiconfig` file.
72+
2. The administrator must label each node that they want to provide GPU memory swap with `run.ai/swap-enabled=true`; this enables the feature on that node. Enabling the feature reserves CPU memory to serve the swapped GPU memory from all GPUs on that node. The administrator sets the size of the reserved CPU memory using the runaiconfigs file.
7373

7474
3. Optionally configure *Node Level Scheduler*. Using node level scheduler can help in the following ways:
7575

@@ -78,7 +78,7 @@ To make a workload swappable, a number of conditions must be met:
7878

7979
### Configure `system reserved` GPU Resources
8080

81-
Swappable workloads require reserving a small part of the GPU for non-swappable allocations like binaries and GPU context. To avoid getting out-of-memory (OOM) errors due to non-swappable memory regions, the system reserves a 2GiB of GPU RAM memory by default, effectively truncating the total size of the GPU. For example, a 16GiB T4 will appear as 14GiB on a swap-enabled node.
81+
Swappable workloads require reserving a small part of the GPU for non-swappable allocations like binaries and GPU context. To avoid getting out-of-memory (OOM) errors due to non-swappable memory regions, the system reserves 2GiB of GPU memory by default, effectively truncating the total size of the GPU memory. For example, a 16GiB T4 will appear as 14GiB on a swap-enabled node.
8282
The exact reserved size is application-dependent, and 2GiB is a safe assumption for 2-3 applications sharing and swapping on a GPU.
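
The arithmetic behind the T4 example can be sketched in a few lines. This is only an illustration of the subtraction described above; the function name `visible_gpu_memory_gib` and its parameters are hypothetical and are not part of any Run:ai API.

```python
def visible_gpu_memory_gib(physical_gib: float, reserved_gib: float = 2.0) -> float:
    """Return the GPU memory left for workloads after the system reservation.

    Hypothetical helper for illustration only; the 2 GiB default mirrors
    the default reservation described in the text.
    """
    if reserved_gib < 0 or reserved_gib >= physical_gib:
        raise ValueError("reservation must be non-negative and smaller than the GPU")
    return physical_gib - reserved_gib


# The T4 example from the text: 16 GiB physical memory, default reservation,
# which the text says appears as 14 GiB on a swap-enabled node.
print(visible_gpu_memory_gib(16))
```

A larger reservation shrinks the memory visible to workloads by the same amount, which is the trade-off to weigh when more applications share and swap on one GPU.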
8383
This value can be changed by editing the `runaiconfig` specification as follows:
8484

@@ -101,7 +101,12 @@ This configuration is in addition to the *Dynamic Fractions* configuration, and
101101

102102
## Preventing your workloads from getting swapped
103103

104-
If you prefer your workloads not to be swapped into CPU memory, you can specify an anti-affinity to `run.ai/swap-enabled=true` node label when submitting your workloads and the Scheduler will ensure not to use swap-enabled nodes.
104+
If you prefer your workloads not to be swapped into CPU memory, you can specify an anti-affinity on the pod to the `run.ai/swap-enabled=true` node label when submitting your workloads, and the Scheduler will avoid swap-enabled nodes.
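
As a rough sketch, such an anti-affinity could be expressed with the standard Kubernetes node-affinity fields, built here as a Python dictionary and printed as JSON (which is also valid YAML). The helper name `swap_anti_affinity` is hypothetical, and whether Run:ai expects exactly this form (a required `NotIn` rule) is an assumption rather than something the source states.

```python
import json


def swap_anti_affinity(label_key: str = "run.ai/swap-enabled",
                       label_value: str = "true") -> dict:
    """Build a pod `affinity` fragment that keeps the pod off labeled nodes.

    Hypothetical helper: uses the standard Kubernetes nodeAffinity schema.
    The `NotIn` operator also matches nodes that do not carry the label.
    """
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "NotIn",
                                "values": [label_value],
                            }
                        ]
                    }
                ]
            }
        }
    }


# Fragment to merge under `spec` in the pod manifest.
pod_spec_fragment = {"affinity": swap_anti_affinity()}
print(json.dumps(pod_spec_fragment, indent=2))
```

A required rule is used here so the pod is never placed on a swap-enabled node; a preferred rule would only bias scheduling away from such nodes.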
105+
106+
## Known Limitations
107+
108+
* GPU memory swap cannot be enabled if fairshare time-slicing or strict time-slicing is used; it can only be used with the default time-slicing mechanism.
109+
* CPU RAM size cannot be decreased once GPU memory swap is enabled.
105110

106111
## What happens when CPU SWAP file is exhausted?
107112
