diff --git a/docs/Researcher/scheduling/GPU-time-slicing-scheduler.md b/docs/Researcher/scheduling/GPU-time-slicing-scheduler.md
index 371a938646..d12a3fe72b 100644
--- a/docs/Researcher/scheduling/GPU-time-slicing-scheduler.md
+++ b/docs/Researcher/scheduling/GPU-time-slicing-scheduler.md
@@ -11,7 +11,7 @@ Run:ai supports simultaneous submission of multiple workloads to a single GPU wh
 
 ## New Time-slicing scheduler by Run:ai
 
-To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#time-slicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.
+To provide customers with predictable and accurate GPU compute resource scheduling, Run:ai is introducing a new feature called the Time-slicing GPU scheduler, which adds **fractional compute** capabilities on top of the existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator, which does not provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#time-slicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#time-slicing-plan-and-lease-times)). This cycle repeats for the lifetime of the workload.
 
 Using the GPU runtime this way guarantees a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction.
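The plan/lease cycle described in the added paragraph can be sketched as a small simulation. This is an illustration of the documented idea only, not Run:ai's implementation; the workload names, the 100 ms plan time, and the normalization step are assumptions:

```python
# Illustrative sketch only -- not Run:ai's actual scheduler.
# Models the documented idea: within each scheduling cycle ("plan time"),
# every workload gets exclusive GPU access for a "lease time" proportional
# to its requested GPU compute fraction, and the cycle repeats.

from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    gpu_fraction: float  # requested share of GPU compute, 0 < f <= 1


def plan_leases(workloads, plan_time_ms=100):
    """Split one scheduling cycle into exclusive leases.

    Each workload's lease is proportional to its requested fraction;
    fractions are normalized so the full cycle is always used.
    """
    total = sum(w.gpu_fraction for w in workloads)
    return [(w.name, plan_time_ms * w.gpu_fraction / total) for w in workloads]


def run_cycles(workloads, cycles, plan_time_ms=100):
    """Repeat the plan for several cycles and total each workload's GPU time."""
    usage = {w.name: 0.0 for w in workloads}
    for _ in range(cycles):
        for name, lease_ms in plan_leases(workloads, plan_time_ms):
            usage[name] += lease_ms  # workload holds the whole GPU for lease_ms
    return usage


if __name__ == "__main__":
    jobs = [Workload("train-a", 0.5), Workload("train-b", 0.25), Workload("infer-c", 0.25)]
    print(run_cycles(jobs, cycles=10))  # train-a accumulates twice the GPU time of the others
```

With fractions 0.5/0.25/0.25 and a 100 ms plan, each cycle grants exclusive leases of 50/25/25 ms, so over the workload's lifetime its share of GPU time matches its requested fraction.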
diff --git a/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md b/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
index 58b5de1852..c4781c6560 100644
--- a/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
+++ b/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
@@ -69,7 +69,7 @@ For information on supported versions of managed Kubernetes, it's important to c
 For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release History](https://kubernetes.io/releases/){target=_blank}.
 
 !!! Note
-    Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#--new-pvc--stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.
+    Run:ai allows scheduling of Jobs with PVCs. See, for example, the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#-new-pvc-stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equal to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.
 
 #### Pod Security Admission
diff --git a/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md b/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md
index 94a8f58d77..64669322e6 100644
--- a/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md
+++ b/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md
@@ -37,7 +37,7 @@ All customizations will be saved when upgrading the cluster to a future version.
 | `spec.researcherService.route.tlsSecret` | | On OpenShift, set a dedicated certificate for the researcher service route. When not set, the OpenShift certificate will be used. The value should be a Kubernetes secret in the runai namespace |
 | `global.image.registry` | | In air-gapped environment, allow cluster images to be pulled from private docker registry. For more information see [self-hosted cluster installation](../self-hosted/k8s/cluster.md#install-cluster) |
 | `global.additionalImagePullSecrets` | [] | Defines a list of secrets to be used to pull images from a private docker registry |
-| `global.nodeAffinity.restrictScheduling` | false | Restrict scheduling of workloads to specific nodes, based on node labels. For more information see [node roles](../config/node-roles.md#dedicated-gpu--cpu-nodes) |
+| `global.nodeAffinity.restrictScheduling` | false | Restrict scheduling of workloads to specific nodes, based on node labels. For more information see [node roles](../config/node-roles.md#dedicated-gpu-and-cpu-nodes) |
 | `spec.prometheus.spec.retention` | 2h | The interval of time for which Prometheus will save Run:ai metrics. Prometheus is only used as an intermediary to another metrics storage facility and metrics are typically moved within tens of seconds, so changing this setting is mostly for debugging purposes. |
 | `spec.prometheus.spec.retentionSize` | Not set | The amount of storage allocated for metrics by Prometheus. For more information see [Prometheus Storage](https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects){target=_blank}. |
 | `spec.prometheus.spec.imagePullSecrets` | Not set | An optional list of references to secrets in the runai namespace to use for pulling Prometheus images (relevant for air-gapped installations). |
diff --git a/docs/admin/runai-setup/config/node-roles.md b/docs/admin/runai-setup/config/node-roles.md
index 90e6599d34..2b7a4a2e25 100644
--- a/docs/admin/runai-setup/config/node-roles.md
+++ b/docs/admin/runai-setup/config/node-roles.md
@@ -27,7 +27,7 @@ runai-adm remove node-role --runai-system-worker
 
 !!! Warning
     Do not select the Kubernetes master as a runai-system node. This may cause Kubernetes to stop working (specifically if Kubernetes API Server is configured on 443 instead of the default 6443).
 
-## Dedicated GPU & CPU Nodes
+## Dedicated GPU and CPU Nodes
 
 !!! Important
diff --git a/docs/home/whats-new-2-15.md b/docs/home/whats-new-2-15.md
index 4c823b01dc..7155c16ff2 100644
--- a/docs/home/whats-new-2-15.md
+++ b/docs/home/whats-new-2-15.md
@@ -38,7 +38,7 @@ date: 2023-Dec-3
 
 * Improved support for Kubeflow Notebooks. Run:ai now supports the scheduling of Kubeflow notebooks with fractional GPUs. Kubeflow notebooks are identified automatically and appear with a dedicated icon in the *Jobs* UI.
 * Improved the *Trainings* and *Workspaces* forms. Now the runtime field for *Command* and *Arguments* can be edited directly in the new *Workspace* or *Training* creation form.
-* Added new functionality to the Run:ai CLI that allows submitting a workload with multiple service types at the same time in a CSV style format. Both the CLI and the UI now offer the same functionality. For more information, see [runai submit](../Researcher/cli-reference/runai-submit.md#-s----service-type-string).
+* Added new functionality to the Run:ai CLI that allows submitting a workload with multiple service types at the same time in a CSV-style format. Both the CLI and the UI now offer the same functionality. For more information, see [runai submit](../Researcher/cli-reference/runai-submit.md#-s-service-type-string).
 * Improved functionality in the `runai submit` command so that the port for the container is specified using the `nodeport` flag. For more information, see `runai submit` [--service-type](../Researcher/cli-reference/runai-submit.md#-s-service-type-string) `nodeport`.
 
 #### Credentials
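The storage-class caveat in the PVC note earlier in this diff can be made concrete with a minimal manifest. This is an illustrative sketch only: the metadata name and the provisioner are assumptions; `volumeBindingMode: WaitForFirstConsumer` is the one property the note refers to.

```yaml
# Hypothetical StorageClass for illustration (name and provisioner assumed).
# With WaitForFirstConsumer, volume binding is deferred until a pod that
# consumes the PVC is scheduled -- the behavior the note says is not
# supported on Kubernetes 1.23 or lower when submitting Jobs via Run:ai.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-wait-for-consumer
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```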