achor-fixes #919

Merged
merged 1 commit into from
Aug 7, 2024
2 changes: 1 addition & 1 deletion docs/Researcher/scheduling/GPU-time-slicing-scheduler.md
@@ -11,7 +11,7 @@ Run:ai supports simultaneous submission of multiple workloads to a single GPU wh

## New Time-slicing scheduler by Run:ai

-To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#time-slicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.
+To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#time-slicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#time-slicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.

Using the GPU runtime this way guarantees a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction.
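
As a rough illustration of the proportional guarantee described in this hunk, the sketch below computes a lease time per workload from its requested GPU fraction. The formula (lease time = fraction × plan time), the function name `lease_time_ms`, and the 1000 ms plan time are assumptions made for illustration only; they are not taken from the Run:ai documentation or implementation.

```python
def lease_time_ms(gpu_fraction: float, plan_time_ms: float) -> float:
    """Return a hypothetical lease time for one workload in one plan cycle.

    Assumes the lease is simply the requested fraction of the plan time,
    which is one reading of "proportionally to its requested GPU fraction".
    """
    if not 0.0 < gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in the range (0, 1]")
    if plan_time_ms <= 0:
        raise ValueError("plan_time_ms must be positive")
    return gpu_fraction * plan_time_ms


if __name__ == "__main__":
    plan_ms = 1000.0  # hypothetical plan time, chosen only for the example
    requested = {"workload-a": 0.5, "workload-b": 0.25, "workload-c": 0.25}
    for name, fraction in requested.items():
        # Each workload gets exclusive GPU access for its lease in every cycle.
        print(f"{name}: {lease_time_ms(fraction, plan_ms)} ms per cycle")
```

Under these assumptions, three workloads requesting 0.5, 0.25, and 0.25 of a GPU would hold it for 500 ms, 250 ms, and 250 ms of each 1000 ms cycle.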

@@ -69,7 +69,7 @@ For information on supported versions of managed Kubernetes, it's important to c
For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release History](https://kubernetes.io/releases/){target=_blank}.

!!! Note
-Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#--new-pvc--stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.
+Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#-new-pvc-stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.

#### Pod Security Admission

@@ -37,7 +37,7 @@ All customizations will be saved when upgrading the cluster to a future version.
| `spec.researcherService.route.tlsSecret` | | On OpenShift, set a dedicated certificate for the researcher service route. When not set, the OpenShift certificate will be used. The value should be a Kubernetes secret in the runai namespace |
| `global.image.registry` | | In air-gapped environment, allow cluster images to be pulled from private docker registry. For more information see [self-hosted cluster installation](../self-hosted/k8s/cluster.md#install-cluster) |
| `global.additionalImagePullSecrets` | [] | Defines a list of secrets to be used to pull images from a private docker registry |
-| `global.nodeAffinity.restrictScheduling` | false | Restrict scheduling of workloads to specific nodes, based on node labels. For more information see [node roles](../config/node-roles.md#dedicated-gpu--cpu-nodes) |
+| `global.nodeAffinity.restrictScheduling` | false | Restrict scheduling of workloads to specific nodes, based on node labels. For more information see [node roles](../config/node-roles.md#dedicated-gpu-and-cpu-nodes) |
| `spec.prometheus.spec.retention` | 2h | The interval of time where Prometheus will save Run:ai metrics. Promethues is only used as an intermediary to another metrics storage facility and metrics are typically moved within tens of seconds, so changing this setting is mostly for debugging purposes. |
| `spec.prometheus.spec.retentionSize` | Not set | The amount of storage allocated for metrics by Prometheus. For more information see [Prometheus Storage](https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects){target=_blank}. |
| `spec.prometheus.spec.imagePullSecrets` | Not set | An optional list of references to secrets in the runai namespace to use for pulling Prometheus images (relevant for air-gapped installations). |
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/config/node-roles.md
@@ -27,7 +27,7 @@ runai-adm remove node-role --runai-system-worker <node-name>
!!! Warning
Do not select the Kubernetes master as a runai-system node. This may cause Kubernetes to stop working (specifically if Kubernetes API Server is configured on 443 instead of the default 6443).

-## Dedicated GPU & CPU Nodes
+## Dedicated GPU and CPU Nodes


!!! Important
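
The heading rename in this hunk matters because link anchors are derived from heading text. The sketch below shows one common slug scheme (drop punctuation, lowercase, collapse whitespace and hyphens into a single hyphen). The helper `slugify` is hypothetical and only modeled on how many Markdown renderers behave; the generator actually used by the documentation site may differ.

```python
import re
import unicodedata


def slugify(heading: str, separator: str = "-") -> str:
    """Derive a link anchor from heading text (illustrative scheme only)."""
    # Reduce to plain ASCII so accented characters do not end up in the anchor.
    ascii_text = (
        unicodedata.normalize("NFKD", heading)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop everything except word characters, whitespace, and hyphens.
    cleaned = re.sub(r"[^\w\s-]", "", ascii_text).strip().lower()
    # Collapse each run of whitespace and hyphens into one separator.
    return re.sub(r"[\s-]+", separator, cleaned)


if __name__ == "__main__":
    for heading in ("Dedicated GPU & CPU Nodes", "Dedicated GPU and CPU Nodes"):
        print(f"{heading!r} -> #{slugify(heading)}")
```

Under this scheme the old heading yields `#dedicated-gpu-cpu-nodes`, which does not match the previously linked `#dedicated-gpu--cpu-nodes`, while the renamed heading yields `#dedicated-gpu-and-cpu-nodes`, the anchor the updated link targets. Spelling out "and" avoids depending on how a given renderer treats `&`.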
2 changes: 1 addition & 1 deletion docs/home/whats-new-2-15.md
@@ -38,7 +38,7 @@ date: 2023-Dec-3

* <!-- RUN-10639/RUN-11389 - Researcher Service Refactoring RUN-12505/RUN-12506 - Support Kubeflow notebooks for scheduling/orchestration -->Improved support for Kubeflow Notebooks. Run:ai now supports the scheduling of Kubeflow notebooks with fractional GPUs. Kubeflow notebooks are identified automatically and appear with a dedicated icon in the *Jobs* UI.
* <!-- RUN-11292/RUN-11592 General changes in favor of any asset based workload \(WS, training, DT\)-->Improved the *Trainings* and *Workspaces* forms. Now the runtime field for *Command* and *Arguments* can be edited directly in the new *Workspace* or *Training* creation form.
-* <!-- RUN-10235/RUN-10485 Support multi service types in the CLI submission -->Added new functionality to the Run:ai CLI that allows submitting a workload with multiple service types at the same time in a CSV style format. Both the CLI and the UI now offer the same functionality. For more information, see [runai submit](../Researcher/cli-reference/runai-submit.md#-s----service-type-string).
+* <!-- RUN-10235/RUN-10485 Support multi service types in the CLI submission -->Added new functionality to the Run:ai CLI that allows submitting a workload with multiple service types at the same time in a CSV style format. Both the CLI and the UI now offer the same functionality. For more information, see [runai submit](../Researcher/cli-reference/runai-submit.md#-s-service-type-string).
* <!-- RUN-10335/RUN-10510 Node port command line -->Improved functionality in the `runai submit` command so that the port for the container is specified using the `nodeport` flag. For more information, see `runai submit` [--service-type](../Researcher/cli-reference/runai-submit.md#-s-service-type-string) `nodeport`.

#### Credentials