Commit a017ff8

Merge pull request #915 from run-ai/fix-mkdocs-info-issues
fix-anchors
1 parent 5b30193 commit a017ff8

25 files changed: +36 -37 lines changed


docs/Researcher/scheduling/GPU-time-slicing-scheduler.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ Run:ai supports simultaneous submission of multiple workloads to a single GPU wh
 
 ## New Time-slicing scheduler by Run:ai
 
-To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#timeslicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.
+To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#time-slicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.
 
 Using the GPU runtime this way guarantees a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction.
 
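The changed paragraph above describes proportional compute sharing. As a quick illustration, here is a minimal sketch of two workloads splitting one GPU, assuming the `runai submit` CLI with its `-i` (image) and `-g` (GPU fraction) flags; job and image names are placeholders:

```
# Two workloads sharing one GPU under the time-slicing scheduler: each gets
# exclusive access to the full GPU for a lease proportional to its fraction
# in every plan cycle.
runai submit job-a -i ubuntu -g 0.5
runai submit job-b -i ubuntu -g 0.5
```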

docs/Researcher/scheduling/dynamic-gpu-fractions.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ The supported values depend on the label used. You can use them in either the UI
 ## Compute Resources UI with Dynamic Fractions support
 
 To enable the UI elements for Dynamic Fractions, press *Settings*, *General*, then open the *Resources* pane and toggle *GPU Resource Optimization*. This enables all the UI features related to *GPU Resource Optimization* for the whole tenant. There are other per cluster or per node-pool configurations that should be configured in order to use the capabilities of ‘GPU Resource Optimization’ See the documentation for each of these features.
-Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the [Metrics](../../admin/workloads/submitting-workloads.md#workloads-table) pane for each workload.
+Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the Metrics pane for each workload.
 
 ![GPU Limit](img/GPU-resource-limit-enabled.png)
 
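For cluster-side context, a sketch of a pod requesting a GPU fraction; the `gpu-fraction` and `gpu-memory` annotation keys follow the Run:ai fractions convention, but the exact keys and supported values are the ones listed in the page above, so treat every name here as an illustrative assumption:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: dynamic-fraction-demo
  namespace: runai-team-a          # Run:ai project namespace (placeholder)
  annotations:
    gpu-fraction: "0.5"            # requested GPU compute portion (assumed key)
    gpu-memory: "4000M"            # requested GPU memory (assumed key)
spec:
  schedulerName: runai-scheduler   # hand the pod to the Run:ai scheduler
  containers:
    - name: main
      image: ubuntu
      command: ["sleep", "infinity"]
EOF
```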

docs/admin/performance/dashboard-analysis.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ These dashboards give system administrators the ability to drill down to see det
 
 There are 5 dashboards:
 
-* [**GPU/CPU Overview**](#gpucpu-overview-dashboard) dashboard—Provides information about what is happening right now in the cluster.
+* [**GPU/CPU Overview**](#gpucpu-overview-dashboard-new-and-legacy) dashboard—Provides information about what is happening right now in the cluster.
 * [**Quota Management**](#quota-management-dashboard) dashboard—Provides information about quota utilization.
 * [**Analytics**](#analytics-dashboard) dashboard—Provides long term analysis of cluster behavior.
 * [**Multi-Cluster Overview**](#multi-cluster-overview-dashboard) dashboard—Provides a more holistic, multi-cluster view of what is happening right now. The dashboard is intended for organizations that have more than one connected cluster.

docs/admin/runai-setup/cluster-setup/cluster-install.md

Lines changed: 2 additions & 2 deletions
@@ -33,7 +33,7 @@ On the next page:
 ## Verify your cluster's health
 
 * Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
-* Go to the [Overview Dashboard](../../performance/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
+* Go to the [Overview Dashboard](../../performance/dashboard-analysis.md#gpucpu-overview-dashboard-new-and-legacy) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
 * In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).
 
 ## Researcher Authentication
@@ -69,7 +69,7 @@ The following table describes the different statuses that a cluster could be in.
 | Service issues | At least one of the *Services* is not working properly. You can view the list of nonfunctioning services for more information |
 | Connected | All services are connected and up and running. |
 
-See the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md#verifying-cluster-health) to help troubleshoot issues in the cluster.
+See the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md) to help troubleshoot issues in the cluster.
 
 ## Customize your installation
 
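A CLI-side complement to the dashboard checks in the hunk above; a sketch assuming the Run:ai cluster components run in the default `runai` namespace:

```
# All Run:ai cluster pods should be Running, and GPU nodes should be Ready.
kubectl get pods -n runai
kubectl get nodes -o wide
```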

docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ For information on supported versions of managed Kubernetes, it's important to c
 For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release History](https://kubernetes.io/releases/){target=_blank}.
 
 !!! Note
-    Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#new-pvc-stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.
+    Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#--new-pvc--stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.
 
 #### Pod Security Admission
 
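The note above concerns storage classes that defer volume binding until a consuming pod is scheduled. A minimal sketch of such a StorageClass in standard Kubernetes; the provisioner is just an example:

```
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: deferred-binding
provisioner: ebs.csi.aws.com              # example CSI provisioner
volumeBindingMode: WaitForFirstConsumer   # binding waits for the first consumer pod
EOF
```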

docs/admin/runai-setup/cluster-setup/cluster-upgrade.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ The process:
 
 ## Verify Successful Installation
 
-See [Verify your installation](cluster-install.md#verify-your-installation) on how to verify a Run:ai cluster installation
+See [Verify your installation](cluster-install.md#verify-your-clusters-health) on how to verify a Run:ai cluster installation
 
 
 

docs/admin/runai-setup/config/dr.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Run:ai stores metric history using [Thanos](https://github.com/thanos-io/thanos)
 
 ### Backing up Control-Plane Configuration
 
-The installation of the Run:ai control plane can be [configured](../self-hosted/k8s/backend.md#optional-additional-configurations). The configuration is provided as `--set` command in the helm installation. These changes will be preserved on upgrade, but will not be preserved on uninstall or on damage to Kubernetes. Thus, it is best to back up these customizations. For a list of customizations used during the installation, run:
+The installation of the Run:ai control plane can be [configured](../self-hosted/k8s/backend.md#additional-runai-configurations-optional). The configuration is provided as `--set` command in the helm installation. These changes will be preserved on upgrade, but will not be preserved on uninstall or upon damage to Kubernetes. Thus, it is best to back up these customizations. For a list of customizations used during the installation, run:
 
 `helm get values runai-backend -n runai-backend`
 
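A sketch of turning that command into a backup and restore cycle; the chart reference in the restore step is a placeholder, since the exact chart comes from the self-hosted installation guide:

```
# Save the installation customizations to a file that can be backed up.
helm get values runai-backend -n runai-backend > runai-backend-values.yaml

# On reinstall, feed the saved values back in (chart reference is a placeholder):
# helm upgrade --install runai-backend <control-plane-chart> \
#   -n runai-backend -f runai-backend-values.yaml
```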

docs/admin/runai-setup/config/ha.md

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ A different scenario is a high transaction load, leading to system overload. To
 
 ### Run:ai system workers
 
-The Run:ai control plane allows the **optional** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#optional-mark-runai-system-workers). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below will not span multiple nodes, and the system will remain with a single point of failure.
+The Run:ai control plane allows the **optional** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#mark-runai-system-workers-optional). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below will not span multiple nodes, and the system will remain with a single point of failure.
 
 ### Horizontal Scalability of Run:ai services
 
@@ -40,7 +40,7 @@ Run:ai uses three third parties which are managed as Kubernetes StatefulSets:
 
 ### Run:ai system workers
 
-The Run:ai cluster allows the **mandatory** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#optional-mark-runai-system-workers). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below may not span multiple nodes, and the system will remain with a single point of failure.
+The Run:ai cluster allows the **mandatory** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#mark-runai-system-workers-optional). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below may not span multiple nodes, and the system will remain with a single point of failure.
 
 ### Prometheus
 
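Marking system workers comes down to labeling nodes. A sketch assuming the `node-role.kubernetes.io/runai-system` label key from the preparations guide linked above; node names are placeholders:

```
# Label at least two nodes so Run:ai system services can spread across them
# and avoid a single point of failure.
kubectl label node <node-1> node-role.kubernetes.io/runai-system=true
kubectl label node <node-2> node-role.kubernetes.io/runai-system=true
```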

docs/admin/runai-setup/config/org-cert.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ kubectl -n runai-backend create secret generic runai-ca-cert \
   --from-file=runai-ca.pem=<ca_bundle_path>
 ```
 
-* As part of the installation instructions you need to create a secret for [runai-backend-tls](../self-hosted/k8s/backend.md#domain-certificate). Use the local certificate authority instead.
+* As part of the installation instructions, you need to create a secret for [runai-backend-tls](../self-hosted/k8s/preparations.md#domain-certificate). Use the local certificate authority instead.
 * Install the control plane, add the following flag to the helm command `--set global.customCA.enabled=true`
 
 ## Cluster Installation
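For the `runai-backend-tls` secret referenced above, a sketch using the standard `kubectl create secret tls` form, with paths as placeholders in the same style as `<ca_bundle_path>`:

```
kubectl -n runai-backend create secret tls runai-backend-tls \
  --cert=<certificate_path> \
  --key=<private_key_path>
```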

docs/admin/runai-setup/maintenance/node-downtime.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ kubectl taint nodes <node-name> runai=drain:NoExecute-
 kubectl delete node <node-name>
 ```
 
-However, if you plan to bring back the node, you will need to rejoin the node into the cluster. See [Rejoin](#Rejoin-a-Node-into-the-Kubernetes-Cluster).
+However, if you plan to bring back the node, you will need to rejoin the node into the cluster. See [Rejoin](#rejoin-a-node-into-the-kubernetes-cluster).
 
 
 
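As a sketch of the rejoin step referenced above, assuming a kubeadm-managed cluster (managed Kubernetes offerings handle node replacement through their own node-pool tooling):

```
# On a control-plane node: mint a fresh token and print the join command.
kubeadm token create --print-join-command

# Then run the printed `kubeadm join ...` command on the returning node.
```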