fix-anchors #915


Merged
merged 1 commit on Aug 6, 2024
2 changes: 1 addition & 1 deletion docs/Researcher/scheduling/GPU-time-slicing-scheduler.md
@@ -11,7 +11,7 @@ Run:ai supports simultaneous submission of multiple workloads to a single GPU wh

## New Time-slicing scheduler by Run:ai

To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#timeslicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.
To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#time-slicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.

Using the GPU runtime this way guarantees a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction.
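The plan/lease mechanics quoted above can be sketched as follows. This is an illustrative model only, not Run:ai code; the function name, workload names, and the 250 ms plan time are assumptions for the example:

```python
# Illustrative sketch of a time-slicing plan: one scheduling cycle ("plan
# time") is divided into exclusive "leases", each proportional to the
# workload's requested GPU fraction. Hypothetical names and values.

def plan_leases(workloads, plan_time_ms=250):
    """Return (name, lease_ms) pairs for one plan cycle.

    Each workload gets exclusive GPU time proportional to its requested
    fraction; the cycle then repeats for the workload's lifetime.
    """
    total = sum(frac for _, frac in workloads)
    return [(name, plan_time_ms * frac / total) for name, frac in workloads]

# Example: two trainers and an inference job sharing one GPU.
for name, ms in plan_leases([("train-a", 0.5), ("train-b", 0.25), ("infer-c", 0.25)]):
    print(f"{name}: {ms:.0f} ms per cycle")
```

Note how the leases always sum to the plan time, which is what makes the compute guarantee proportional rather than best-effort.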

2 changes: 1 addition & 1 deletion docs/Researcher/scheduling/dynamic-gpu-fractions.md
@@ -71,7 +71,7 @@ The supported values depend on the label used. You can use them in either the UI
## Compute Resources UI with Dynamic Fractions support

To enable the UI elements for Dynamic Fractions, press *Settings*, *General*, then open the *Resources* pane and toggle *GPU Resource Optimization*. This enables all the UI features related to *GPU Resource Optimization* for the whole tenant. There are other per cluster or per node-pool configurations that should be configured in order to use the capabilities of ‘GPU Resource Optimization’ See the documentation for each of these features.
Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the [Metrics](../../admin/workloads/submitting-workloads.md#workloads-table) pane for each workload.
Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the Metrics pane for each workload.

![GPU Limit](img/GPU-resource-limit-enabled.png)
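The Request vs. Limit semantics described in this hunk can be sketched as follows. A hypothetical model, not Run:ai code; the function and parameter names are assumptions:

```python
# Illustrative model of dynamic GPU fractions: a workload is guaranteed its
# requested fraction and may burst up to its limit when spare capacity
# exists on the GPU. Hypothetical names; not Run:ai's implementation.

def allotted_fraction(request: float, limit: float, spare: float) -> float:
    """GPU fraction a workload can use right now: its guaranteed request,
    plus any spare capacity, capped at its configured limit."""
    if not (0.0 <= request <= limit <= 1.0):
        raise ValueError("expect 0 <= request <= limit <= 1")
    return min(limit, request + max(0.0, spare))

# With 40% of the GPU idle, a 0.25-request/0.5-limit workload bursts to its limit.
print(allotted_fraction(0.25, 0.5, spare=0.4))
```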

2 changes: 1 addition & 1 deletion docs/admin/performance/dashboard-analysis.md
@@ -25,7 +25,7 @@ These dashboards give system administrators the ability to drill down to see det

There are 5 dashboards:

* [**GPU/CPU Overview**](#gpucpu-overview-dashboard) dashboard—Provides information about what is happening right now in the cluster.
* [**GPU/CPU Overview**](#gpucpu-overview-dashboard-new-and-legacy) dashboard—Provides information about what is happening right now in the cluster.
* [**Quota Management**](#quota-management-dashboard) dashboard—Provides information about quota utilization.
* [**Analytics**](#analytics-dashboard) dashboard—Provides long term analysis of cluster behavior.
* [**Multi-Cluster Overview**](#multi-cluster-overview-dashboard) dashboard—Provides a more holistic, multi-cluster view of what is happening right now. The dashboard is intended for organizations that have more than one connected cluster.
4 changes: 2 additions & 2 deletions docs/admin/runai-setup/cluster-setup/cluster-install.md
@@ -33,7 +33,7 @@ On the next page:
## Verify your cluster's health

* Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
* Go to the [Overview Dashboard](../../performance/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
* Go to the [Overview Dashboard](../../performance/dashboard-analysis.md#gpucpu-overview-dashboard-new-and-legacy) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
* In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).

## Researcher Authentication
@@ -69,7 +69,7 @@ The following table describes the different statuses that a cluster could be in.
| Service issues | At least one of the *Services* is not working properly. You can view the list of nonfunctioning services for more information |
| Connected | All services are connected and up and running. |

See the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md#verifying-cluster-health) to help troubleshoot issues in the cluster.
See the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md) to help troubleshoot issues in the cluster.

## Customize your installation

@@ -69,7 +69,7 @@ For information on supported versions of managed Kubernetes, it's important to c
For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release History](https://kubernetes.io/releases/){target=_blank}.

!!! Note
Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#new-pvc-stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.
Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#--new-pvc--stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.

#### Pod Security Admission

2 changes: 1 addition & 1 deletion docs/admin/runai-setup/cluster-setup/cluster-upgrade.md
@@ -71,7 +71,7 @@ The process:

## Verify Successful Installation

See [Verify your installation](cluster-install.md#verify-your-installation) on how to verify a Run:ai cluster installation
See [Verify your installation](cluster-install.md#verify-your-clusters-health) on how to verify a Run:ai cluster installation



2 changes: 1 addition & 1 deletion docs/admin/runai-setup/config/dr.md
@@ -33,7 +33,7 @@ Run:ai stores metric history using [Thanos](https://github.com/thanos-io/thanos)

### Backing up Control-Plane Configuration

The installation of the Run:ai control plane can be [configured](../self-hosted/k8s/backend.md#optional-additional-configurations). The configuration is provided as `--set` command in the helm installation. These changes will be preserved on upgrade, but will not be preserved on uninstall or on damage to Kubernetes. Thus, it is best to back up these customizations. For a list of customizations used during the installation, run:
The installation of the Run:ai control plane can be [configured](../self-hosted/k8s/backend.md#additional-runai-configurations-optional). The configuration is provided as `--set` command in the helm installation. These changes will be preserved on upgrade, but will not be preserved on uninstall or upon damage to Kubernetes. Thus, it is best to back up these customizations. For a list of customizations used during the installation, run:

`helm get values runai-backend -n runai-backend`

4 changes: 2 additions & 2 deletions docs/admin/runai-setup/config/ha.md
@@ -11,7 +11,7 @@ A different scenario is a high transaction load, leading to system overload. To

### Run:ai system workers

The Run:ai control plane allows the **optional** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#optional-mark-runai-system-workers). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below will not span multiple nodes, and the system will remain with a single point of failure.
The Run:ai control plane allows the **optional** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#mark-runai-system-workers-optional). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below will not span multiple nodes, and the system will remain with a single point of failure.

### Horizontal Scalability of Run:ai services

@@ -40,7 +40,7 @@ Run:ai uses three third parties which are managed as Kubernetes StatefulSets:

### Run:ai system workers

The Run:ai cluster allows the **mandatory** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#optional-mark-runai-system-workers). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below may not span multiple nodes, and the system will remain with a single point of failure.
The Run:ai cluster allows the **mandatory** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#mark-runai-system-workers-optional). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below may not span multiple nodes, and the system will remain with a single point of failure.

### Prometheus

2 changes: 1 addition & 1 deletion docs/admin/runai-setup/config/org-cert.md
@@ -24,7 +24,7 @@ kubectl -n runai-backend create secret generic runai-ca-cert \
--from-file=runai-ca.pem=<ca_bundle_path>
```

* As part of the installation instructions you need to create a secret for [runai-backend-tls](../self-hosted/k8s/backend.md#domain-certificate). Use the local certificate authority instead.
* As part of the installation instructions, you need to create a secret for [runai-backend-tls](../self-hosted/k8s/preparations.md#domain-certificate). Use the local certificate authority instead.
* Install the control plane, add the following flag to the helm command `--set global.customCA.enabled=true`

## Cluster Installation
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/maintenance/node-downtime.md
@@ -64,7 +64,7 @@ kubectl taint nodes <node-name> runai=drain:NoExecute-
kubectl delete node <node-name>
```

However, if you plan to bring back the node, you will need to rejoin the node into the cluster. See [Rejoin](#Rejoin-a-Node-into-the-Kubernetes-Cluster).
However, if you plan to bring back the node, you will need to rejoin the node into the cluster. See [Rejoin](#rejoin-a-node-into-the-kubernetes-cluster).



2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/k8s/backend.md
@@ -17,7 +17,7 @@ Run the helm command below:
--set global.domain=<DOMAIN> # (1)
```

1. Domain name described [here](prerequisites.md#domain-name).
1. Domain name described [here](preparations.md#domain-certificate).

!!! Info
To install a specific version, add `--version <version>` to the install command. You can find available versions by running `helm search repo -l runai-backend`.
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/k8s/cluster.md
@@ -22,7 +22,7 @@ Install prerequisites as per [cluster prerequisites](../../cluster-setup/cluster
* Do not add the helm repository and do not run `helm repo update`.
* Instead, edit the `helm upgrade` command.
* Replace `runai/runai-cluster` with `runai-cluster-<version>.tgz`.
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#runai-software-files)
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#software-artifacts)

The command should look like the following:

2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/k8s/preparations.md
@@ -96,7 +96,7 @@ kubectl label node <NODE-NAME> node-role.kubernetes.io/runai-system=true

### External Postgres database (optional)

If you have opted to use an [external PostgreSQL database](prerequisites.md#external-postgresql-database-optional), you need to perform initial setup to ensure successful installation. Follow these steps:
If you have opted to use an [external PostgreSQL database](prerequisites.md#external-postgres-database-optional), you need to perform initial setup to ensure successful installation. Follow these steps:

1. Create a SQL script file, edit the parameters below, and save it locally:
* Replace `<DATABASE_NAME>` with a dedicate database name for RunAi in your PostgreSQL database.
@@ -22,7 +22,7 @@ This process may **need to be altered** if,

Run:ai allows the **association** of a Run:ai Project with any existing Kubernetes namespace:

* When [setting up](cluster.md#customize-installation) a Run:ai cluster, Disable namespace creation by setting the cluster flag `createNamespaces` to `false`.
* When [setting up](cluster.md#optional-customize-installation) a Run:ai cluster, Disable namespace creation by setting the cluster flag `createNamespaces` to `false`.
* Using the Run:ai User Interface, create a new Project `<PROJECT-NAME>`. A namespace will **not** be created.
* Associate and existing namepace `<NAMESPACE>` with the Run:ai project by running:

4 changes: 2 additions & 2 deletions docs/admin/runai-setup/self-hosted/k8s/upgrade.md
@@ -30,7 +30,7 @@ If you are installing an air-gapped version of Run:ai, The Run:ai tar file conta

=== "Airgapped"
* Ask for a tar file `runai-air-gapped-<NEW-VERSION>.tar.gz` from Run:ai customer support. The file contains the new version you want to upgrade to. `<NEW-VERSION>` is the updated version of the Run:ai control plane.
* Upload the images as described [here](preparations.md#runai-software-files).
* Upload the images as described [here](preparations.md#software-artifacts).

## Before upgrade

@@ -94,7 +94,7 @@ kubectl delete ing -n runai-backend runai-backend-ingress
The Run:ai control-plane installation has been rewritten and is no longer using a _backend values file_. Instead, to customize the installation use standard `--set` flags. If you have previously customized the installation, you must now extract these customizations and add them as `--set` flag to the helm installation:

* Find previous customizations to the control plane if such exist. Run:ai provides a utility for that here `https://raw.githubusercontent.com/run-ai/docs/v2.13/install/backend/cp-helm-vals-diff.sh`. For information on how to use this utility please contact Run:ai customer support.
* Search for the customizations you found in the [optional configurations](./backend.md#optional-additional-configurations) table and add them in the new format.
* Search for the customizations you found in the [optional configurations](./backend.md#additional-runai-configurations-optional) table and add them in the new format.


## Upgrade Control Plane
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/ocp/cluster.md
@@ -48,7 +48,7 @@ The last namespace (`runai-scale-adjust`) is only required if the cluster is a c
* Do not add the helm repository and do not run `helm repo update`.
* Instead, edit the `helm upgrade` command.
* Replace `runai/runai-cluster` with `runai-cluster-<version>.tgz`.
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#runai-software-files)
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#software-artifacts)
* Add `--set global.customCA.enabled=true` and perform the instructions for [local certificate authority](../../config/org-cert.md).

The command should look like the following:
6 changes: 3 additions & 3 deletions docs/admin/runai-setup/self-hosted/ocp/upgrade.md
@@ -29,7 +29,7 @@ If you are installing an air-gapped version of Run:ai, The Run:ai tar file conta

=== "Airgapped"
* Ask for a tar file `runai-air-gapped-<NEW-VERSION>.tar.gz` from Run:ai customer support. The file contains the new version you want to upgrade to. `<NEW-VERSION>` is the updated version of the Run:ai control plane.
* Upload the images as described [here](preparations.md#runai-software-files).
* Upload the images as described [here](preparations.md#software-artifacts).

## Before upgrade

@@ -47,7 +47,7 @@ kubectl delete secret -n runai-backend runai-backend-postgresql
kubectl delete sts -n runai-backend keycloak runai-backend-postgresql
```

Then upgrade the control plane as described [below](#upgrade-the-control-plane). Before upgrading, find customizations and merge them as discussed below.
Then upgrade the control plane as described [below](#upgrade-control-plane). Before upgrading, find customizations and merge them as discussed below.

### Upgrade from version 2.9, 2.10 or 2.11

@@ -72,7 +72,7 @@ kubectl patch pvc -n runai-backend pvc-postgresql -p '{"metadata": {"annotation
The Run:ai control-plane installation has been rewritten and is no longer using a _backend values file_. Instead, to customize the installation use standard `--set` flags. If you have previously customized the installation, you must now extract these customizations and add them as `--set` flag to the helm installation:

* Find previous customizations to the control plane if such exist. Run:ai provides a utility for that here `https://raw.githubusercontent.com/run-ai/docs/v2.13/install/backend/cp-helm-vals-diff.sh`. For information on how to use this utility please contact Run:ai customer support.
* Search for the customizations you found in the [optional configurations](./backend.md#optional-additional-configurations) table and add them in the new format.
* Search for the customizations you found in the [optional configurations](./backend.md#additional-runai-configurations-optional) table and add them in the new format.


## Upgrade Control Plane
4 changes: 2 additions & 2 deletions docs/admin/troubleshooting/troubleshooting.md
@@ -61,7 +61,7 @@

Add verbosity to Prometheus as describe [here](diagnostics.md).Verify that there are no errors. If there are connectivity-related errors you may need to:

* Check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-access-requirements.md).
* Check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-access-requirements).
* If you need to set up an internet proxy or certificate, please contact Run:ai customer support.
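Checking the firewall for outbound connections, as the corrected link above advises, amounts to testing reachability of each permitted URL. A small helper can turn the URL list into host/port pairs to probe (for example with `nc -vz host port`). The URL list below is a placeholder, not the authoritative list from the network requirements page:

```python
# Sketch: derive (host, port) probe targets from a permitted-URL list so a
# firewall check can be scripted. Placeholder URLs; consult the network
# requirements documentation for the real list.
from urllib.parse import urlparse

def endpoint(url: str) -> tuple[str, int]:
    """Return the (host, port) pair to test for outbound connectivity."""
    parts = urlparse(url)
    return parts.hostname, parts.port or (443 if parts.scheme == "https" else 80)

for url in ["https://app.run.ai", "https://auth.run.ai"]:  # placeholder list
    host, port = endpoint(url)
    print(f"probe: nc -vz {host} {port}")
```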


@@ -250,7 +250,7 @@
__Resolution__

* Run: `runai pods -n runai | grep agent`. See that the agent is in _Running_ state. Select the agent's full name and run: `kubectl logs -n runai runai-agent-<id>`.
* Verify that there are no errors. If there are connectivity-related errors you may need to check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-requirements).
* Verify that there are no errors. If there are connectivity-related errors you may need to check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-access-requirements).
* If you need to set up an internet proxy or certificate, please contact Run:ai customer support.

??? "Jobs are not syncing"
2 changes: 1 addition & 1 deletion docs/admin/workloads/README.md
@@ -31,7 +31,7 @@ Third party integrations are tools that Run:ai supports and manages. These are t
1. Smart gang scheduling (workload aware).
2. Specific workload aware visibility so that different kinds of pods are identified as a single workload (for example, GPU Utilization, workload view, dashboards).

For more information, see [Supported integrations](#supported-integrations).
For more information, see [Supported integrations](#third-party-integrations).

### Typical Kubernetes workloads
