fix-anchors #915


Merged
merged 1 commit on Aug 6, 2024
2 changes: 1 addition & 1 deletion docs/Researcher/scheduling/GPU-time-slicing-scheduler.md
@@ -11,7 +11,7 @@ Run:ai supports simultaneous submission of multiple workloads to a single GPU wh

## New Time-slicing scheduler by Run:ai

To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#timeslicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.
To provide customers with predictable and accurate GPU compute resources scheduling, Run:ai is introducing a new feature called Time-slicing GPU scheduler which adds **fractional compute** capabilities on top of other existing Run:ai **memory fractions** capabilities. Unlike the default NVIDIA GPU orchestrator which doesn’t provide the ability to split or limit the runtime of each workload, Run:ai created a new mechanism that gives each workload **exclusive** access to the full GPU for a **limited** amount of time ([lease time](#time-slicing-plan-and-lease-times)) in each scheduling cycle ([plan time](#timeslicing-plan-and-lease-times)). This cycle repeats itself for the lifetime of the workload.

Using the GPU runtime this way guarantees a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction.
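The plan/lease mechanics quoted above can be sketched as follows. This is an illustrative model only, not Run:ai code; the function name, workload names, and the 250 ms plan time are assumptions for the example:

```python
# Illustrative sketch of a time-slicing plan: one scheduling cycle ("plan
# time") is divided into exclusive "leases", each proportional to the
# workload's requested GPU fraction. Hypothetical names and values.

def plan_leases(workloads, plan_time_ms=250):
    """Return (name, lease_ms) pairs for one plan cycle.

    Each workload gets exclusive GPU time proportional to its requested
    fraction; the cycle then repeats for the workload's lifetime.
    """
    total = sum(frac for _, frac in workloads)
    return [(name, plan_time_ms * frac / total) for name, frac in workloads]

# Example: two trainers and an inference job sharing one GPU.
for name, ms in plan_leases([("train-a", 0.5), ("train-b", 0.25), ("infer-c", 0.25)]):
    print(f"{name}: {ms:.0f} ms per cycle")
```

Note how the leases always sum to the plan time, which is what makes the compute guarantee proportional rather than best-effort.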

2 changes: 1 addition & 1 deletion docs/Researcher/scheduling/dynamic-gpu-fractions.md
@@ -71,7 +71,7 @@ The supported values depend on the label used. You can use them in either the UI
## Compute Resources UI with Dynamic Fractions support

To enable the UI elements for Dynamic Fractions, press *Settings*, *General*, then open the *Resources* pane and toggle *GPU Resource Optimization*. This enables all the UI features related to *GPU Resource Optimization* for the whole tenant. There are other per cluster or per node-pool configurations that should be configured in order to use the capabilities of ‘GPU Resource Optimization’ See the documentation for each of these features.
Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the [Metrics](../../admin/workloads/submitting-workloads.md#workloads-table) pane for each workload.
Once the ‘GPU Resource Optimization’ feature is enabled, you will be able to create *Compute Resources* with the *GPU Portion (Fraction)* Limit and *GPU Memory Limit*. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the Metrics pane for each workload.

![GPU Limit](img/GPU-resource-limit-enabled.png)
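The Request vs. Limit semantics described in this hunk can be sketched as follows. A hypothetical model, not Run:ai code; the function and parameter names are assumptions:

```python
# Illustrative model of dynamic GPU fractions: a workload is guaranteed its
# requested fraction and may burst up to its limit when spare capacity
# exists on the GPU. Hypothetical names; not Run:ai's implementation.

def allotted_fraction(request: float, limit: float, spare: float) -> float:
    """GPU fraction a workload can use right now: its guaranteed request,
    plus any spare capacity, capped at its configured limit."""
    if not (0.0 <= request <= limit <= 1.0):
        raise ValueError("expect 0 <= request <= limit <= 1")
    return min(limit, request + max(0.0, spare))

# With 40% of the GPU idle, a 0.25-request/0.5-limit workload bursts to its limit.
print(allotted_fraction(0.25, 0.5, spare=0.4))
```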

2 changes: 1 addition & 1 deletion docs/admin/performance/dashboard-analysis.md
@@ -25,7 +25,7 @@ These dashboards give system administrators the ability to drill down to see det

There are 5 dashboards:

* [**GPU/CPU Overview**](#gpucpu-overview-dashboard) dashboard—Provides information about what is happening right now in the cluster.
* [**GPU/CPU Overview**](#gpucpu-overview-dashboard-new-and-legacy) dashboard—Provides information about what is happening right now in the cluster.
* [**Quota Management**](#quota-management-dashboard) dashboard—Provides information about quota utilization.
* [**Analytics**](#analytics-dashboard) dashboard—Provides long term analysis of cluster behavior.
* [**Multi-Cluster Overview**](#multi-cluster-overview-dashboard) dashboard—Provides a more holistic, multi-cluster view of what is happening right now. The dashboard is intended for organizations that have more than one connected cluster.
4 changes: 2 additions & 2 deletions docs/admin/runai-setup/cluster-setup/cluster-install.md
@@ -33,7 +33,7 @@ On the next page:
## Verify your cluster's health

* Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
* Go to the [Overview Dashboard](../../performance/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
* Go to the [Overview Dashboard](../../performance/dashboard-analysis.md#gpucpu-overview-dashboard-new-and-legacy) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
* In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).

## Researcher Authentication
@@ -69,7 +69,7 @@ The following table describes the different statuses that a cluster could be in.
| Service issues | At least one of the *Services* is not working properly. You can view the list of nonfunctioning services for more information |
| Connected | All services are connected and up and running. |

See the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md#verifying-cluster-health) to help troubleshoot issues in the cluster.
See the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md) to help troubleshoot issues in the cluster.

## Customize your installation

@@ -69,7 +69,7 @@ For information on supported versions of managed Kubernetes, it's important to c
For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release History](https://kubernetes.io/releases/){target=_blank}.

!!! Note
Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#new-pvc-stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.
Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag [--pvc-new](../../../Researcher/cli-reference/runai-submit.md#--new-pvc--stringarray). A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property `volumeBindingMode` equals to `WaitForFirstConsumer`) will [not work](https://kubernetes.io/docs/concepts/storage/storage-capacity/){target=_blank} on Kubernetes 1.23 or lower.

#### Pod Security Admission

2 changes: 1 addition & 1 deletion docs/admin/runai-setup/cluster-setup/cluster-upgrade.md
@@ -71,7 +71,7 @@ The process:

## Verify Successful Installation

See [Verify your installation](cluster-install.md#verify-your-installation) on how to verify a Run:ai cluster installation
See [Verify your installation](cluster-install.md#verify-your-clusters-health) on how to verify a Run:ai cluster installation



2 changes: 1 addition & 1 deletion docs/admin/runai-setup/config/dr.md
@@ -33,7 +33,7 @@ Run:ai stores metric history using [Thanos](https://github.com/thanos-io/thanos)

### Backing up Control-Plane Configuration

The installation of the Run:ai control plane can be [configured](../self-hosted/k8s/backend.md#optional-additional-configurations). The configuration is provided as `--set` command in the helm installation. These changes will be preserved on upgrade, but will not be preserved on uninstall or on damage to Kubernetes. Thus, it is best to back up these customizations. For a list of customizations used during the installation, run:
The installation of the Run:ai control plane can be [configured](../self-hosted/k8s/backend.md#additional-runai-configurations-optional). The configuration is provided as `--set` command in the helm installation. These changes will be preserved on upgrade, but will not be preserved on uninstall or upon damage to Kubernetes. Thus, it is best to back up these customizations. For a list of customizations used during the installation, run:

`helm get values runai-backend -n runai-backend`

4 changes: 2 additions & 2 deletions docs/admin/runai-setup/config/ha.md
@@ -11,7 +11,7 @@ A different scenario is a high transaction load, leading to system overload. To

### Run:ai system workers

The Run:ai control plane allows the **optional** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#optional-mark-runai-system-workers). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below will not span multiple nodes, and the system will remain with a single point of failure.
The Run:ai control plane allows the **optional** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#mark-runai-system-workers-optional). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below will not span multiple nodes, and the system will remain with a single point of failure.

### Horizontal Scalability of Run:ai services

@@ -40,7 +40,7 @@ Run:ai uses three third parties which are managed as Kubernetes StatefulSets:

### Run:ai system workers

The Run:ai cluster allows the **mandatory** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#optional-mark-runai-system-workers). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below may not span multiple nodes, and the system will remain with a single point of failure.
The Run:ai cluster allows the **mandatory** [gathering of Run:ai pods into specific nodes](../self-hosted/k8s/preparations.md#mark-runai-system-workers-optional). When this feature is used, it is important to set more than one node as a Run:ai system worker. Otherwise, the horizontal scaling described below may not span multiple nodes, and the system will remain with a single point of failure.

### Prometheus

2 changes: 1 addition & 1 deletion docs/admin/runai-setup/config/org-cert.md
@@ -24,7 +24,7 @@ kubectl -n runai-backend create secret generic runai-ca-cert \
--from-file=runai-ca.pem=<ca_bundle_path>
```

* As part of the installation instructions you need to create a secret for [runai-backend-tls](../self-hosted/k8s/backend.md#domain-certificate). Use the local certificate authority instead.
* As part of the installation instructions, you need to create a secret for [runai-backend-tls](../self-hosted/k8s/preparations.md#domain-certificate). Use the local certificate authority instead.
* Install the control plane, add the following flag to the helm command `--set global.customCA.enabled=true`

## Cluster Installation
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/maintenance/node-downtime.md
@@ -64,7 +64,7 @@ kubectl taint nodes <node-name> runai=drain:NoExecute-
kubectl delete node <node-name>
```

However, if you plan to bring back the node, you will need to rejoin the node into the cluster. See [Rejoin](#Rejoin-a-Node-into-the-Kubernetes-Cluster).
However, if you plan to bring back the node, you will need to rejoin the node into the cluster. See [Rejoin](#rejoin-a-node-into-the-kubernetes-cluster).



2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/k8s/backend.md
@@ -17,7 +17,7 @@ Run the helm command below:
--set global.domain=<DOMAIN> # (1)
```

1. Domain name described [here](prerequisites.md#domain-name).
1. Domain name described [here](preparations.md#domain-certificate).

!!! Info
To install a specific version, add `--version <version>` to the install command. You can find available versions by running `helm search repo -l runai-backend`.
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/k8s/cluster.md
@@ -22,7 +22,7 @@ Install prerequisites as per [cluster prerequisites](../../cluster-setup/cluster
* Do not add the helm repository and do not run `helm repo update`.
* Instead, edit the `helm upgrade` command.
* Replace `runai/runai-cluster` with `runai-cluster-<version>.tgz`.
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#runai-software-files)
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#software-artifacts)

The command should look like the following:

2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/k8s/preparations.md
@@ -96,7 +96,7 @@ kubectl label node <NODE-NAME> node-role.kubernetes.io/runai-system=true

### External Postgres database (optional)

If you have opted to use an [external PostgreSQL database](prerequisites.md#external-postgresql-database-optional), you need to perform initial setup to ensure successful installation. Follow these steps:
If you have opted to use an [external PostgreSQL database](prerequisites.md#external-postgres-database-optional), you need to perform initial setup to ensure successful installation. Follow these steps:

1. Create a SQL script file, edit the parameters below, and save it locally:
* Replace `<DATABASE_NAME>` with a dedicate database name for RunAi in your PostgreSQL database.
@@ -22,7 +22,7 @@ This process may **need to be altered** if,

Run:ai allows the **association** of a Run:ai Project with any existing Kubernetes namespace:

* When [setting up](cluster.md#customize-installation) a Run:ai cluster, Disable namespace creation by setting the cluster flag `createNamespaces` to `false`.
* When [setting up](cluster.md#optional-customize-installation) a Run:ai cluster, Disable namespace creation by setting the cluster flag `createNamespaces` to `false`.
* Using the Run:ai User Interface, create a new Project `<PROJECT-NAME>`. A namespace will **not** be created.
* Associate and existing namepace `<NAMESPACE>` with the Run:ai project by running:

4 changes: 2 additions & 2 deletions docs/admin/runai-setup/self-hosted/k8s/upgrade.md
@@ -30,7 +30,7 @@ If you are installing an air-gapped version of Run:ai, The Run:ai tar file conta

=== "Airgapped"
* Ask for a tar file `runai-air-gapped-<NEW-VERSION>.tar.gz` from Run:ai customer support. The file contains the new version you want to upgrade to. `<NEW-VERSION>` is the updated version of the Run:ai control plane.
* Upload the images as described [here](preparations.md#runai-software-files).
* Upload the images as described [here](preparations.md#software-artifacts).

## Before upgrade

@@ -94,7 +94,7 @@ kubectl delete ing -n runai-backend runai-backend-ingress
The Run:ai control-plane installation has been rewritten and is no longer using a _backend values file_. Instead, to customize the installation use standard `--set` flags. If you have previously customized the installation, you must now extract these customizations and add them as `--set` flag to the helm installation:

* Find previous customizations to the control plane if such exist. Run:ai provides a utility for that here `https://raw.githubusercontent.com/run-ai/docs/v2.13/install/backend/cp-helm-vals-diff.sh`. For information on how to use this utility please contact Run:ai customer support.
* Search for the customizations you found in the [optional configurations](./backend.md#optional-additional-configurations) table and add them in the new format.
* Search for the customizations you found in the [optional configurations](./backend.md#additional-runai-configurations-optional) table and add them in the new format.


## Upgrade Control Plane
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/ocp/cluster.md
@@ -48,7 +48,7 @@ The last namespace (`runai-scale-adjust`) is only required if the cluster is a c
* Do not add the helm repository and do not run `helm repo update`.
* Instead, edit the `helm upgrade` command.
* Replace `runai/runai-cluster` with `runai-cluster-<version>.tgz`.
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#runai-software-files)
* Add `--set global.image.registry=<Docker Registry address>` where the registry address is as entered in the [preparation section](./preparations.md#software-artifacts)
* Add `--set global.customCA.enabled=true` and perform the instructions for [local certificate authority](../../config/org-cert.md).

The command should look like the following:
6 changes: 3 additions & 3 deletions docs/admin/runai-setup/self-hosted/ocp/upgrade.md
@@ -29,7 +29,7 @@ If you are installing an air-gapped version of Run:ai, The Run:ai tar file conta

=== "Airgapped"
* Ask for a tar file `runai-air-gapped-<NEW-VERSION>.tar.gz` from Run:ai customer support. The file contains the new version you want to upgrade to. `<NEW-VERSION>` is the updated version of the Run:ai control plane.
* Upload the images as described [here](preparations.md#runai-software-files).
* Upload the images as described [here](preparations.md#software-artifacts).

## Before upgrade

@@ -47,7 +47,7 @@ kubectl delete secret -n runai-backend runai-backend-postgresql
kubectl delete sts -n runai-backend keycloak runai-backend-postgresql
```

Then upgrade the control plane as described [below](#upgrade-the-control-plane). Before upgrading, find customizations and merge them as discussed below.
Then upgrade the control plane as described [below](#upgrade-control-plane). Before upgrading, find customizations and merge them as discussed below.

### Upgrade from version 2.9, 2.10 or 2.11

@@ -72,7 +72,7 @@ kubectl patch pvc -n runai-backend pvc-postgresql -p '{"metadata": {"annotation
The Run:ai control-plane installation has been rewritten and is no longer using a _backend values file_. Instead, to customize the installation use standard `--set` flags. If you have previously customized the installation, you must now extract these customizations and add them as `--set` flag to the helm installation:

* Find previous customizations to the control plane if such exist. Run:ai provides a utility for that here `https://raw.githubusercontent.com/run-ai/docs/v2.13/install/backend/cp-helm-vals-diff.sh`. For information on how to use this utility please contact Run:ai customer support.
* Search for the customizations you found in the [optional configurations](./backend.md#optional-additional-configurations) table and add them in the new format.
* Search for the customizations you found in the [optional configurations](./backend.md#additional-runai-configurations-optional) table and add them in the new format.


## Upgrade Control Plane
4 changes: 2 additions & 2 deletions docs/admin/troubleshooting/troubleshooting.md
@@ -61,7 +61,7 @@

Add verbosity to Prometheus as describe [here](diagnostics.md).Verify that there are no errors. If there are connectivity-related errors you may need to:

* Check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-access-requirements.md).
* Check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-access-requirements).
* If you need to set up an internet proxy or certificate, please contact Run:ai customer support.
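Checking the firewall for outbound connections, as the corrected link above advises, amounts to testing reachability of each permitted URL. A small helper can turn the URL list into host/port pairs to probe (for example with `nc -vz host port`). The URL list below is a placeholder, not the authoritative list from the network requirements page:

```python
# Sketch: derive (host, port) probe targets from a permitted-URL list so a
# firewall check can be scripted. Placeholder URLs; consult the network
# requirements documentation for the real list.
from urllib.parse import urlparse

def endpoint(url: str) -> tuple[str, int]:
    """Return the (host, port) pair to test for outbound connectivity."""
    parts = urlparse(url)
    return parts.hostname, parts.port or (443 if parts.scheme == "https" else 80)

for url in ["https://app.run.ai", "https://auth.run.ai"]:  # placeholder list
    host, port = endpoint(url)
    print(f"probe: nc -vz {host} {port}")
```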


@@ -250,7 +250,7 @@
__Resolution__

* Run: `runai pods -n runai | grep agent`. See that the agent is in _Running_ state. Select the agent's full name and run: `kubectl logs -n runai runai-agent-<id>`.
* Verify that there are no errors. If there are connectivity-related errors you may need to check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-requirements).
* Verify that there are no errors. If there are connectivity-related errors you may need to check your firewall for outbound connections. See the required permitted URL list in [Network requirements](../runai-setup/cluster-setup/cluster-prerequisites.md#network-access-requirements).
* If you need to set up an internet proxy or certificate, please contact Run:ai customer support.

??? "Jobs are not syncing"
2 changes: 1 addition & 1 deletion docs/admin/workloads/README.md
@@ -31,7 +31,7 @@ Third party integrations are tools that Run:ai supports and manages. These are t
1. Smart gang scheduling (workload aware).
2. Specific workload aware visibility so that different kinds of pods are identified as a single workload (for example, GPU Utilization, workload view, dashboards).

For more information, see [Supported integrations](#supported-integrations).
For more information, see [Supported integrations](#third-party-integrations).

### Typical Kubernetes workloads
