Commit c2eb3a7

Merge pull request #887 from run-ai/reference-fixes ("Reference fixes")

1 parent bf22d88

File tree

8 files changed: +6, -29 lines

docs/Researcher/Walkthroughs/quickstart-overview.md

Lines changed: 0 additions & 1 deletion

```diff
@@ -7,7 +7,6 @@ Follow the Quickstart documents below to learn more:
 * [Interactive build sessions with externalized services](walkthrough-build-ports.md)
 * [Using GPU Fractions](walkthrough-fractions.md)
 * [Distributed Training](walkthrough-distributed-training.md)
-* [Hyperparameter Optimization](walkthrough-hpo.md)
 * [Over-Quota, Basic Fairness & Bin Packing](walkthrough-overquota.md)
 * [Fairness](walkthrough-queue-fairness.md)
 * [Inference](quickstart-inference.md)
```

docs/Researcher/best-practices/env-variables.md

Lines changed: 0 additions & 7 deletions

```diff
@@ -13,13 +13,6 @@ Run:ai provides the following environment variables:
 Note that the Job can be deleted and then recreated with the same name. A Job UUID will be different even if the Job names are the same.
 
 
-## Identifying a Pod
-
-With [Hyperparameter Optimization](../Walkthroughs/walkthrough-hpo.md), experiments are run as _Pods_ within the Job. Run:ai provides the following environment variables to identify the Pod.
-
-* ``POD_INDEX`` - An index number (0, 1, 2, 3....) for a specific Pod within the Job. This is useful for Hyperparameter Optimization to allow easy mapping to individual experiments. The Pod index will remain the same if restarted (due to a failure or preemption). Therefore, it can be used by the Researcher to identify experiments.
-* ``POD_UUID`` - a unique identifier for the Pod. if the Pod is restarted, the Pod UUID will change.
-
 ## GPU Allocation
 
 Run:ai provides an environment variable, visible inside the container, to help identify the number of GPUs allocated for the container. Use `RUNAI_NUM_OF_GPUS`
```
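For context, the `RUNAI_NUM_OF_GPUS` variable that remains in the doc can be read inside the container as follows. This is a minimal sketch: the variable name comes from the documentation above, while the fallback default of `"0"` for local testing is an assumption of this example.

```python
import os

# RUNAI_NUM_OF_GPUS is documented above as set by Run:ai inside the container.
# Defaulting to "0" when the variable is unset is an assumption for local runs.
num_gpus = int(os.environ.get("RUNAI_NUM_OF_GPUS", "0"))
print(f"GPUs allocated to this container: {num_gpus}")
```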

docs/Researcher/cli-reference/runai-submit.md

Lines changed: 0 additions & 8 deletions

````diff
@@ -50,14 +50,6 @@ runai submit --name frac05 -i gcr.io/run-ai-demo/quickstart -g 0.5
 
 (see: [GPU fractions Quickstart](../Walkthroughs/walkthrough-fractions.md)).
 
-Hyperparameter Optimization
-
-```console
-runai submit --name hpo1 -i gcr.io/run-ai-demo/quickstart-hpo -g 1 \
-  --parallelism 3 --completions 12 -v /nfs/john/hpo:/hpo
-```
-
-(see: [hyperparameter optimization Quickstart](../Walkthroughs/walkthrough-hpo.md)).
 
 Submit a Job without a name (automatically generates a name)
 
````
docs/Researcher/scheduling/the-runai-scheduler.md

Lines changed: 0 additions & 2 deletions

```diff
@@ -226,5 +226,3 @@ To search for good hyperparameters, Researchers typically start a series of smal
 
 With HPO, the Researcher provides a single script that is used with multiple, varying, parameters. Each run is a *pod* (see definition above). Unlike Gang Scheduling, with HPO, pods are **independent**. They are scheduled independently, started, and end independently, and if preempted, the other pods are unaffected. The scheduling behavior for individual pods is exactly as described in the Scheduler Details section above for Jobs.
 In case node pools are enabled, if the HPO workload has been assigned with more than one node pool, the different pods might end up running on different node pools.
-
-For more information on Hyperparameter Optimization in Run:ai see [here](../Walkthroughs/walkthrough-hpo.md)
```
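The HPO model described in this file (independent pods, each running one parameter combination) is commonly implemented by mapping a pod's index to a point in a parameter grid; the `POD_INDEX` variable removed elsewhere in this commit was documented for exactly that purpose. A minimal illustrative sketch, with grid values invented for the example:

```python
import itertools
import os

# Hypothetical hyperparameter grid; the values are invented for illustration.
grid = list(itertools.product([0.1, 0.01, 0.001],   # learning rates
                              [32, 64, 128, 256]))  # batch sizes

# Each independent pod selects one combination via its index
# (POD_INDEX in the Run:ai docs edited by this commit).
pod_index = int(os.environ.get("POD_INDEX", "0"))
lr, batch_size = grid[pod_index % len(grid)]
print(f"pod {pod_index}: lr={lr}, batch_size={batch_size}")
```

Note that the `runai submit` example removed in this same commit used `--completions 12`, which matches a 12-point grid like the 3 × 4 one sketched here.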

docs/admin/troubleshooting/cluster-health-check.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -186,7 +186,7 @@ kubectl get cm runai-public -oyaml
 
 ### Resources not deployed / System unavailable / Reconciliation failed
 
-1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
+1. Run the [Preinstall diagnostic script](../runai-setup/cluster-setup/cluster-prerequisites.md#pre-install-script) and check for issues.
 2. Run
 
 ```
````

docs/admin/workloads/README.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -121,8 +121,8 @@ To get the full experience of Run:ai’s environment and platform use the follow
 
 * [Workspaces](../../Researcher/user-interface/workspaces/overview.md#getting-familiar-with-workspaces)
 * [Trainings](../../Researcher/user-interface/trainings.md#trainings) (Only available when using the *Jobs* view)
-* [Distributed trainings](../../Researcher/user-interface/trainings.md#trainings)
-* [Deployment](../admin-ui-setup/deployments.md#viewing-and-submitting-deployments)
+* [Distributed training](../../Researcher/user-interface/trainings.md#trainings)
+* Deployments.
 
 ## Supported integrations
```

docs/admin/workloads/inference-overview.md

Lines changed: 3 additions & 4 deletions

```diff
@@ -30,13 +30,12 @@ Run:ai provides *Inference* services as an equal part together with the other tw
 
 * Multiple replicas will appear in Run:ai as a single *Inference* workload. The workload will appear in all Run:ai dashboards and views as well as the Command-line interface.
 
-* Inference workloads can be submitted via Run:ai [user interface](../admin-ui-setup/deployments.md) as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
+* Inference workloads can be submitted via Run:ai user interface as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
 
 ## Autoscaling
 
 To withstand SLA, *Inference* workloads are typically set with *auto scaling*. Auto-scaling is the ability to add more computing power (Kubernetes pods) when the load increases and shrink allocated resources when the system is idle.
-
-There are a number of ways to trigger autoscaling. Run:ai supports the following:
+There are several ways to trigger autoscaling. Run:ai supports the following:
 
 | Metric | Units | Run:ai name |
 |-----------------|--------------|-----------------|
@@ -45,7 +44,7 @@ There are a number of ways to trigger autoscaling. Run:ai supports the following
 
 The Minimum and Maximum number of replicas can be configured as part of the autoscaling configuration.
 
-Autoscaling also supports a scale to zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
+Autoscaling also supports a scale-to-zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
 
 This has the benefit of conserving resources at the risk of a delay from "cold starting" the model when traffic resumes.
 
```
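The autoscaling policy edited in this file (a per-replica target metric, min/max replica bounds, and scale to zero after sustained idle) can be sketched conceptually. This is not Run:ai's actual algorithm; the function, its parameters, and the clamping logic are illustrative assumptions:

```python
import math

def desired_replicas(concurrency: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int,
                     idle_seconds: float, scale_to_zero_after: float) -> int:
    """Illustrative autoscaling decision; not Run:ai's implementation.

    concurrency: current in-flight requests across all replicas.
    target_per_replica: configured target value for the chosen metric.
    """
    # Scale to zero only after sustained time with no traffic.
    if concurrency == 0 and idle_seconds >= scale_to_zero_after:
        return 0
    # Otherwise size to the load and clamp to the configured bounds.
    wanted = math.ceil(concurrency / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 4 concurrent requests per replica, 10 in-flight requests would ask for 3 replicas, while a long-idle service would drop to 0 and incur the "cold start" delay described above when traffic resumes.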

mkdocs.yml

Lines changed: 0 additions & 4 deletions

```diff
@@ -113,9 +113,6 @@ plugins:
     'admin/runai-setup/cluster-setup/researcher-authentication.md' : 'admin/runai-setup/authentication/sso.md'
     'admin/researcher-setup/cli-troubleshooting.md' : 'admin/troubleshooting/troubleshooting.md'
     'developer/deprecated/inference/submit-via-yaml.md' : 'developer/cluster-api/other-resources.md'
-    'Researcher/researcher-library/rl-hpo-support.md' : 'Researcher/scheduling/hpo.md'
-    'Researcher/researcher-library/researcher-library-overview.md' : 'Researcher/scheduling/hpo.md'
-
 nav:
   - Home:
     - 'Overview': 'index.md'
@@ -217,7 +214,6 @@ nav:
     - 'Dashboard Analysis' : 'admin/admin-ui-setup/dashboard-analysis.md'
     - 'Jobs' : 'admin/admin-ui-setup/jobs.md'
     - 'Credentials' : 'admin/admin-ui-setup/credentials-setup.md'
-    - 'Deployments' : 'admin/admin-ui-setup/deployments.md'
     - 'Templates': 'admin/admin-ui-setup/templates.md'
   - 'Troubleshooting' :
     - 'Cluster Health' : 'admin/troubleshooting/cluster-health-check.md'
```
