diff --git a/docs/Researcher/Walkthroughs/quickstart-overview.md b/docs/Researcher/Walkthroughs/quickstart-overview.md
index 7e8ada7d47..ef25b7a14f 100644
--- a/docs/Researcher/Walkthroughs/quickstart-overview.md
+++ b/docs/Researcher/Walkthroughs/quickstart-overview.md
@@ -7,7 +7,6 @@ Follow the Quickstart documents below to learn more:
 * [Interactive build sessions with externalized services](walkthrough-build-ports.md)
 * [Using GPU Fractions](walkthrough-fractions.md)
 * [Distributed Training](walkthrough-distributed-training.md)
-* [Hyperparameter Optimization](walkthrough-hpo.md)
 * [Over-Quota, Basic Fairness & Bin Packing](walkthrough-overquota.md)
 * [Fairness](walkthrough-queue-fairness.md)
 * [Inference](quickstart-inference.md)
diff --git a/docs/Researcher/best-practices/env-variables.md b/docs/Researcher/best-practices/env-variables.md
index a235e989a6..a131a32a28 100644
--- a/docs/Researcher/best-practices/env-variables.md
+++ b/docs/Researcher/best-practices/env-variables.md
@@ -13,13 +13,6 @@ Run:ai provides the following environment variables:
 
 Note that the Job can be deleted and then recreated with the same name. A Job UUID will be different even if the Job names are the same.
 
-## Identifying a Pod
-
-With [Hyperparameter Optimization](../Walkthroughs/walkthrough-hpo.md), experiments are run as _Pods_ within the Job. Run:ai provides the following environment variables to identify the Pod.
-
-* ``POD_INDEX`` - An index number (0, 1, 2, 3....) for a specific Pod within the Job. This is useful for Hyperparameter Optimization to allow easy mapping to individual experiments. The Pod index will remain the same if restarted (due to a failure or preemption). Therefore, it can be used by the Researcher to identify experiments.
-* ``POD_UUID`` - a unique identifier for the Pod. if the Pod is restarted, the Pod UUID will change.
-
 ## GPU Allocation
 
 Run:ai provides an environment variable, visible inside the container, to help identify the number of GPUs allocated for the container. Use `RUNAI_NUM_OF_GPUS`
diff --git a/docs/Researcher/cli-reference/runai-submit.md b/docs/Researcher/cli-reference/runai-submit.md
index fe026c01ab..4426884676 100644
--- a/docs/Researcher/cli-reference/runai-submit.md
+++ b/docs/Researcher/cli-reference/runai-submit.md
@@ -50,14 +50,6 @@ runai submit --name frac05 -i gcr.io/run-ai-demo/quickstart -g 0.5
 
 (see: [GPU fractions Quickstart](../Walkthroughs/walkthrough-fractions.md)).
 
-Hyperparameter Optimization
-
-```console
-runai submit --name hpo1 -i gcr.io/run-ai-demo/quickstart-hpo -g 1 \
-    --parallelism 3 --completions 12 -v /nfs/john/hpo:/hpo
-```
-
-(see: [hyperparameter optimization Quickstart](../Walkthroughs/walkthrough-hpo.md)).
 
 Submit a Job without a name (automatically generates a name)
 
diff --git a/docs/Researcher/scheduling/the-runai-scheduler.md b/docs/Researcher/scheduling/the-runai-scheduler.md
index b8f0550a8d..804fea145a 100644
--- a/docs/Researcher/scheduling/the-runai-scheduler.md
+++ b/docs/Researcher/scheduling/the-runai-scheduler.md
@@ -226,5 +226,3 @@ To search for good hyperparameters, Researchers typically start a series of smal
 With HPO, the Researcher provides a single script that is used with multiple, varying, parameters. Each run is a *pod* (see definition above). Unlike Gang Scheduling, with HPO, pods are **independent**. They are scheduled independently, started, and end independently, and if preempted, the other pods are unaffected. 
 
 The scheduling behavior for individual pods is exactly as described in the Scheduler Details section above for Jobs. In case node pools are enabled, if the HPO workload has been assigned with more than one node pool, the different pods might end up running on different node pools.
-
-For more information on Hyperparameter Optimization in Run:ai see [here](../Walkthroughs/walkthrough-hpo.md)
diff --git a/docs/admin/troubleshooting/cluster-health-check.md b/docs/admin/troubleshooting/cluster-health-check.md
index 78e7018616..3c204fd850 100644
--- a/docs/admin/troubleshooting/cluster-health-check.md
+++ b/docs/admin/troubleshooting/cluster-health-check.md
@@ -186,7 +186,7 @@ kubectl get cm runai-public -oyaml
 
 ### Resources not deployed / System unavailable / Reconciliation failed
 
-1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
+1. Run the [Preinstall diagnostic script](../runai-setup/cluster-setup/cluster-prerequisites.md#pre-install-script) and check for issues.
 2. Run
 
 ```
diff --git a/docs/admin/workloads/README.md b/docs/admin/workloads/README.md
index 556b945104..125df62201 100644
--- a/docs/admin/workloads/README.md
+++ b/docs/admin/workloads/README.md
@@ -121,8 +121,8 @@ To get the full experience of Run:ai’s environment and platform use the follow
 
 * [Workspaces](../../Researcher/user-interface/workspaces/overview.md#getting-familiar-with-workspaces)
 * [Trainings](../../Researcher/user-interface/trainings.md#trainings) (Only available when using the *Jobs* view)
-* [Distributed trainings](../../Researcher/user-interface/trainings.md#trainings)
-* [Deployment](../admin-ui-setup/deployments.md#viewing-and-submitting-deployments)
+* [Distributed training](../../Researcher/user-interface/trainings.md#trainings)
+* Deployments.
 
 ## Workload-related Integrations
 
diff --git a/docs/admin/workloads/inference-overview.md b/docs/admin/workloads/inference-overview.md
index 5c84085b91..5bf8e4e147 100644
--- a/docs/admin/workloads/inference-overview.md
+++ b/docs/admin/workloads/inference-overview.md
@@ -30,13 +30,12 @@ Run:ai provides *Inference* services as an equal part together with the other tw
 
 * Multiple replicas will appear in Run:ai as a single *Inference* workload. The workload will appear in all Run:ai dashboards and views as well as the Command-line interface.
 
-* Inference workloads can be submitted via Run:ai [user interface](../admin-ui-setup/deployments.md) as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
+* Inference workloads can be submitted via Run:ai user interface as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
 
 ## Autoscaling
 
 To withstand SLA, *Inference* workloads are typically set with *auto scaling*. Auto-scaling is the ability to add more computing power (Kubernetes pods) when the load increases and shrink allocated resources when the system is idle.
-
-There are a number of ways to trigger autoscaling. Run:ai supports the following:
+There are several ways to trigger autoscaling. Run:ai supports the following:
 
 | Metric | Units | Run:ai name |
 |-----------------|--------------|-----------------|
@@ -45,7 +44,7 @@ There are a number of ways to trigger autoscaling. Run:ai supports the following
 
 The Minimum and Maximum number of replicas can be configured as part of the autoscaling configuration.
 
-Autoscaling also supports a scale to zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
+Autoscaling also supports a scale-to-zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
 
 This has the benefit of conserving resources at the risk of a delay from "cold starting" the model when traffic resumes.
 
diff --git a/mkdocs.yml b/mkdocs.yml
index dd0620c3f3..1cea279e31 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -113,9 +113,6 @@ plugins:
         'admin/runai-setup/cluster-setup/researcher-authentication.md' : 'admin/runai-setup/authentication/sso.md'
         'admin/researcher-setup/cli-troubleshooting.md' : 'admin/troubleshooting/troubleshooting.md'
         'developer/deprecated/inference/submit-via-yaml.md' : 'developer/cluster-api/other-resources.md'
-        'Researcher/researcher-library/rl-hpo-support.md' : 'Researcher/scheduling/hpo.md'
-        'Researcher/researcher-library/researcher-library-overview.md' : 'Researcher/scheduling/hpo.md'
-
 nav:
 - Home:
   - 'Overview': 'index.md'
@@ -217,7 +214,6 @@ nav:
     - 'Dashboard Analysis' : 'admin/admin-ui-setup/dashboard-analysis.md'
    - 'Jobs' : 'admin/admin-ui-setup/jobs.md'
    - 'Credentials' : 'admin/admin-ui-setup/credentials-setup.md'
-   - 'Deployments' : 'admin/admin-ui-setup/deployments.md'
    - 'Templates': 'admin/admin-ui-setup/templates.md'
  - 'Troubleshooting' :
    - 'Cluster Health' : 'admin/troubleshooting/cluster-health-check.md'