
Reference fixes #887


Merged: 1 commit, Jul 31, 2024
1 change: 0 additions & 1 deletion docs/Researcher/Walkthroughs/quickstart-overview.md
@@ -7,7 +7,6 @@ Follow the Quickstart documents below to learn more:
* [Interactive build sessions with externalized services](walkthrough-build-ports.md)
* [Using GPU Fractions](walkthrough-fractions.md)
* [Distributed Training](walkthrough-distributed-training.md)
* [Hyperparameter Optimization](walkthrough-hpo.md)
* [Over-Quota, Basic Fairness & Bin Packing](walkthrough-overquota.md)
* [Fairness](walkthrough-queue-fairness.md)
* [Inference](quickstart-inference.md)
7 changes: 0 additions & 7 deletions docs/Researcher/best-practices/env-variables.md
@@ -13,13 +13,6 @@ Run:ai provides the following environment variables:
Note that the Job can be deleted and then recreated with the same name. A Job UUID will be different even if the Job names are the same.


## Identifying a Pod

With [Hyperparameter Optimization](../Walkthroughs/walkthrough-hpo.md), experiments are run as _Pods_ within the Job. Run:ai provides the following environment variables to identify the Pod.

* ``POD_INDEX`` - An index number (0, 1, 2, 3, ...) for a specific Pod within the Job. This is useful for Hyperparameter Optimization, as it allows easy mapping to individual experiments (see the sketch below). The Pod index remains the same if the Pod is restarted (due to a failure or preemption), so the Researcher can use it to identify experiments.
* ``POD_UUID`` - A unique identifier for the Pod. If the Pod is restarted, the Pod UUID will change.
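
For illustration, a minimal sketch of how a training script might map ``POD_INDEX`` to an experiment; the hyperparameter grid here is hypothetical:

```python
import os

# Hypothetical hyperparameter grid; the values are illustrative only.
learning_rates = [0.1, 0.01, 0.001, 0.0001]

# POD_INDEX is stable across restarts, so a pod maps back to the
# same experiment after a failure or preemption.
pod_index = int(os.environ.get("POD_INDEX", "0"))
lr = learning_rates[pod_index % len(learning_rates)]

print(f"Pod {pod_index} (UUID {os.environ.get('POD_UUID')}) using lr={lr}")
```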

## GPU Allocation

Run:ai provides an environment variable, visible inside the container, to help identify the number of GPUs allocated for the container. Use `RUNAI_NUM_OF_GPUS`
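
For illustration, a minimal sketch of reading the variable from a Python entrypoint; parsing it as a float is an assumption, to cover fractional GPU allocations:

```python
import os

# RUNAI_NUM_OF_GPUS holds the GPU count allocated to this container.
# Parsing as a float is an assumption, to cover fractional
# allocations such as 0.5 GPU.
num_gpus = float(os.environ.get("RUNAI_NUM_OF_GPUS", "0"))
print(f"Allocated GPUs: {num_gpus}")
```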
8 changes: 0 additions & 8 deletions docs/Researcher/cli-reference/runai-submit.md
@@ -50,14 +50,6 @@ runai submit --name frac05 -i gcr.io/run-ai-demo/quickstart -g 0.5

(see: [GPU fractions Quickstart](../Walkthroughs/walkthrough-fractions.md)).

Hyperparameter Optimization

```console
runai submit --name hpo1 -i gcr.io/run-ai-demo/quickstart-hpo -g 1 \
--parallelism 3 --completions 12 -v /nfs/john/hpo:/hpo
```

(see: [hyperparameter optimization Quickstart](../Walkthroughs/walkthrough-hpo.md)).

Submit a Job without a name (automatically generates a name)

2 changes: 0 additions & 2 deletions docs/Researcher/scheduling/the-runai-scheduler.md
@@ -226,5 +226,3 @@ To search for good hyperparameters, Researchers typically start a series of small

With HPO, the Researcher provides a single script that is run multiple times with varying parameters. Each run is a *pod* (see definition above). Unlike Gang Scheduling, with HPO the pods are **independent**: they are scheduled, started, and ended independently, and if one is preempted, the other pods are unaffected. The scheduling behavior of individual pods is exactly as described in the Scheduler Details section above for Jobs.
If node pools are enabled and the HPO workload has been assigned more than one node pool, different pods might end up running on different node pools.

For more information on Hyperparameter Optimization in Run:ai see [here](../Walkthroughs/walkthrough-hpo.md)
2 changes: 1 addition & 1 deletion docs/admin/troubleshooting/cluster-health-check.md
@@ -186,7 +186,7 @@ kubectl get cm runai-public -oyaml

### Resources not deployed / System unavailable / Reconciliation failed

1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
1. Run the [Preinstall diagnostic script](../runai-setup/cluster-setup/cluster-prerequisites.md#pre-install-script) and check for issues.
2. Run

```
4 changes: 2 additions & 2 deletions docs/admin/workloads/README.md
@@ -121,8 +121,8 @@ To get the full experience of Run:ai’s environment and platform use the follow

* [Workspaces](../../Researcher/user-interface/workspaces/overview.md#getting-familiar-with-workspaces)
* [Trainings](../../Researcher/user-interface/trainings.md#trainings) (Only available when using the *Jobs* view)
* [Distributed trainings](../../Researcher/user-interface/trainings.md#trainings)
* [Deployment](../admin-ui-setup/deployments.md#viewing-and-submitting-deployments)
* [Distributed training](../../Researcher/user-interface/trainings.md#trainings)
* Deployments.

## Workload-related Integrations

7 changes: 3 additions & 4 deletions docs/admin/workloads/inference-overview.md
@@ -30,13 +30,12 @@ Run:ai provides *Inference* services as an equal part together with the other tw

* Multiple replicas will appear in Run:ai as a single *Inference* workload. The workload will appear in all Run:ai dashboards and views as well as the Command-line interface.

* Inference workloads can be submitted via Run:ai [user interface](../admin-ui-setup/deployments.md) as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
* Inference workloads can be submitted via Run:ai user interface as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.

## Autoscaling

To meet SLAs, *Inference* workloads are typically set up with *autoscaling*. Autoscaling is the ability to add more computing power (Kubernetes pods) when the load increases and to shrink the allocated resources when the system is idle.

There are a number of ways to trigger autoscaling. Run:ai supports the following:
There are several ways to trigger autoscaling. Run:ai supports the following:

| Metric | Units | Run:ai name |
|-----------------|--------------|-----------------|
@@ -45,7 +44,7 @@ There are a number of ways to trigger autoscaling. Run:ai supports the following

The Minimum and Maximum number of replicas can be configured as part of the autoscaling configuration.
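
Purely as an illustration, these knobs can be sketched together as follows; the field names are hypothetical and not the actual Run:ai API schema:

```python
# Hypothetical autoscaling settings; the field names are illustrative
# and not the actual Run:ai API schema.
autoscaling = {
    "metric": "concurrency",  # in-flight requests per replica
    "target": 10,             # scale up when the average exceeds this
    "minReplicas": 0,         # 0 enables scale-to-zero
    "maxReplicas": 8,
}
```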

Autoscaling also supports a scale to zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
Autoscaling also supports a scale-to-zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.

This has the benefit of conserving resources at the risk of a delay from "cold starting" the model when traffic resumes.

4 changes: 0 additions & 4 deletions mkdocs.yml
@@ -113,9 +113,6 @@ plugins:
'admin/runai-setup/cluster-setup/researcher-authentication.md' : 'admin/runai-setup/authentication/sso.md'
'admin/researcher-setup/cli-troubleshooting.md' : 'admin/troubleshooting/troubleshooting.md'
'developer/deprecated/inference/submit-via-yaml.md' : 'developer/cluster-api/other-resources.md'
'Researcher/researcher-library/rl-hpo-support.md' : 'Researcher/scheduling/hpo.md'
'Researcher/researcher-library/researcher-library-overview.md' : 'Researcher/scheduling/hpo.md'

nav:
- Home:
- 'Overview': 'index.md'
@@ -217,7 +214,6 @@
- 'Dashboard Analysis' : 'admin/admin-ui-setup/dashboard-analysis.md'
- 'Jobs' : 'admin/admin-ui-setup/jobs.md'
- 'Credentials' : 'admin/admin-ui-setup/credentials-setup.md'
- 'Deployments' : 'admin/admin-ui-setup/deployments.md'
- 'Templates': 'admin/admin-ui-setup/templates.md'
- 'Troubleshooting' :
- 'Cluster Health' : 'admin/troubleshooting/cluster-health-check.md'