
Existing PVC #897

Merged: 26 commits, Aug 1, 2024

Commits (26)

* `f3b590a` Update whats-new-2-18.md (jasonnovichRunAI, Jul 22, 2024)
* `fd48a07` Update hotfixes-2-16.md (JamieWeider72, Jul 23, 2024)
* `97868b1` Merge pull request #873 from run-ai/JamieWeider72-patch-4 (JamieWeider72, Jul 23, 2024)
* `58f088d` policies-example (yarongol, Jul 23, 2024)
* `c882e05` Merge pull request #875 from run-ai/policies-example-update (yarongol, Jul 23, 2024)
* `4d42143` Update automated-publish-docs.yaml (haimlevy2006, Jul 23, 2024)
* `81ca00e` Update automated-publish-docs.yaml (haimlevy2006, Jul 23, 2024)
* `26a3854` Update automated-publish-docs.yaml (haimlevy2006, Jul 23, 2024)
* `a2fa4e0` Update automated-publish-docs.yaml (haimlevy2006, Jul 23, 2024)
* `3573c58` Update whats-new-2-18.md (JamieWeider72, Jul 25, 2024)
* `309a58c` Merge pull request #877 from run-ai/JamieWeider72-patch-7 (JamieWeider72, Jul 25, 2024)
* `6f9dcd9` RUN-19295 added policy for workloads in api (JamieWeider72, Jul 30, 2024)
* `cc64a08` Merge branch 'v2.18' into RUN-19295-Addition (JamieWeider72, Jul 30, 2024)
* `85aac77` Merge pull request #884 from JamieWeider72/RUN-19295-Addition (JamieWeider72, Jul 30, 2024)
* `5e11b51` integrations (yarongol, Jul 30, 2024)
* `b80caa4` integrations (yarongol, Jul 30, 2024)
* `e2fc422` Merge pull request #886 from run-ai/integrations-not-supported (yarongol, Jul 30, 2024)
* `c9ce1f1` fix-references (yarongol, Jul 31, 2024)
* `457a13b` Merge pull request #887 from run-ai/reference-fixes (yarongol, Jul 31, 2024)
* `61be7dc` Merge pull request #890 from run-ai/managing-ai-initiatives (yarongol, Jul 31, 2024)
* `8c3830c` Merge pull request #889 from run-ai/system-monitoring (yarongol, Aug 1, 2024)
* `3271e3f` Update backend.md (JamieWeider72, Aug 1, 2024)
* `8bda4f5` Update backend.md (JamieWeider72, Aug 1, 2024)
* `5a88b86` Merge pull request #896 from run-ai/JamieWeider72-patch-10 (JamieWeider72, Aug 1, 2024)
* `933dbe1` Merge pull request #895 from run-ai/JamieWeider72-patch-9 (JamieWeider72, Aug 1, 2024)
* `e16e216` Add files via upload (JamieWeider72, Aug 1, 2024)

Files changed

8 changes: 5 additions & 3 deletions .github/workflows/automated-publish-docs.yaml
@@ -19,11 +19,13 @@ jobs:
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
+        with:
+          fetch-depth: 0

       - name: Get all v*.* branches
         id: calculate-env
         run: |
-          BRANCHES=$(git branch --list --all | grep -v master | grep 'origin/v*.*' | sed -n -E 's:.*/(v[0-9]+\.[0-9]+).*:\1:p' | sort -Vu)
+          BRANCHES=$(git branch -r | grep -E '^ *origin/v[0-9]{1,2}\.[0-9]{1,2}$' | sort -Vu | sed 's/origin\///g' | sed 's/ //g')
           NEWEST_VERSION=$(printf '%s\n' "${BRANCHES[@]}" | sort -V | tail -n 1)
           CURRENT_BRANCH=${GITHUB_REF#refs/heads/}
           ALIAS=$CURRENT_BRANCH-alias
@@ -48,7 +50,6 @@ jobs:
         uses: actions/checkout@v4
         with:
           ref: ${{ needs.env.outputs.CURRENT_BRANCH }}
-          fetch-depth: 0

       - name: setup python
         uses: actions/setup-python@v5
@@ -97,4 +98,5 @@ jobs:
           SLACK_MESSAGE_ON_SUCCESS: "Docs were updated successfully for version ${{ needs.env.outputs.TITLE }}"
           SLACK_MESSAGE_ON_FAILURE: "Docs update FAILED for version ${{ needs.env.outputs.TITLE }}"
           MSG_MINIMAL: true
-          SLACK_FOOTER: ""
+          SLACK_FOOTER: ""
+
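
For reference, a minimal sketch of what the revised branch-listing pipeline produces, assuming a typical set of remote branches (the sample input below is illustrative only):

```console
$ printf '  origin/HEAD -> origin/master\n  origin/master\n  origin/v2.17\n  origin/v2.18\n' \
    | grep -E '^ *origin/v[0-9]{1,2}\.[0-9]{1,2}$' | sort -Vu | sed 's/origin\///g' | sed 's/ //g'
v2.17
v2.18
```

Only plain `origin/vMAJOR.MINOR` branches survive the filter, so `NEWEST_VERSION` on the next line resolves to the highest version branch (here, v2.18).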

1 change: 0 additions & 1 deletion docs/Researcher/Walkthroughs/quickstart-overview.md
@@ -7,7 +7,6 @@ Follow the Quickstart documents below to learn more:
 * [Interactive build sessions with externalized services](walkthrough-build-ports.md)
 * [Using GPU Fractions](walkthrough-fractions.md)
 * [Distributed Training](walkthrough-distributed-training.md)
-* [Hyperparameter Optimization](walkthrough-hpo.md)
 * [Over-Quota, Basic Fairness & Bin Packing](walkthrough-overquota.md)
 * [Fairness](walkthrough-queue-fairness.md)
 * [Inference](quickstart-inference.md)
7 changes: 0 additions & 7 deletions docs/Researcher/best-practices/env-variables.md
@@ -13,13 +13,6 @@ Run:ai provides the following environment variables:
 Note that the Job can be deleted and then recreated with the same name. A Job UUID will be different even if the Job names are the same.


-## Identifying a Pod
-
-With [Hyperparameter Optimization](../Walkthroughs/walkthrough-hpo.md), experiments are run as _Pods_ within the Job. Run:ai provides the following environment variables to identify the Pod.
-
-* ``POD_INDEX`` - An index number (0, 1, 2, 3....) for a specific Pod within the Job. This is useful for Hyperparameter Optimization to allow easy mapping to individual experiments. The Pod index will remain the same if restarted (due to a failure or preemption). Therefore, it can be used by the Researcher to identify experiments.
-* ``POD_UUID`` - a unique identifier for the Pod. if the Pod is restarted, the Pod UUID will change.
-
 ## GPU Allocation

 Run:ai provides an environment variable, visible inside the container, to help identify the number of GPUs allocated for the container. Use `RUNAI_NUM_OF_GPUS`
8 changes: 0 additions & 8 deletions docs/Researcher/cli-reference/runai-submit.md
@@ -50,14 +50,6 @@ runai submit --name frac05 -i gcr.io/run-ai-demo/quickstart -g 0.5

 (see: [GPU fractions Quickstart](../Walkthroughs/walkthrough-fractions.md)).

-Hyperparameter Optimization
-
-```console
-runai submit --name hpo1 -i gcr.io/run-ai-demo/quickstart-hpo -g 1 \
-    --parallelism 3 --completions 12 -v /nfs/john/hpo:/hpo
-```
-
-(see: [hyperparameter optimization Quickstart](../Walkthroughs/walkthrough-hpo.md)).

 Submit a Job without a name (automatically generates a name)

2 changes: 0 additions & 2 deletions docs/Researcher/scheduling/the-runai-scheduler.md
@@ -226,5 +226,3 @@ To search for good hyperparameters, Researchers typically start a series of small

 With HPO, the Researcher provides a single script that is used with multiple, varying, parameters. Each run is a *pod* (see definition above). Unlike Gang Scheduling, with HPO, pods are **independent**. They are scheduled independently, started, and end independently, and if preempted, the other pods are unaffected. The scheduling behavior for individual pods is exactly as described in the Scheduler Details section above for Jobs.
 In case node pools are enabled, if the HPO workload has been assigned with more than one node pool, the different pods might end up running on different node pools.
-
-For more information on Hyperparameter Optimization in Run:ai see [here](../Walkthroughs/walkthrough-hpo.md)
90 changes: 90 additions & 0 deletions docs/Researcher/user-interface/workspaces/blocks/Existing PVC.md

Large diffs are not rendered by default.

Binary file added docs/admin/aiinitiatives/img/assigning.png
Binary file added docs/admin/aiinitiatives/img/bu.png
Binary file added docs/admin/aiinitiatives/img/groupbyhardware.png
Binary file added docs/admin/aiinitiatives/img/groupbytopology.png
Binary file added docs/admin/aiinitiatives/img/individuals.png
Binary file added docs/admin/aiinitiatives/img/org.png
98 changes: 98 additions & 0 deletions docs/admin/aiinitiatives/overview.md
@@ -0,0 +1,98 @@
# AI Initiatives

AI initiatives refer to efforts to advance the research, development, and implementation of AI technologies. These initiatives represent your business needs and involve collaboration between individuals, teams, and other stakeholders. AI initiatives require compute resources, together with a methodology for using those resources effectively and efficiently and for splitting them among the different stakeholders. The building blocks of AI compute are GPUs, CPUs, and CPU memory, which are built into nodes (servers) and can be further grouped into node pools. Nodes and node pools are part of a Kubernetes cluster.

To manage AI initiatives in Run:ai you should:

* Map your organization and initiatives to projects and optionally departments
* Map compute resources (node pools and quotas) to projects and optionally departments
* Assign users (e.g. AI practitioners, ML engineers, Admins) to projects and departments

## Mapping your organization

The way you map your AI initiatives and organization into Run:ai projects and departments should reflect your organization’s structure and project management practices. There are multiple options; below are three typical examples of mapping an organization, its initiatives, and its users into Run:ai, but any other mapping that suits your requirements is equally acceptable.

### Based on individuals

A typical use case would be students (individual practitioners) within a faculty (business unit), where an individual practitioner may be involved in one or more initiatives. In this example, resources are accounted for per student (project) and aggregated per faculty (department).
Department = business unit / Project = individual practitioner

![](img/individuals.png)

### Based on business units

A typical use case would be an AI service (business unit) split into AI capabilities (initiatives), where an individual practitioner may be involved in several initiatives. In this example, resources are accounted for per initiative (project) and aggregated per AI service (department).

Department = business unit / Project = initiative

![](img/bu.png)

### Based on the organizational structure

A typical use case would be a business unit split into teams, where an individual practitioner belongs to a single team (project) but the team may be involved in several AI initiatives. In this example, resources are accounted for per team (project) and aggregated per business unit (department).

Department = business unit / Project = team

![](img/org.png)

## Mapping your resources

AI initiatives require compute resources such as GPUs and CPUs to run. In any organization, compute resources are limited, whether by the number of servers (nodes) the organization owns or by the budget available to lease cloud resources or purchase in-house servers. Every organization strives to maximize the utilization of its resources while providing all users with what they need, so resources must be split according to internal priorities and budget constraints. Even after the resources are split, the orchestration layer should still provide fairness between resource consumers and allow access to unused resources, to minimize the amount of idle hardware.

Another aspect of resource management is how to group your resources effectively. This is especially important in large environments, or in environments made up of heterogeneous hardware types, where some users need specific hardware types and others should be kept from occupying hardware that is critical to certain users or initiatives.

Run:ai helps with these complex issues by allowing you to map your cluster resources to node pools, assign each project and department a quota per node pool, and set access rights to unused resources (over quota) per node pool.

### Grouping your resources

There are several reasons why you would group resources (nodes) into node pools:

* **Control the GPU type used in a heterogeneous hardware environment** - in many cases, AI models are optimized for the hardware type they run on; for example, a training workload optimized for an H100 does not necessarily run optimally on an A100, and vice versa. Segmenting nodes into node pools, each with a different hardware type, gives AI researchers and ML engineers better control over where their workloads run.
* **Quota control** - splitting into node pools allows the admin to set a specific quota per hardware type, e.g. give a high-priority project guaranteed access to advanced GPU hardware while giving a lower-priority project a lower quota, or even no quota at all, for that high-end GPU, with only “best-effort” access to it (i.e. only when the high-priority project is not using those resources).
* **Multi-region or multi-availability-zone cloud environments** - if some or all of your clusters run in the cloud (or on-premises) across different physical locations or topologies (e.g. racks), you probably want to segment your resources per region, zone, or topology so you can control where workloads run and how much quota to assign to each environment (per project and per department), even if all locations use the same hardware type. This can also improve workload performance by exploiting locality, such as the locality of distributed workloads or of local storage.
* **Explainability and predictability** - large environments are complex to understand, and even more so when they are heavily loaded. Segmenting your cluster into smaller pools can significantly help users understand the state of the resources and predict the likelihood of their workloads being scheduled.
* **Scale** - Run:ai’s node pool implementation has many benefits, one of the main ones being scale. Each node pool has its own scheduler instance, allowing the cluster to handle more nodes and schedule workloads faster than it could as one large pool. To allow your workloads to use any resource within a cluster that is split into node pools, a second-level scheduler is in charge of scheduling workloads across node pools according to your preferences and resource availability.
* **Prevent mutual exclusion** - some AI workloads consume CPU-only resources. To prevent those workloads from consuming the CPU resources of GPU nodes and blocking GPU workloads from using those nodes, it is recommended to group CPU-only nodes into one or more dedicated node pools and give CPU-only projects quota on the CPU node pools only, keeping their quota on GPU node pools at zero with, optionally, “best-effort” over-quota access.
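
In practice, node pools are typically built on top of Kubernetes node labels, with each node pool selecting a label key and value. As a minimal sketch (the label keys, values, and node names below are illustrative, not required names), an administrator might label nodes by hardware type before creating the corresponding node pools:

```console
# Illustrative only: label GPU nodes by their GPU type
kubectl label nodes node-a1 node-a2 gpu-type=a100
kubectl label nodes node-h1 node-h2 gpu-type=h100

# Keep CPU-only nodes in their own pool so they do not block GPU workloads
kubectl label nodes node-c1 node-c2 node-role=cpu-worker
```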

#### Grouping Examples

Set out below are illustrations of different grouping options.

Example: grouping nodes by topology

![](img/groupbytopology.png)


Example: grouping nodes by hardware type

![](img/groupbyhardware.png)

### Assigning your resources

After the initial grouping of resources, it is time to associate resources with AI initiatives. This is done by assigning quotas to projects and, optionally, to departments. Assigning a GPU quota to a project on a per-node-pool basis means that workloads submitted by that project are entitled to use those GPUs as guaranteed resources, for all workload types.

What happens if a project requires more resources than its quota? That depends on the type of workload the user wants to submit. If more resources are needed for non-preemptible workloads, the quota must be increased, because non-preemptible workloads require guaranteed resources. If the workload is preemptible, for example a model training workload, the project can use the unused resources of other projects as long as those projects do not need them. Over-quota is set per project (on a node pool basis) and per department.

Administrators can use quota allocations to prioritize resources among users, teams, and AI initiatives. An administrator can completely prevent a project or department from using a certain node pool by setting that node pool’s quota to 0 and disabling over-quota, or keep the quota at 0 but enable over-quota so that access depends only on resource availability (e.g. unused GPUs). When a project with a non-zero quota needs those resources, the Scheduler reclaims them and preempts the preemptible workloads of over-quota projects. As an administrator, you can also influence the amount of over-quota resources a project or department uses.

It is essential to make sure that the sum of all projects’ quotas does NOT exceed that of their department, and that the sum of all departments’ quotas does not exceed the physical resources, per node pool and for the entire cluster (exceeding them is called ‘over-subscription’). Over-subscription is not recommended because it may produce unexpected scheduling decisions, such as preempting non-preemptible workloads or failing to schedule in-quota workloads (preemptible or not), which means quota can no longer be considered ‘guaranteed’. Admins can opt in to a system flag that helps prevent over-subscription scenarios.
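
As an illustration with made-up numbers: if a department has a quota of 20 GPUs in a given node pool, the quotas of its projects in that pool should sum to no more than 20 (for example, 8 + 8 + 4); and if the cluster has 64 GPUs in that pool, the quotas of all departments in that pool should sum to no more than 64.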

Example: assigning resources to projects

![](img/assigning.png)



## Assigning users to projects and departments

The Run:ai system uses Role-Based Access Control (RBAC) to manage users’ access rights to the different objects of the system, its resources, and the set of allowed actions.
To allow AI researchers, ML engineers, project admins, or any other stakeholders of your AI initiatives to access projects and use AI compute resources, the administrator needs to assign users to projects. After a user is assigned to a project with the proper role, e.g. ‘L1 Researcher’, the user can submit and monitor workloads under that project. Users are usually assigned to departments to give a ‘Department Admin’ ownership of a specific department. Other roles, such as ‘L1 Researcher’, can also be assigned at the department level, which gives the researcher access to all projects within that department.

## Submitting workloads

Now that resources are grouped into node pools, organizational units or business initiatives are mapped into projects and departments, project quotas are set per node pool, and users are assigned to projects, you can finally submit workloads from a project and use compute resources to run your AI initiatives.
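
As an illustration, a researcher assigned to a hypothetical project named `team-a` could submit a single-GPU training workload with the Run:ai CLI, reusing the quickstart image shown in the CLI examples earlier in this PR (the project and workload names here are placeholders):

```console
runai submit --name demo-train -p team-a -i gcr.io/run-ai-demo/quickstart -g 1
```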

When a workload is submitted, it goes to the chosen Kubernetes cluster, and the Run:ai Scheduler handles it.

The Scheduler’s main role is to find the best-suited node or nodes for each submitted workload, so that those nodes match the resources and other characteristics requested by the workload while adhering to the quota and fairness principles of the Run:ai system. A workload can be a single pod running on a single node, or a distributed workload using multiple pods, each running on a node (or part of a node). It is not rare to find large training workloads that use 128 nodes or more, or inference workloads that use multiple pods and nodes. There are numerous types of workloads; some are Kubernetes-native and some are third-party extensions built on top of Kubernetes-native pods. The Run:ai Scheduler schedules Kubernetes-native workloads, Run:ai workloads, and any type of third-party workload.
