Commit f02298b

Merge pull request #909 from run-ai/project-scheduling-rules
scheduling-rules
1 parent 8520353 commit f02298b

8 files changed: +66 −17 lines changed

docs/Researcher/scheduling/the-runai-scheduler.md

Lines changed: 2 additions & 2 deletions
@@ -104,7 +104,7 @@ The Run:ai scheduler wakes up periodically to perform allocation tasks on pendin
 A *Node Pool* is a set of nodes grouped by an Administrator into a distinct group of resources from which resources can be allocated to Projects and Departments.
 By default, any node pool created in the system is automatically associated with all Projects and Departments using zero quota resource (GPUs, CPUs, Memory) allocation. This allows any Project and Department to use any node pool with Over-Quota (for Preemptible workloads), thus maximizing the system resource utilization.
 
-* An Administrator can allocate resources from a specific node pool to chosen Projects and Departments. See [Project Setup](../../admin/admin-ui-setup/project-setup.md#limit-jobs-to-run-on-specific-node-groups)
+* An Administrator can allocate resources from a specific node pool to chosen Projects and Departments. See [Project Scheduling Rules](../../admin/aiinitiatives/org/scheduling-rules.md)
 * The Researcher can use node pools in two ways. The first one is where a Project has guaranteed resources on node pools - The Researcher can then submit a workload and specify a single node pool or a prioritized list of node pools to use and receive guaranteed resources.
 The second is by using node-pool(s) with no guaranteed resource for that Project (zero allocated resources), and in practice using Over-Quota resources of node-pools. This means a Workload must be Preemptible as it uses resources out of the Project or node pool quota. The same scenario occurs if a Researcher uses more resources than allocated to a specific node pool and goes Over-Quota.
 * By default, if a Researcher doesn't specify a node-pool to use by a workload, the scheduler assigns the workload to run using the Project's 'Default node-pool list'.
@@ -113,7 +113,7 @@ The second is by using node-pool(s) with no guaranteed resource for that Project
 
 Both the Administrator and the Researcher can provide limitations as to which nodes can be selected for the Job. Limits are managed via [Kubernetes labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/){target=_blank}:
 
-* The Administrator can set limits at the Project level. Example: Project `team-a` can only run `interactive` Jobs on machines with a label of `v-100` or `a-100`. See [Project Setup](../../admin/admin-ui-setup/project-setup.md#limit-jobs-to-run-on-specific-node-groups) for more information.
+* The Administrator can set limits at the Project level. Example: Project `team-a` can only run `interactive` Jobs on machines with a label of `v-100` or `a-100`. See [Project Scheduling Rules](../../admin/aiinitiatives/org/scheduling-rules.md) for more information.
 * The Researcher can set a limit at the Job level, by using the command-line interface flag `--node-type`. The flag acts as a subset to the Project setting.
 
 Node affinity constraints are used during the *Allocation* phase to filter out candidate nodes for running the Job. For more information on how nodes are filtered see the `Filtering` section under [Node selection in kube-scheduler](https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler-implementation){target=_blank}. The Run:ai scheduler works similarly.
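The subset relationship described in this hunk (the Job's `--node-type` may only narrow the Project's label limits, never widen them) can be sketched as a simple filter. This is a minimal illustration under assumed data shapes, not the Run:ai scheduler's actual code; `candidate_nodes` and the node/label dictionary are hypothetical:

```python
# Illustrative sketch only: how Project-level and Job-level node-type limits
# could combine into a node filter. Names are hypothetical, not Run:ai APIs.

def candidate_nodes(nodes, project_labels, job_node_types=None):
    """Return nodes whose label is allowed by the Project and, if given,
    also requested by the Job (--node-type acts as a subset of the Project)."""
    allowed = set(project_labels)
    if job_node_types is not None:
        # The Job may only narrow the Project's setting, never widen it.
        allowed &= set(job_node_types)
    return [name for name, label in nodes.items() if label in allowed]

nodes = {"node1": "v-100", "node2": "a-100", "node3": "t4"}
# Project team-a allows v-100 and a-100; the Job asks for a-100 only.
print(candidate_nodes(nodes, ["v-100", "a-100"], ["a-100"]))  # ['node2']
```

A node labeled `t4` is filtered out at the Project level even if a Job requested it, mirroring how the flag acts as a subset of the Project setting.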

docs/Researcher/user-interface/workspaces/overview.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ When the workspace is active it exposes the connections to the tools (for exampl
 ![](img/2-connecting-to-tools.png)
 
 
-An active workspace is a Run:ai [interactive workload](../../../admin/workloads/workload-overview-admin.md). The interactive workload starts when the workspace is started and stopped when the workspace is stopped.
+An active workspace is a Run:ai [interactive workload](../../../admin/workloads/submitting-workloads.md). The interactive workload starts when the workspace is started and stops when the workspace is stopped.
 
 
 Workspaces can be used via the user interface or programmatically via the Run:ai [Admin API](../../../developer/admin-rest-api/overview.md). Workspaces are not supported via the command line interface. You can still run an interactive workload via the command line.

docs/Researcher/user-interface/workspaces/statuses.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ The *Initializing* status indicates that the workspace has been scheduled and is
 The *Active* status indicates that the workspace is ready to be used and allows the researcher to connect to its tools. At this status, the workspace is consuming resources and affecting the project’s quota. The workspace will turn to active status once the `Active` button is pressed, the activation process ends up successfully and relevant resources are available and vacant.
 
 ## Stopped workspace
-The *Stopped* status indicates that the workspace is currently unused and does not consume any resources. A workspace can be stopped either manually, or automatically if triggered by idleness criteria set by the admin (see [Limit duration of interactive Jobs](../../../admin/admin-ui-setup/project-setup.md#limit-duration-of-interactive-and-training-jobs)).
+The *Stopped* status indicates that the workspace is currently unused and does not consume any resources. A workspace can be stopped either manually, or automatically if triggered by idleness criteria set by the admin (see [Limit duration of interactive Jobs](../../../admin/aiinitiatives/org/scheduling-rules.md)).
 
 ## Failed workspace

docs/admin/aiinitiatives/org/scheduling-rules.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+This article explains the procedure for configuring and managing scheduling rules. Scheduling rules are restrictions applied to workloads: they limit either the resources (nodes) on which workloads can run, or the duration of the workload's run time. Scheduling rules are set per Project and apply to a specific workload type. Once scheduling rules are set for a project, all matching workloads associated with the project are subject to the restrictions that were defined when the workload was submitted. New scheduling rules added to a project are not applied to workloads already created in that project.
+
+There are three types of rules:
+
+* **Workload time limit** - This rule limits the duration of a workload's run time. Run time is calculated as the total time the workload has been in the “Running” status.
+* **Idle GPU time limit** - This rule limits the total time a workload's GPUs may sit idle. Idle time is counted from the first time the workload was in the “Running” status with an idle GPU.
+    For fractional workloads, workloads running on a MIG slice, and multi-GPU or multi-node workloads, each GPU idle second is calculated as follows: __<requires explanation about how it is calculated__
+* **Node type (Affinity)** - This rule limits a workload to run on specific node types. A node type is a node affinity applied to the node. Run:ai labels the nodes with the appropriate affinity and tells the scheduler where the workload is allowed to be scheduled.
+
+## Adding a scheduling rule to a project
+
+To add a scheduling rule:
+
+1. Select the project you want to add a scheduling rule for.
+2. Click **EDIT**.
+3. In the **Scheduling rules** section, click **\+RULE**.
+4. Select the **rule type**.
+5. Select the **workload type** and **time limitation period**.
+6. For *Node type*, choose one or more labels for the desired nodes.
+7. Click **SAVE**.
+
+!!! Note
+    You can review the defined rules in the Projects table in the relevant column.
+
+## Editing the project’s scheduling rule
+
+To edit a scheduling rule:
+
+1. Select the project whose scheduling rule you want to edit.
+2. Click **EDIT**.
+3. Find the scheduling rule you would like to edit.
+4. Edit the rule.
+5. Click **SAVE**.
+
+## Deleting the project’s scheduling rule
+
+To delete a scheduling rule:
+
+1. Select the project you want to delete a scheduling rule from.
+2. Click **EDIT**.
+3. Find the scheduling rule you would like to delete.
+4. Click the x icon.
+5. Click **SAVE**.
+
+## Using the API
+
+Go to the [Projects](https://app.run.ai/api/docs#tag/Projects/operation/create_project) API reference to view the available actions.
+
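The workload time limit rule added in this file counts run time as the total time spent in the “Running” status. A minimal sketch of that accounting, assuming run time is summed over status intervals (so a preempted-and-resumed workload accrues in pieces); the function names and interval representation are hypothetical, not Run:ai's implementation:

```python
# Illustrative sketch only: evaluating a "workload time limit" rule.
# Run time is the total time spent in the "Running" status, summed over
# intervals; queueing and preemption gaps do not count.

def total_run_seconds(status_intervals):
    """status_intervals: list of (status, start_sec, end_sec) tuples."""
    return sum(end - start
               for status, start, end in status_intervals
               if status == "Running")

def exceeds_time_limit(status_intervals, limit_seconds):
    return total_run_seconds(status_intervals) > limit_seconds

history = [
    ("Pending", 0, 60),       # waiting in queue: does not count
    ("Running", 60, 3660),    # 1 hour of run time
    ("Pending", 3660, 3720),  # preempted: does not count
    ("Running", 3720, 7320),  # another hour of run time
]
print(total_run_seconds(history))             # 7200
print(exceeds_time_limit(history, 2 * 3600))  # False (exactly at the limit)
```

Note that, per the article, such a rule applies only to workloads submitted after the rule is set; existing workloads keep their original restrictions.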

docs/home/whats-new-2-13.md

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ The association between workspaces and node pools is done using *Compute resourc
 
 **Time limit duration**
 
-* Improved the behavior of any workload time limit (for example, *Idle time limit*) so that the time limit will affect existing workloads that were created before the time limit was configured. This is an optional feature which provides help in handling situations where researchers leave sessions open even when they do not need to access the resources. For more information, see [Limit duration of interactive training jobs](../admin/admin-ui-setup/project-setup.md#limit-duration-of-interactive-and-training-jobs).
+* Improved the behavior of any workload time limit (for example, *Idle time limit*) so that the time limit will affect existing workloads that were created before the time limit was configured. This is an optional feature which provides help in handling situations where researchers leave sessions open even when they do not need to access the resources. For more information, see [Limit duration of interactive training jobs](#).
 
 * Improved workspaces time limits. Workspaces that reach a time limit will now transition to a state of `stopped` so that they can be reactivated later.

docs/home/whats-new-2-15.md

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ date: 2023-Dec-3
 * Improved filters and search
 * More information
 
-Use the toggle at the top of the *Jobs* page to switch to the *Workloads* view. For more information, see [Workloads](../admin/workloads/workload-overview-admin.md#workloads-view).
+Use the toggle at the top of the *Jobs* page to switch to the *Workloads* view.
 
 * <!-- RUN-10639/RUN-11389 - Researcher Service Refactoring RUN-12505/RUN-12506 - Support Kubeflow notebooks for scheduling/orchestration -->Improved support for Kubeflow Notebooks. Run:ai now supports the scheduling of Kubeflow notebooks with fractional GPUs. Kubeflow notebooks are identified automatically and appear with a dedicated icon in the *Jobs* UI.
 * <!-- RUN-11292/RUN-11592 General changes in favor of any asset based workload \(WS, training, DT\)-->Improved the *Trainings* and *Workspaces* forms. Now the runtime field for *Command* and *Arguments* can be edited directly in the new *Workspace* or *Training* creation form.

graveyard/whats-new-2022.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 The command-line interface utility for version 2.3 is not compatible with a cluster version of 2.5 or later. If you upgrade the cluster, you must also upgrade the command-line interface.
 * __Inference__. Run:ai inference offering has been overhauled with the ability to submit deployments via the user interface and a new and consistent API. For more information see [Inference overview](../admin/workloads/inference-overview.md). To enable the new inference module, call Run:ai customer support.
 * __CPU and CPU memory quotas__ can now be configured for projects and departments. These are hard quotas, which means that the total amount of the requested resource for all workloads associated with a project/department cannot exceed the set limit. To enable this feature, please call Run:ai customer support.
-* __Workloads__. We have revamped the way Run:ai submits Jobs. Run:ai now submits [Workloads](../admin/workloads/workload-overview-admin.md). The change includes:
+* __Workloads__. We have revamped the way Run:ai submits Jobs. Run:ai now submits [Workloads](../admin/workloads/submitting-workloads.md). The change includes:
 * New [Cluster API](../developer/cluster-api/workload-overview-dev.md). The older [API](../developer/deprecated/researcher-rest-api/overview.md) has been deprecated and remains for backward compatibility. The API creates all the resources required for the run, including volumes, services, and the like. It also deletes all resources when the workload itself is deleted.
 * Administrative templates have been replaced with [Policies](../admin/workloads/policies.md). Policies apply across all ways to submit jobs: command-line, API, and user interface.
 * `runai delete` has been changed in favor of `runai delete job`

mkdocs.yml

Lines changed: 11 additions & 10 deletions
@@ -202,21 +202,12 @@ nav:
   - 'Setup cluster wide PVC' : 'admin/researcher-setup/cluster-wide-pvc.md'
   - 'Group Nodes' : 'admin/researcher-setup/limit-to-node-group.md'
 # - 'Messaging setup' : 'admin/researcher-setup/email-messaging.md'
-  - 'Workloads' :
-    - 'admin/workloads/README.md'
-    - 'Policies' :
-      - 'admin/workloads/policies/README.md'
-      - 'Former Policies' : 'admin/workloads/policies/policies.md'
-      - 'Training Policy' : 'admin/workloads/policies/training-policy.md'
-      - 'Workspaces Policy' : 'admin/workloads/policies/workspaces-policy.md'
-    - 'Secrets' : 'admin/workloads/secrets.md'
-    - 'Inference' : 'admin/workloads/inference-overview.md'
-    - 'Submitting Workloads' : 'admin/workloads/submitting-workloads.md'
   - 'Managing AI Intiatives' :
     - 'Overview' : 'admin/aiinitiatives/overview.md'
     - 'Managing your Organization' :
       - 'Projects' : 'admin/aiinitiatives/org/projects.md'
       - 'Departments' : 'admin/aiinitiatives/org/departments.md'
+      - 'Scheduling Rules' : 'admin/aiinitiatives/org/scheduling-rules.md'
 #   - 'Managing your resources' :
 #     - 'Nodes' : 'admin/aiinitiatives/resources/nodes.md'
 #     - 'Node Pools' : 'admin/aiinitiatives/resources/node-pools.md'
@@ -229,6 +220,16 @@ nav:
     - 'Jobs' : 'admin/admin-ui-setup/jobs.md'
     - 'Credentials' : 'admin/admin-ui-setup/credentials-setup.md'
     - 'Templates': 'admin/admin-ui-setup/templates.md'
+  - 'Workloads' :
+    - 'admin/workloads/README.md'
+    - 'Policies' :
+      - 'admin/workloads/policies/README.md'
+      - 'Former Policies' : 'admin/workloads/policies/policies.md'
+      - 'Training Policy' : 'admin/workloads/policies/training-policy.md'
+      - 'Workspaces Policy' : 'admin/workloads/policies/workspaces-policy.md'
+    - 'Secrets' : 'admin/workloads/secrets.md'
+    - 'Inference' : 'admin/workloads/inference-overview.md'
+    - 'Submitting Workloads' : 'admin/workloads/submitting-workloads.md'
   - 'Troubleshooting' :
     - 'Cluster Health' : 'admin/troubleshooting/cluster-health-check.md'
     - 'Troubleshooting' : 'admin/troubleshooting/troubleshooting.md'
