The Run:ai Administration User Interface provides a set of dashboards that help you monitor Clusters, Cluster Nodes, Projects, and Workloads. This document describes the key metrics to monitor, how to assess them, and the suggested actions to take.

Dashboards are used by system administrators to analyze and diagnose issues that relate to:
* Physical resources.
* Organization resource allocation and utilization.
* Usage characteristics.
System administrators need key information about the physical resources that are currently being used, such as:
* Resource health.
* Available resources and their distribution.
* Whether there is a lack of resources.
* Whether resources are being utilized correctly.
With this information, system administrators can home in on:
* How resources are allocated across the organization.
* How the different organizational units utilize quotas and the resources within those quotas.
* The actual performance of the organizational units.
These dashboards let system administrators drill down into the details of the different types of workloads that each of the organizational units is running. With these usage and performance metrics, system administrators can take action to correct issues that affect performance.
There are 5 dashboards:
* [**GPU/CPU Overview**](#gpucpu-overview-dashboard) dashboard—Provides information about what is happening right now in the cluster.
* [**Quota Management**](#quota-management-dashboard) dashboard—Provides information about quota utilization.
* [**Analytics**](#analytics-dashboard) dashboard—Provides long-term analysis of cluster behavior.
* [**Multi-Cluster Overview**](#multi-cluster-overview-dashboard) dashboard—Provides a more holistic, multi-cluster view of what is happening right now. The dashboard is intended for organizations that have more than one connected cluster.
* [**Consumption**](#consumption-dashboard) dashboard—Provides information about resource consumption.
## GPU/CPU Overview Dashboard (New and legacy)
The Overview dashboard provides information about what is happening **right now** in the cluster. Administrators can view high-level information on the state of the cluster. The dashboard has two tabs that change the display to provide a focused view for [GPU Dashboards](#gpu-dashboard) (default view) and [CPU Dashboards](#cpu-dashboard).
### GPU Dashboard
The GPU dashboard displays information specific to GPU-based nodes, node pools, clusters, or tenants. These dashboards also include additional metrics that are specific to GPU-based environments. The dashboard contains tiles that show information about resource allocation and performance metrics. The tiles are interactive, allowing you to link directly to the assets or drill down to specific scopes. Use the time frame selector to choose a time frame for all the tiles in the dashboard.
The dashboard has the following tiles:
* Ready nodes—displays the number of GPU nodes that are in the ready state.
* Ready GPU devices—displays the number of GPUs in nodes that are in the ready state.
* Allocated GPU compute—displays the total number of GPUs allocated from all the nodes.
* Running workloads—displays the number of running workloads.
* Pending workloads—displays the number of workloads in the pending status.
* Allocation ratio by node pool—displays the percentage of GPUs allocated per node pool. Hover over the bar for detailed information. Use the scope selector at the bottom of the graph to drill down for more details.
* Free resources by node pool—the graph displays the amount of free resources per node pool. Press an entry in the graph for more details. Hover over the resource bubbles for specific details about the workers in the node. Use the ellipsis to download the graph as a CSV file.
* Resource allocation by workload type—displays the resource allocation by workload type. Hover over the bar for detailed information. Use the scope selector at the bottom of the graph to drill down for more details. Use the ellipsis to download the graph as a CSV file.
* Workloads by status—displays the number of workloads for each status in the workloads table. Hover over the bar for detailed information. Use the scope selector at the bottom of the graph to drill down for more details. Use the ellipsis to download the graph as a CSV file.
* Resources utilization—displays the resource utilization over time. The right pane of the graph shows the average utilization for the selected time frame of the dashboard. Hover over the graph to see details for a specific point in time. Use the ellipsis to download the graph as a CSV file.
* Resource allocation—displays the resource allocation over time. The right pane of the graph shows the average allocation for the selected time frame of the dashboard. Hover over the graph to see details for a specific point in time. Use the ellipsis to download the graph as a CSV file.
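
Several of these tiles export their data as CSV. As a minimal post-processing sketch — the file name and the `timestamp` and `gpu_utilization` column names are assumptions for illustration, not a documented export schema — you could recompute the kind of averages the utilization tiles summarize:

```python
# Minimal sketch: summarize a CSV downloaded from a utilization tile.
# Column names are assumed; match them to the actual header row of
# your export.
import pandas as pd

df = pd.read_csv("resources_utilization.csv", parse_dates=["timestamp"])

# Average utilization over the exported time frame (the figure the
# tile's right pane summarizes).
print(f"Average GPU utilization: {df['gpu_utilization'].mean():.1f}%")

# Hourly averages help spot sustained low-utilization periods.
hourly = df.set_index("timestamp")["gpu_utilization"].resample("1h").mean()
print(hourly[hourly < 10])  # hours averaging under 10% utilization
```
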
### CPU Dashboard
The CPU dashboards display information specific to CPU-based nodes, node pools, clusters, or tenants. These dashboards also include additional metrics that are specific to CPU-based environments.
To enable CPU Dashboards:
1. Press the `Tools & Settings` icon, then press `General`.
2. Open the `Analytics` pane and toggle the *Show CPU dashboard* switch to enable the feature.
Toggle the switch off to disable the *CPU Dashboards* option.
The dashboard contains the following tiles:
* Total CPU nodes—displays the total number of CPU nodes.
* Ready CPU nodes—displays the total number of CPU nodes in the ready state.
* Total CPUs—displays the total number of CPUs.
* Ready CPUs—displays the total number of CPUs in the ready state.
* Allocated CPUs—displays the number of allocated CPUs.
* Running workloads—displays the number of workloads in the running state.
* Pending workloads—displays the number of workloads in the pending state.
* Allocated CPUs per project—displays the number of CPUs allocated per project.
* Active projects—displays the active projects with their CPU allocation and the number of running and pending workloads.
* Utilization per resource type—displays the CPU compute and CPU memory utilization over time.
* CPU compute utilization—displays the current CPU compute utilization.
* CPU memory utilization—displays the current CPU memory utilization.
* Pending workloads—displays the requested resources and wait time for workloads in the pending status.
* Workloads with error—displays the number of workloads that are currently not running due to an error.
* Workload Count per CPU Compute Utilization—displays the number of workloads per CPU compute utilization level.
* 5 longest running workloads—displays up to 5 workloads that have the longest running time.

### Workloads with idle GPUs or CPUs

Locate workloads with idle GPUs or CPUs, defined as GPUs/CPUs with 0% utilization for more than 5 minutes.
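
A minimal sketch of that idle test, assuming one utilization sample per minute (the sampling interval and data source are assumptions):

```python
# Minimal sketch: flag a device as idle when utilization stays at 0%
# for more than 5 consecutive minutes (one sample per minute assumed).
def is_idle(samples_pct, limit_minutes=5):
    run = 0
    for u in samples_pct:
        run = run + 1 if u == 0 else 0
        if run > limit_minutes:
            return True
    return False

print(is_idle([0, 0, 0, 0, 0, 0]))      # True: six consecutive idle minutes
print(is_idle([0, 0, 0, 40, 0, 0, 0]))  # False: the idle run is broken
```
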
**Analysis and Suggested actions**:
| Review | Analysis & Actions |
|---------|---------------------|
| Interactive Workloads are too frequently idle | - Consider setting time limits for interactive Workloads through the Projects tab. <br> - Consider also reducing GPU/CPU quotas for specific Projects to encourage users to run more training Workloads as opposed to interactive Workloads (note that interactive Workloads cannot use more than the GPU/CPU quota assigned to their Project). |
| Training Workloads are too frequently idle | Identify and notify the right users and work with them to improve the utilization of their training scripts. |
### Workloads with an Error
Search for Workloads with an error status. These Workloads may be holding GPUs/CPUs without actually using them.
**Analysis and Suggested actions**:
Search for workloads with an Error status on the Workloads view and discuss with the Job owner. Consider deleting these Workloads to free up the resources for other users.
### Workloads with a Long Duration
View the list of the 5 longest-running Workloads.
**Analysis and Suggested actions**:
| Review | Analysis & Actions |
|---------|---------------------|

Also, check the command that the user used to submit the job. The Researcher may have requested a specific Node for that Job.
## Analytics Dashboard
The Analytics dashboard provides a means for viewing historical data on cluster information.

## Consumption dashboard

This dashboard enables users and admins to view the consumption of resources when using Run:ai services. The dashboard provides views based on configurable filters and timelines. The dashboard also provides costing analysis for GPU, CPU, and memory costs for the system.
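
As an illustration of the costing idea — all rates and consumption figures below are placeholders, not Run:ai defaults — the cost analysis amounts to multiplying consumed resource-hours by configured hourly rates:

```python
# Minimal sketch of consumption costing: resource-hours x hourly rate.
# Every figure below is an illustrative placeholder.
consumed = {"gpu": 312.5, "cpu": 1840.0, "memory_gb": 9600.0}  # resource-hours
rates = {"gpu": 2.50, "cpu": 0.08, "memory_gb": 0.01}          # cost per hour

total = sum(consumed[k] * rates[k] for k in consumed)
print(f"Estimated consumption cost: {total:,.2f}")
# 781.25 + 147.20 + 96.00 = 1,024.45
```
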
## Quota management dashboard

The Quota management dashboard provides an efficient means to monitor and manage resource utilization within the AI cluster. The dashboard is divided into sections with essential metrics and data visualizations to identify resource usage patterns, potential bottlenecks, and areas for optimization. The sections of the dashboard include:
* **Add Filter**
* **Quota / Total**
* **Allocated / Quota**
* **Pending workloads**
* **Quota by node pool**
* **Allocation by node pool**
* **Pending workloads by node pool**
* **Departments with lowest allocation by node pool**
* **Projects with lowest allocation ratio by node pool**
* **Over time allocation / quota**
### Add Filter
Use the *Add Filter* dropdown to select filters for the dashboard. The filters will change the data shown on the dashboard. Available filters are:
* Departments
* Projects
* Nodes
Select a filter from the dropdown, then select an item from the list, and press Apply.
!!! Note
    You can create a filter with multiple categories, but you can use each category and item only once.
### Quota / Total
This section shows the number of GPUs that are in the quota based on the filter selection. The quota of GPUs is the number of GPUs that are reserved for use.
### Allocated / Quota
This section shows the number of GPUs that are allocated based on the filter selection. Allocated GPUs are the number of GPUs that are being used.
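
For example, with a filter scoped to a project that has a quota of 10 GPUs and 7 of them currently allocated, *Quota / Total* counts the 10 reserved GPUs while *Allocated / Quota* shows the 7 GPUs in use (an allocation ratio of 70%). The numbers here are illustrative.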
### Pending workloads
This section shows the number of workloads that are pending based on the filter selection. Pending workloads are workloads that have not started.
### Quota by node pool
This section shows the quota of GPUs by node pool based on the filter. The quota is the number of GPUs that are reserved for use. You can drill down into the data in this section by pressing on the graph or the link at the bottom of the section.
### Allocation by node pool
This section shows the allocation of GPUs by node pool based on the filter. The allocation is the number of GPUs that are being used. You can drill down into the data in this section by pressing on the graph or the link at the bottom of the section.
### Pending workloads by node pool
This section shows the number of pending workloads by node pool. You can drill down into the data in this section by pressing on the graph or the link at the bottom of the section.
### Departments with lowest allocation by node pool
This section shows the departments with the lowest allocation of GPUs by percentage relative to the total number of GPUs.
### Projects with lowest allocation ratio by node pool
This section shows the projects with the lowest allocation of GPUs by percentage relative to the total number of GPUs.
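
For example — with made-up project names and figures — the ranking this section surfaces is the allocated-to-quota ratio, sorted ascending:

```python
# Minimal sketch: rank projects by allocation ratio (allocated / quota),
# lowest first. Names and numbers are made up for illustration.
projects = {
    "team-a": {"quota": 10, "allocated": 2},
    "team-b": {"quota": 8, "allocated": 7},
    "team-c": {"quota": 4, "allocated": 1},
}

for name, p in sorted(projects.items(),
                      key=lambda kv: kv[1]["allocated"] / kv[1]["quota"]):
    ratio = p["allocated"] / p["quota"]
    print(f"{name}: {p['allocated']}/{p['quota']} GPUs allocated ({ratio:.0%})")
# team-a: 2/10 GPUs allocated (20%)
# team-c: 1/4 GPUs allocated (25%)
# team-b: 7/8 GPUs allocated (88%)
```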
### Over time allocation / quota

This section shows the allocation of GPUs from the quota over a period of time.
0 commit comments