Commit de27349

Merge pull request #733 from jasonnovichRunAI/v2.17-RUN-16507-dashboards-mvp
2 parents 9085dbf + 46d2e5c

File tree: 1 file changed, +132 -122 lines

docs/admin/admin-ui-setup/dashboard-analysis.md

The Run:ai Administration User Interface provides a set of dashboards that help you monitor Clusters, Cluster Nodes, Projects, and Workloads. This document describes the key metrics to monitor, how to assess them, and suggested actions.

Dashboards are used by system administrators to analyze and diagnose issues that relate to:

* Physical resources.
* Organization resource allocation and utilization.
* Usage characteristics.

System administrators need key information about the physical resources currently in use, such as:

* Resource health.
* Available resources and their distribution.
* Whether there is a lack of resources.
* Whether resources are being utilized correctly.

With this information, system administrators can focus on:

* How resources are allocated across the organization.
* How the different organizational units utilize quotas and the resources within those quotas.
* The actual performance of the organizational units.

These dashboards also give system administrators the ability to drill down into the details of the different types of workloads that each organizational unit is running. These usage and performance metrics enable system administrators to take action to correct issues that affect performance.

There are 5 dashboards:

* [**GPU/CPU Overview**](#gpucpu-overview-dashboard) dashboard—Provides information about what is happening right now in the cluster.
* [**Quota Management**](#quota-management-dashboard) dashboard—Provides information about quota utilization.
* [**Analytics**](#analytics-dashboard) dashboard—Provides long-term analysis of cluster behavior.
* [**Multi-Cluster Overview**](#multi-cluster-overview-dashboard) dashboard—Provides a more holistic, multi-cluster view of what is happening right now. The dashboard is intended for organizations that have more than one connected cluster.
* [**Consumption**](#consumption-dashboard) dashboard—Provides information about resource consumption.

## GPU/CPU Overview Dashboard (New and legacy)

The Overview dashboard provides information about what is happening **right now** in the cluster. Administrators can view high-level information on the state of the cluster. The dashboard has two tabs that provide a focused view for [GPU Dashboards](#gpu-dashboard) (the default view) and [CPU Dashboards](#cpu-dashboard).

### GPU Dashboard

The GPU dashboard displays specific information for GPU-based nodes, node pools, clusters, or tenants, and includes additional metrics that are specific to GPU-based environments. The dashboard contains tiles that show information about specific resource allocation and performance metrics. The tiles are interactive, allowing you to link directly to the assets or drill down to specific scopes. Use the time frame selector to choose a time frame for all the tiles in the dashboard.

The dashboard has the following tiles:

* Ready nodes—displays the number of GPU nodes that are in the ready state.
* Ready GPU devices—displays the number of GPUs in nodes that are in the ready state.
* Allocated GPU compute—displays the total number of GPUs allocated from all the nodes.
* Running workloads—displays the number of running workloads.
* Pending workloads—displays the number of workloads in the pending status.
* Allocation ratio by node pool—displays the percentage of GPUs allocated per node pool (a short worked sketch of this calculation follows the list). Hover over the bar for detailed information. Use the scope selector at the bottom of the graph to drill down for more details.
* Free resources by node pool—displays the amount of free resources per node pool. Press an entry in the graph for more details. Hover over the resource bubbles for specific details of the workers in the node. Use the ellipsis to download the graph as a CSV file.
* Resource allocation by workload type—displays the resource allocation by workload type. Hover over the bar for detailed information. Use the scope selector at the bottom of the graph to drill down for more details. Use the ellipsis to download the graph as a CSV file.
* Workload by status—displays the number of workloads for each status in the workloads table. Hover over the bar for detailed information. Use the scope selector at the bottom of the graph to drill down for more details. Use the ellipsis to download the graph as a CSV file.
* Resources utilization—displays the resource utilization over time. The right pane of the graph shows the average utilization for the selected time frame of the dashboard. Hover over the graph to see details for a specific time. Use the ellipsis to download the graph as a CSV file.
* Resource allocation—displays the resource allocation over time. The right pane of the graph shows the average allocation for the selected time frame of the dashboard. Hover over the graph to see details for a specific time. Use the ellipsis to download the graph as a CSV file.
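
The allocation-ratio tile is simple arithmetic: allocated GPUs divided by total GPUs, per node pool. The following is a minimal sketch of that calculation; the node-pool records and field names are invented for illustration and are not Run:ai API objects.

```python
# Sketch: percentage of GPUs allocated per node pool, as visualized by
# the "Allocation ratio by node pool" tile. The records below are
# hypothetical; they are not returned by any Run:ai API.
node_pools = [
    {"name": "default", "total_gpus": 16, "allocated_gpus": 12},
    {"name": "a100-pool", "total_gpus": 8, "allocated_gpus": 2},
]

for pool in node_pools:
    ratio = pool["allocated_gpus"] / pool["total_gpus"] * 100
    free = pool["total_gpus"] - pool["allocated_gpus"]
    print(f"{pool['name']}: {ratio:.0f}% allocated, {free} GPUs free")
```

This prints `default: 75% allocated, 4 GPUs free` and `a100-pool: 25% allocated, 6 GPUs free`, mirroring what the allocation and free-resources tiles visualize.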

### CPU Dashboard

The CPU dashboards display specific information for CPU-based nodes, node pools, clusters, or tenants, and include additional metrics that are specific to CPU-based environments.

To enable CPU Dashboards:

1. Press the `Tools & Settings` icon, then press `General`.
2. Open the `Analytics` pane and toggle the *Show CPU dashboard* switch to enable the feature.

Toggle the switch off to disable the *CPU Dashboards* option.

The dashboard contains the following tiles:

* Total CPU Nodes—displays the total number of CPU nodes.
* Ready CPU nodes—displays the total number of CPU nodes in the ready state.
* Total CPUs—displays the total number of CPUs.
* Ready CPUs—displays the total number of CPUs in the ready state.
* Allocated CPUs—displays the number of allocated CPUs.
* Running workloads—displays the number of workloads in the running state.
* Pending workloads—displays the number of workloads in the pending state.
* Allocated CPUs per project—displays the number of CPUs allocated per project.
* Active projects—displays the active projects with their CPU allocation and the number of running and pending workloads.
* Utilization per resource type—displays the CPU compute and CPU memory utilization over time.
* CPU compute utilization—displays the current CPU compute utilization.
* CPU memory utilization—displays the current CPU memory utilization.
* Pending workloads—displays the requested resources and wait time for workloads in the pending status.
* Workloads with error—displays the number of workloads that are currently not running due to an error.
* Workload Count per CPU Compute Utilization—displays the number of workloads grouped by their CPU compute utilization.
* 5 longest running workloads—displays up to 5 workloads with the longest running time.

### Workloads with idle GPUs or CPUs

Locate workloads with idle GPUs or CPUs, defined as GPUs/CPUs with 0% utilization for more than 5 minutes.

**Analysis and Suggested actions**:

| Review | Analysis & Actions |
|---------|---------------------|
| Interactive Workloads are too frequently idle | * Consider setting time limits for interactive Workloads through the Projects tab. <br> * Consider reducing GPU/CPU quotas for specific Projects to encourage users to run more training Workloads rather than interactive Workloads (note that interactive Workloads cannot use more than the GPU/CPU quota assigned to their Project). |
| Training Workloads are too frequently idle | Identify and notify the relevant users and work with them to improve the utilization of their training scripts. |
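
As a rough illustration of the idle rule above (0% utilization for more than 5 minutes), the sketch below scans per-minute utilization samples for an idle streak. The one-sample-per-minute cadence and the data are assumptions made for the example, not how the platform measures idleness internally.

```python
# Sketch: flag a workload as idle when its GPU/CPU utilization has been
# 0% for more than 5 consecutive minutes. One sample per minute is an
# assumption made for this example only.
IDLE_MINUTES = 5

def is_idle(utilization_samples):
    """utilization_samples: per-minute utilization percentages, oldest first."""
    streak = 0
    for util in utilization_samples:
        streak = streak + 1 if util == 0 else 0
    return streak > IDLE_MINUTES

print(is_idle([40, 0, 0, 0, 0, 0, 0]))  # True: 6 consecutive idle minutes
print(is_idle([0, 0, 0, 35, 0, 0]))     # False: the idle streak was broken
```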

### Workloads with an Error

Search for Workloads with an error status. These Workloads may be holding GPUs/CPUs without actually using them.

**Analysis and Suggested actions**:

Search for Workloads with an error status in the Workloads view and discuss it with the Job owner. Consider deleting these Workloads to free up the resources for other users.

### Workloads with a Long Duration

View the list of the 5 longest-running Workloads.

**Analysis and Suggested actions**:

| Review | Analysis & Actions |

### Job Queue

Identify queueing bottlenecks.

**Analysis and Suggested actions**:

| Review | Analysis & Actions |

Also, check the command that the user used to submit the Job. The Researcher may have requested a specific Node for that Job.

## Analytics Dashboard

The Analytics dashboard provides a means of viewing historical data on cluster information.

## Multi-Cluster Overview Dashboard

Provides a holistic, aggregated view across Clusters. The dashboard is intended for organizations that have more than one connected cluster.

## Consumption Dashboard

This dashboard enables users and admins to view consumption when using Run:ai services. The dashboard provides views based on configurable filters and timelines, and also provides cost analysis of GPU, CPU, and memory consumption for the system.

![!consumption dashboard](img/consumption-dashboard.png)

The dashboard has 4 tiles for:

* Cumulative GPU allocation per Project or Department
* Cumulative CPU allocation per Project or Department

Use the drop-down menus at the top of the dashboard to apply filters for:

* Per department (single, multiple, or all)
* Per cluster (single, multiple, or all)

To enable the Consumption Dashboard:

1. Press the `Tools & Settings` icon, then press `General`.
2. Open the `Analytics` pane and toggle the *Consumption* switch to enable the feature.
3. Enter the cost of:
    1. GPU compute / Hour
    2. CPU compute / Hour
    3. CPU memory / Hour
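
Once the three hourly rates are entered, the dashboard's cost figures reduce to consumption multiplied by rate. Below is a minimal sketch of that arithmetic; the rates and usage numbers are invented for illustration and are not platform defaults.

```python
# Sketch of the consumption-cost arithmetic: hourly rates (as entered in
# the Analytics settings) multiplied by the consumed resource-hours.
# All figures below are made up for the example.
gpu_rate = 2.50   # cost per GPU-hour
cpu_rate = 0.10   # cost per CPU-hour
mem_rate = 0.02   # cost per GB-hour of CPU memory

gpu_hours = 120.0
cpu_hours = 800.0
mem_gb_hours = 3200.0

total = gpu_hours * gpu_rate + cpu_hours * cpu_rate + mem_gb_hours * mem_rate
print(f"Total consumption cost: ${total:.2f}")  # Total consumption cost: $444.00
```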

Use the time picker dropdown to select relative time-range options or to set custom absolute time ranges. You can change the timezone and fiscal year settings from the time-range controls by pressing the *Change time settings* button.

The dashboard has a graph of the GPU allocation over time.

The dashboard has a graph of the Project over-quota GPU consumption.

![](img/consumption-dashboard-project-over-quota-graph.png)

## Quota Management Dashboard

The Quota Management dashboard provides an efficient means to monitor and manage resource utilization within the AI cluster. The dashboard is divided into sections with essential metrics and data visualizations that help identify resource usage patterns, potential bottlenecks, and areas for optimization. The sections of the dashboard include:

* **Add Filter**
* **Quota / Total**
* **Allocated / Quota**
* **Pending workloads**
* **Quota by node pool**
* **Allocation by node pool**
* **Pending workloads by node pool**
* **Departments with lowest allocation by node pool**
* **Projects with lowest allocation ratio by node pool**
* **Over time allocation / quota**

### Add Filter

Use the *Add Filter* dropdown to select filters for the dashboard. The filters change the data shown on the dashboard. Available filters are:

* Departments
* Projects
* Nodes

Select a filter from the dropdown, then select an item from the list, and press *Apply*.

!!! Note
    You can create a filter with multiple categories, but you can use each category and item only once.

### Quota / Total

This section shows the number of GPUs that are in the quota, based on the filter selection. The quota of GPUs is the number of GPUs that are reserved for use.

### Allocated / Quota

This section shows the number of GPUs that are allocated, based on the filter selection. Allocated GPUs are the number of GPUs that are being used.

### Pending workloads

This section shows the number of workloads that are pending, based on the filter selection. Pending workloads are workloads that have not started.

### Quota by node pool

This section shows the quota of GPUs by node pool, based on the filter. The quota is the number of GPUs that are reserved for use. You can drill down into the data in this section by pressing the graph or the link at the bottom of the section.

### Allocation by node pool

This section shows the allocation of GPUs by node pool, based on the filter. The allocation is the number of GPUs that are being used. You can drill down into the data in this section by pressing the graph or the link at the bottom of the section.

### Pending workloads by node pool

This section shows the number of pending workloads by node pool. You can drill down into the data in this section by pressing the graph or the link at the bottom of the section.

### Departments with lowest allocation by node pool

This section shows the departments with the lowest allocation of GPUs as a percentage of the total number of GPUs.

### Projects with lowest allocation ratio by node pool

This section shows the projects with the lowest allocation of GPUs as a percentage of the total number of GPUs (a sketch of the ranking logic follows).
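
The two "lowest allocation" sections above amount to sorting organizational units by their allocated/total ratio. Here is a minimal sketch with hypothetical project records; the field names are illustrative, not Run:ai API objects.

```python
# Sketch: rank projects by GPU allocation ratio (allocated / total),
# lowest first, as in the "lowest allocation" sections. The records
# are invented for illustration.
projects = [
    {"name": "nlp-research", "allocated_gpus": 1, "total_gpus": 10},
    {"name": "vision", "allocated_gpus": 6, "total_gpus": 8},
    {"name": "speech", "allocated_gpus": 0, "total_gpus": 4},
]

ranked = sorted(projects, key=lambda p: p["allocated_gpus"] / p["total_gpus"])
for p in ranked:
    pct = p["allocated_gpus"] / p["total_gpus"] * 100
    print(f"{p['name']}: {pct:.0f}% of GPUs allocated")
# speech: 0% of GPUs allocated
# nlp-research: 10% of GPUs allocated
# vision: 75% of GPUs allocated
```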

### Over time allocation / quota

This section shows the allocation of GPUs from the quota over a period of time.
