docs/developer/metrics/metrics.md
35 additions & 2 deletions
@@ -15,8 +15,7 @@ The purpose of this document is to detail the structure and purpose of metrics e
 Run:ai uses [Prometheus](https://prometheus.io){target=_blank} for collecting and querying metrics.
 
 !!! Note
-From cluster version 2.17 onwards, we will support metrics via the Run:ai API and direct metrics queries will be deprecated.
-<!-- TODO 1. Add note as bullet to What's new 2. Define better "Direct metrics query" 3. Add this note to deprecation notifications-->
+From cluster version 2.17 onwards, Run:ai will support metrics via the Run:ai API and direct metrics queries (metrics that are queried directly from Prometheus) will be deprecated.
 
 ## Published Run:ai Metrics
 
@@ -111,6 +110,40 @@ Run:ai exports other metrics emitted by NVIDIA and Kubernetes packages, as follo
 
 For additional information, see Kubernetes [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics){target=_blank} and NVIDIA [dcgm exporter](https://github.com/NVIDIA/gpu-monitoring-tools){target=_blank}.
 
+## Changed metrics and API mapping
+
+Starting in version 2.17, Run:ai metrics are available as API endpoints. Using the API endpoints is more efficient and provides an easier way of retrieving metrics in any application. The following table lists the metrics that were changed.
+
+| Metric name in version 2.16 | 2.17 Change Description | 2.17 API Endpoint |
+| --- | --- | --- |
+| runai\_active\_job\_cpu\_requested\_cores | changed to API |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_REQUEST" metricType |
+| runai\_active\_job\_memory\_requested\_bytes | changed to API |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_MEMORY\_REQUEST" metricType |
+| runai\_cluster\_cpu\_utilization | changed to API |https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics ; with "CPU\_UTILIZATION" metricType |
+| runai\_cluster\_memory\_utilization | changed to API |https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics ; with "CPU\_MEMORY\_UTILIZATION" metricType |
+| runai\_gpu\_utilization\_non\_fractional\_jobs | no longer available ||
+| runai\_gpu\_utilization\_per\_pod\_per\_gpu | changed to API |https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics ; with "GPU\_UTILIZATION\_PER\_GPU" metricType |
+| runai\_gpu\_utilization\_per\_workload | changed to API + labels changed |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "GPU\_UTILIZATION" metricType |
+| runai\_job\_image | no longer available ||
+| runai\_job\_requested\_gpu\_memory | changed to API + renamed to: "runai\_requested\_gpu\_memory\_mb\_per\_workload" with different labels |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "GPU\_MEMORY\_REQUEST" metricType |
+| runai\_job\_requested\_gpus | renamed to: "runai\_requested\_gpus\_per\_workload" with different labels ||
+| runai\_job\_total\_runtime | renamed to: "runai\_run\_time\_seconds\_per\_workload" with different labels ||
+| runai\_job\_total\_wait\_time | renamed to: "runai\_wait\_time\_seconds\_per\_workload" with different labels ||
+| runai\_gpu\_memory\_used\_mebibytes\_per\_workload | changed to API + labels changed |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "GPU\_MEMORY\_USAGE" metricType |
+| runai\_gpu\_memory\_used\_mebibytes\_per\_pod\_per\_gpu | changed to API + labels changed |https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics ; with "GPU\_MEMORY\_USAGE\_PER\_GPU" metricType |
+| runai\_node\_gpu\_used\_memory\_bytes | renamed and changed units: "runai\_gpu\_memory\_used\_mebibytes\_per\_node" ||
+| runai\_node\_total\_memory\_bytes | renamed and changed units: "runai\_gpu\_memory\_total\_mebibytes\_per\_node" ||
+| runai\_project\_info | labels changed ||
+| runai\_active\_job\_cpu\_limits | changed to API + renamed to: "runai\_cpu\_limits\_per\_active\_workload" |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_LIMIT" metricType |
+| runai\_job\_cpu\_usage | changed to API + labels changed |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_USAGE" metricType |
+| runai\_active\_job\_memory\_limits | changed to API + renamed to: "runai\_memory\_limits\_per\_active\_workload" |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_MEMORY\_LIMIT" metricType |
+| runai\_running\_job\_memory\_requested\_bytes | was a duplication of "runai\_active\_job\_memory\_requested\_bytes", see above ||
+| runai\_job\_memory\_used\_bytes | changed to API + labels changed |https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_MEMORY\_USAGE" metricType |
+| runai\_job\_swap\_memory\_used\_bytes | no longer available ||
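
The mapping in the table above could be used from a script roughly as follows. This is a minimal sketch, not the Run:ai client: the `METRIC_TO_API` lookup and the `build_metrics_url` helper are hypothetical names introduced here, only a few table rows are copied in, and passing `metricType` as a URL query parameter is an assumption (the table only says the endpoint is used "with" a given metricType). Authentication and any time-range parameters are left out because the diff does not describe them.

```python
from urllib.parse import urlencode

# Hypothetical lookup copied from a few rows of the mapping table:
# 2.16 metric name -> (2.17 endpoint path template, metricType value).
METRIC_TO_API = {
    "runai_active_job_cpu_requested_cores": (
        "/api/v1/workloads/{workloadId}/metrics", "CPU_REQUEST"),
    "runai_cluster_cpu_utilization": (
        "/api/v2/clusters/{clusterUuid}/metrics", "CPU_UTILIZATION"),
    "runai_gpu_utilization_per_pod_per_gpu": (
        "/api/v1/workloads/{workloadId}/pods/{podId}/metrics",
        "GPU_UTILIZATION_PER_GPU"),
    "runai_job_cpu_usage": (
        "/api/v1/workloads/{workloadId}/metrics", "CPU_USAGE"),
}


def build_metrics_url(metric_name, base_url="https://app.run.ai", **ids):
    """Return the 2.17 API URL that replaces a 2.16 Prometheus metric.

    `ids` supplies the path placeholders (workloadId, podId, clusterUuid).
    Assumption: metricType is sent as a query parameter.
    """
    if metric_name not in METRIC_TO_API:
        raise ValueError(f"no API mapping known for metric {metric_name!r}")
    path_template, metric_type = METRIC_TO_API[metric_name]
    # str.format raises KeyError if a required ID was not supplied.
    path = path_template.format(**ids)
    return f"{base_url}{path}?{urlencode({'metricType': metric_type})}"


if __name__ == "__main__":
    # "my-cluster" is a placeholder ID, not a real cluster UUID.
    print(build_metrics_url("runai_cluster_cpu_utilization",
                            clusterUuid="my-cluster"))
```

Metrics listed as "no longer available" or only renamed have no endpoint in the table, so the sketch raises `ValueError` for them rather than guessing a URL.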