Commit fda6454

Merge pull request #760 from jasonnovichRunAI:RUN-17720-Direct-metrics-to-API-table
RUN-17720-Direct-metrics-to-API-table
2 parents 459d17e + e8326c2 commit fda6454

2 files changed

+37
-4
lines changed

.github/workflows/deploy-staging.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-name: publish docs CI to staging
+name: deploy docs CI to staging

 on:
 workflow_dispatch:
@@ -45,4 +45,4 @@ jobs:

 - name: Sync output to S3
 run: |
-aws s3 sync ./site/ s3://${{ inputs.bucket_name }} --delete
+aws s3 sync ./site/ s3://${{ inputs.bucket_name }} --delete

docs/developer/metrics/metrics.md

Lines changed: 35 additions & 2 deletions
@@ -15,8 +15,7 @@ The purpose of this document is to detail the structure and purpose of metrics e
 Run:ai uses [Prometheus](https://prometheus.io){target=_blank} for collecting and querying metrics.

 !!! Note
-From cluster version 2.17 onwards, we will support metrics via the Run:ai API and direct metrics queries will be deprecated.
-<!-- TODO 1. Add note as bullet to What's new 2. Define better "Direct metrics query" 3. Add this note to deprecation notifications-->
+From cluster version 2.17 onwards, Run:ai will support metrics via the Run:ai API and direct metrics queries (metrics that are queried directly from Prometheus) will be deprecated.

 ## Published Run:ai Metrics

@@ -111,6 +110,40 @@ Run:ai exports other metrics emitted by NVIDIA and Kubernetes packages, as follo

 For additional information, see Kubernetes [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics){target=_blank} and NVIDIA [dcgm exporter](https://github.com/NVIDIA/gpu-monitoring-tools){target=_blank}.

+## Changed metrics and API mapping
+
+Starting in version 2.17, Run:ai metrics are available as API endpoints. Using the API endpoints is more efficient and provides an easier way of retrieving metrics in any application. The following table lists the metrics that were changed.
+
+| Metric name in version 2.16 | 2.17 Change Description | 2.17 API Endpoint |
+| --- | --- | --- |
+| runai\_active\_job\_cpu\_requested\_cores | changed to API | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_REQUEST" metricType |
+| runai\_active\_job\_memory\_requested\_bytes | changed to API | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_MEMORY\_REQUEST" metricType |
+| runai\_cluster\_cpu\_utilization | changed to API | https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics ; with "CPU\_UTILIZATION" metricType |
+| runai\_cluster\_memory\_utilization | changed to API | https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics ; with "CPU\_MEMORY\_UTILIZATION" metricType |
+| runai\_gpu\_utilization\_non\_fractional\_jobs | no longer available | |
+| runai\_allocated\_gpu\_count\_per\_workload | labels changed | |
+| runai\_gpu\_utilization\_per\_pod\_per\_gpu | changed to API | https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics ; with "GPU\_UTILIZATION\_PER\_GPU" metricType |
+| runai\_gpu\_utilization\_per\_workload | changed to API + labels changed | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "GPU\_UTILIZATION" metricType |
+| runai\_job\_image | no longer available | |
+| runai\_job\_requested\_gpu\_memory | changed to API + renamed to: "runai\_requested\_gpu\_memory\_mb\_per\_workload" with different labels | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "GPU\_MEMORY\_REQUEST" metricType |
+| runai\_job\_requested\_gpus | renamed to: "runai\_requested\_gpus\_per\_workload" with different labels | |
+| runai\_job\_total\_runtime | renamed to: "runai\_run\_time\_seconds\_per\_workload" with different labels | |
+| runai\_job\_total\_wait\_time | renamed to: "runai\_wait\_time\_seconds\_per\_workload" with different labels | |
+| runai\_gpu\_memory\_used\_mebibytes\_per\_workload | changed to API + labels changed | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "GPU\_MEMORY\_USAGE" metricType |
+| runai\_gpu\_memory\_used\_mebibytes\_per\_pod\_per\_gpu | changed to API + labels changed | https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics ; with "GPU\_MEMORY\_USAGE\_PER\_GPU" metricType |
+| runai\_node\_gpu\_used\_memory\_bytes | renamed and changed units: "runai\_gpu\_memory\_used\_mebibytes\_per\_node" | |
+| runai\_node\_total\_memory\_bytes | renamed and changed units: "runai\_gpu\_memory\_total\_mebibytes\_per\_node" | |
+| runai\_project\_info | labels changed | |
+| runai\_active\_job\_cpu\_limits | changed to API + renamed to: "runai\_cpu\_limits\_per\_active\_workload" | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_LIMIT" metricType |
+| runai\_job\_cpu\_usage | changed to API + labels changed | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_USAGE" metricType |
+| runai\_active\_job\_memory\_limits | changed to API + renamed to: "runai\_memory\_limits\_per\_active\_workload" | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_MEMORY\_LIMIT" metricType |
+| runai\_running\_job\_memory\_requested\_bytes | was a duplication of "runai\_active\_job\_memory\_requested\_bytes", see above | |
+| runai\_job\_memory\_used\_bytes | changed to API + labels changed | https://app.run.ai/api/v1/workloads/{workloadId}/metrics ; with "CPU\_MEMORY\_USAGE" metricType |
+| runai\_job\_swap\_memory\_used\_bytes | no longer available | |
+| runai\_gpu\_count\_per\_node | added labels | |
+| runai\_last\_gpu\_utilization\_time\_per\_workload | labels changed | |
+| runai\_gpu\_idle\_time\_per\_workload | renamed to: "runai\_gpu\_idle\_seconds\_per\_workload" with different labels | |
+
 ## Create custom dashboards

 To create custom dashboards based on the above metrics, please contact Run:ai customer support.
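
For readers migrating from direct Prometheus queries, the mapping table added in `metrics.md` could be used roughly as follows. This is a minimal sketch, not Run:ai's documented client: the helper `build_metrics_url` and the dictionary `METRIC_TO_API` are hypothetical names, only three rows of the table are copied, and passing `metricType` as a query parameter is an assumption (the table only says the endpoint is used "with" a given metricType). Authentication and any time-range parameters are not covered by the diff and are left out.

```python
from urllib.parse import urlencode

# Host as written in the table; a real deployment would likely use its own tenant URL.
BASE_URL = "https://app.run.ai"

# 2.16 metric name -> (2.17 API path template, metricType), copied from the table.
METRIC_TO_API = {
    "runai_active_job_cpu_requested_cores": (
        "/api/v1/workloads/{workloadId}/metrics", "CPU_REQUEST"),
    "runai_cluster_cpu_utilization": (
        "/api/v2/clusters/{clusterUuid}/metrics", "CPU_UTILIZATION"),
    "runai_gpu_utilization_per_pod_per_gpu": (
        "/api/v1/workloads/{workloadId}/pods/{podId}/metrics",
        "GPU_UTILIZATION_PER_GPU"),
}


def build_metrics_url(old_metric: str, **ids: str) -> str:
    """Return the API URL that replaces a 2.16 Prometheus metric.

    `ids` fills the path placeholders, e.g. workloadId, podId, clusterUuid.
    Raises KeyError if the metric is not in METRIC_TO_API or an id is missing.
    """
    path_template, metric_type = METRIC_TO_API[old_metric]
    path = path_template.format(**ids)
    # Assumption: metricType travels as a query parameter with this exact name.
    query = urlencode({"metricType": metric_type})
    return f"{BASE_URL}{path}?{query}"


if __name__ == "__main__":
    print(build_metrics_url("runai_cluster_cpu_utilization", clusterUuid="my-cluster"))
```

Keeping the mapping in a plain dictionary makes it easy to extend with the remaining "changed to API" rows, and metrics marked "no longer available" simply have no entry, so a lookup fails loudly instead of silently querying the wrong endpoint.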
