dutyCycle loses data

`dutyCycle` is the GPU utilization during the last "sample period" of the driver, according to NVIDIA docs:

> Percent of time over the past sample period during which one or more kernels was executing on the GPU.

> Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.

https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t

So the sample period can be very short, as little as 1/6 second. Prometheus scrape intervals are usually 10 seconds or longer, so data is almost certainly lost.
Let's say a workload uses 100% of the GPU for one second, then sleeps for one second - the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but each scrape only sees the most recent sample period, so there's a good chance it sees only 100% or 0% every time. The recorded utilization will probably be incorrect.
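To make the example concrete, here is a small simulation of it. This is only a sketch under assumptions: the sample period is taken as 1/6 second, the scrape interval as 10 seconds, and `busy_seconds`, `duty_cycle` and `scrape` are made-up helper names, not part of the exporter or of NVML.

```python
SAMPLE_PERIOD = 1 / 6   # shortest driver sample period quoted above, in seconds
SCRAPE_INTERVAL = 10.0  # assumed Prometheus scrape interval, in seconds


def busy_seconds(t):
    """Cumulative busy time up to t for a workload that is busy during
    [2k, 2k + 1) and idle during [2k + 1, 2k + 2)."""
    full_cycles, remainder = divmod(t, 2.0)
    return full_cycles + min(remainder, 1.0)


def duty_cycle(t):
    """Percent of the window [t - SAMPLE_PERIOD, t] spent busy."""
    busy = busy_seconds(t) - busy_seconds(t - SAMPLE_PERIOD)
    return round(100.0 * busy / SAMPLE_PERIOD, 6)


def scrape(first_scrape, count=6):
    """Values seen by scrapes at first_scrape, first_scrape + 10 s, ..."""
    return [duty_cycle(first_scrape + i * SCRAPE_INTERVAL) for i in range(count)]


always_busy = scrape(0.5)  # every scrape lands inside a busy second
always_idle = scrape(1.5)  # every scrape lands inside an idle second
print(always_busy, always_idle)
```

Because the 10 second scrape interval is a multiple of the 2 second workload cycle, every scrape hits the same phase: the series that starts at 0.5 s reports 100% each time, the one that starts at 1.5 s reports 0% each time, and neither ever shows the true 50% average.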

Instead, it would be better to have a `..._seconds_total` counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
This way we wouldn't lose data due to long Prometheus scrape intervals, but it would probably require some more work in the exporter (polling the data at a higher frequency).
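A rough sketch of what the exporter side could look like, assuming it polls the duty cycle about once per driver sample period. `BusySecondsCounter`, `read_duty_cycle` and `busy_seconds_total` are hypothetical names for illustration; the reader function is faked here instead of calling NVML.

```python
class BusySecondsCounter:
    """Accumulates GPU busy time from periodic duty-cycle readings."""

    def __init__(self, read_duty_cycle):
        # read_duty_cycle: hypothetical callable returning the current
        # duty cycle in percent (0-100).
        self._read = read_duty_cycle
        self.busy_seconds_total = 0.0

    def poll(self, elapsed_seconds):
        """Add one reading, weighted by the time since the previous poll."""
        self.busy_seconds_total += self._read() / 100.0 * elapsed_seconds
        return self.busy_seconds_total


# Fake readings for the workload above: 1 s busy, 1 s idle, polled every 1/6 s.
POLL_INTERVAL = 1 / 6
readings = iter([100 if (i // 6) % 2 == 0 else 0 for i in range(60)])

counter = BusySecondsCounter(lambda: next(readings))
for _ in range(60):  # 60 polls cover one 10 s scrape interval
    counter.poll(POLL_INTERVAL)

# What a rate() over this scrape interval would report, as a fraction.
utilization = counter.busy_seconds_total / 10.0
print(counter.busy_seconds_total, utilization)
```

With a counter like this, a scrape at any point in time sees the accumulated busy seconds, so the rate over a 10 second interval comes out at about 0.5 regardless of where the scrape lands in the workload cycle. The accuracy still depends on the exporter polling at least as often as the driver's sample period.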
