Description
`dutyCycle` is the GPU utilization during the last "sample period" of the driver, according to the NVIDIA docs:
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.
https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t
So this can be a very short period. Prometheus scrape intervals are usually 10 seconds or longer, so data is almost certainly lost:
Let's say a workload uses 100% of the GPU for one second, then sleeps one second - the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but there's a good chance it would only see 100% or 0% every time it does, so the recorded utilization will probably be incorrect.
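To make the example concrete, here is a small simulation. It is only a sketch under stated assumptions: the workload is busy during every even second and idle during every odd one, the driver sample period is 1/6 second, and Prometheus scrapes every 10 seconds. The function names are made up for illustration and are not part of the exporter or of NVML.

```python
def busy_fraction(start, end):
    """Fraction of [start, end] during which the hypothetical workload is busy.

    Assumed workload: busy during [2k, 2k+1), idle during [2k+1, 2k+2).
    """
    def busy_time(t):
        # Total busy seconds accumulated in [0, t].
        full_cycles, remainder = divmod(t, 2.0)
        return full_cycles + min(remainder, 1.0)

    return (busy_time(end) - busy_time(start)) / (end - start)


def duty_cycle_at(t, sample_period):
    """Utilization (percent) over the sample period ending at time t."""
    return 100.0 * busy_fraction(t - sample_period, t)


SAMPLE_PERIOD = 1 / 6    # assumed driver sample period, in seconds
SCRAPE_INTERVAL = 10.0   # assumed Prometheus scrape interval, in seconds

# Scrapes that happen to land in the busy half of the cycle ...
scrapes_busy = [duty_cycle_at(0.5 + i * SCRAPE_INTERVAL, SAMPLE_PERIOD) for i in range(6)]
# ... and scrapes that happen to land in the idle half.
scrapes_idle = [duty_cycle_at(1.5 + i * SCRAPE_INTERVAL, SAMPLE_PERIOD) for i in range(6)]

# What the workload actually did over the same minute.
true_average = 100.0 * busy_fraction(0.0, 60.0)

print("scrapes in busy phase:", [round(v) for v in scrapes_busy])
print("scrapes in idle phase:", [round(v) for v in scrapes_idle])
print("true average:", true_average)
```

Because the scrape interval is a multiple of the workload's 2-second cycle in this setup, every scrape lands at the same phase, so the recorded series is a constant 100% or a constant 0% while the true average is 50%.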
Instead, it would be better to have a ..._seconds_total counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
This way we wouldn't lose data due to long Prometheus scrape intervals, but it would probably require some more work in the exporter (polling the driver at a higher frequency, ideally about once per sample period).
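A rough sketch of the idea, not the exporter's actual code: poll the utilization frequently, convert each reading into busy seconds, and accumulate them in a monotonically increasing counter. `BusySecondsCounter` and `simulated_duty_cycle` are hypothetical names; the simulation assumes a 0.25-second poll interval that matches the period each reading covers, and the same 1-second-busy, 1-second-idle workload as above.

```python
class BusySecondsCounter:
    """Accumulates GPU busy time from periodic utilization readings."""

    def __init__(self):
        self.total = 0.0        # busy seconds so far; only ever increases
        self._last_poll = None  # timestamp of the previous poll

    def observe(self, duty_cycle_percent, now):
        # Attribute the reading to the time elapsed since the previous poll.
        if self._last_poll is not None:
            elapsed = now - self._last_poll
            self.total += (duty_cycle_percent / 100.0) * elapsed
        self._last_poll = now


POLL_INTERVAL = 0.25    # assumed poll interval, in seconds
POLLS_PER_SCRAPE = 40   # one scrape every 10 seconds


def simulated_duty_cycle(tick):
    # Stand-in for a driver reading: the workload is busy for the first
    # four polls of every eight-poll (2-second) cycle.
    return 100.0 if tick % 8 in (1, 2, 3, 4) else 0.0


counter = BusySecondsCounter()
scrapes = []  # (timestamp, counter value) pairs as Prometheus would see them
for tick in range(0, 241):
    now = tick * POLL_INTERVAL
    counter.observe(simulated_duty_cycle(tick), now)
    if tick % POLLS_PER_SCRAPE == 0:
        scrapes.append((now, counter.total))

# Average utilization between consecutive scrapes, similar in spirit to
# applying rate() to the counter in PromQL.
utilizations = [
    (v1 - v0) / (t1 - t0)
    for (t0, v0), (t1, v1) in zip(scrapes, scrapes[1:])
]
print(utilizations)
```

With a counter, the scrape timing no longer matters: each scrape sees the busy time accumulated since the previous one, so the average between any two scrapes reflects the whole interval rather than a single short sample.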