retryPolicy: OnError does not retry Pod allocation failure due to unhealthy devices (nvidia.com/gpu) #14445

xiki-tempula · 2025-05-02T13:22:54Z

Description:

We use Argo Workflows' retryStrategy with retryPolicy: OnError to automatically retry pods that hit transient infrastructure errors, while letting pods whose failures originate in the container's business logic fail permanently.

Our configured retryStrategy is as follows:

retryStrategy:
  backoff:
    duration: '1'
    factor: '2'
  limit: '10'
  retryPolicy: OnError
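
For reference, a rough sketch of the delay schedule this configuration would imply. It assumes the delay before retry n is duration × factor^n, that a bare duration of '1' means one second, and that no maxDuration cap applies. The function name `backoff_delays` is hypothetical, and this is a simplified model, not Argo's actual implementation.

```python
def backoff_delays(duration_s, factor, limit):
    """Return the assumed delay (in seconds) before each retry attempt.

    Assumption: delay before retry n (0-indexed) is duration_s * factor**n,
    with no upper cap on an individual delay.
    """
    return [duration_s * factor ** attempt for attempt in range(limit)]


# Values taken from the retryStrategy above: duration '1', factor '2', limit '10'.
delays = backoff_delays(1, 2, 10)
total_wait = sum(delays)
print(delays)
print(total_wait)
```

Under these assumptions, the ten retries would span roughly 17 minutes of cumulative backoff, which is the window in which we would hope an unhealthy GPU node recovers or the pod lands on a different node.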

Expected Behavior:

We expect that when a pod fails to schedule or run due to infrastructure issues (like temporary resource unavailability), the OnError policy should recognize this as a system-level error and trigger the retry mechanism according to the defined backoff and limit. Business logic errors (e.g., an exception within the application code) should correctly result in a final Failed state without retries.

Actual Behavior:

We have observed instances where pods fail to start, and the workflow does not initiate the retry strategy. The specific error message associated with these failures is:

MESSAGE: Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices nvidia.com/gpu

It appears that Argo Workflows classifies this specific pod allocation failure related to unhealthy GPU devices as a Failure rather than an Error. Consequently, the retryPolicy: OnError condition is not met, and the pod fails permanently without any retry attempts, contrary to our intention of retrying infrastructure-related allocation problems.
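
To make the suspected behavior concrete, here is a minimal sketch of how we understand the retry decision, assuming the policy is matched only against the node's final phase. The names `Phase` and `should_retry` are hypothetical, the OnTransientError policy is omitted, and this is a simplified model rather than the controller's real logic.

```python
from enum import Enum


class Phase(Enum):
    """Final node phases relevant to the retry decision (simplified)."""
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    ERROR = "Error"


def should_retry(policy, phase):
    """Return True if the given retryPolicy would retry a node in this phase.

    Assumption: the decision depends only on the policy and the phase.
    """
    if phase is Phase.SUCCEEDED:
        return False
    if policy == "Always":
        return True
    if policy == "OnFailure":
        return phase is Phase.FAILED
    if policy == "OnError":
        return phase is Phase.ERROR
    raise ValueError(f"unsupported policy in this sketch: {policy}")


# The rejected GPU pod appears to end up as Failed, so OnError skips it.
print(should_retry("OnError", Phase.FAILED))
print(should_retry("OnError", Phase.ERROR))
```

If this model is right, the rejected pod would only be retried under OnError if the allocation failure were reported with the Error phase instead of Failed.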

Question/Suggestion:

Should pod allocation failures, specifically errors like "Allocate failed due to no healthy devices present", be classified as Error rather than Failure within the context of the retryPolicy? Classifying such infrastructure-level scheduling issues as Error would align better with the purpose of retryPolicy: OnError, allowing workflows to recover from temporary resource unavailability automatically.
