retryPolicy: OnError does not retry Pod allocation failure due to unhealthy devices (nvidia.com/gpu) #14445
xiki-tempula started this conversation in Ideas
Description:
We use Argo Workflows' retryStrategy with retryPolicy: OnError to automatically retry pods that hit transient infrastructure errors, while letting pods that fail because of business logic inside the container fail permanently.
Our configured retryStrategy is as follows:
Expected Behavior:
We expect that when a pod fails to schedule or run due to infrastructure issues (like temporary resource unavailability), the OnError policy should recognize this as a system-level error and trigger the retry mechanism according to the defined backoff and limit. Business logic errors (e.g., an exception within the application code) should correctly result in a final Failed state without retries.
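For illustration, a retryStrategy of the kind described here could look roughly like the sketch below. The template name, image, limit, and backoff values are placeholder assumptions chosen for the example, not the exact settings in use.

```yaml
# Illustrative sketch only: names and values are hypothetical placeholders.
templates:
  - name: gpu-task                      # hypothetical template name
    retryStrategy:
      limit: "3"                        # assumed retry limit
      retryPolicy: OnError              # retry only nodes that end in the Error phase
      backoff:
        duration: "1m"                  # assumed initial delay
        factor: "2"                     # assumed multiplier between attempts
        maxDuration: "30m"              # assumed cap on total retry time
    container:
      image: example/gpu-job:latest     # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```

With this shape, a node that ends in the Failed phase (as opposed to Error) is not retried, which is the distinction the rest of this post depends on.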
Actual Behavior:
We have observed instances where pods fail to start, and the workflow does not initiate the retry strategy. The specific error message associated with these failures is:
MESSAGE: Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices nvidia.com/gpu
It appears that Argo Workflows classifies this specific pod allocation failure related to unhealthy GPU devices as a Failure rather than an Error. Consequently, the retryPolicy: OnError condition is not met, and the pod fails permanently without any retry attempts, contrary to our intention of retrying infrastructure-related allocation problems.
Question/Suggestion:
Should pod allocation failures, specifically errors like "Allocate failed due to no healthy devices present", be classified as Error rather than Failure within the context of the retryPolicy? Classifying such infrastructure-level scheduling issues as Error would align better with the purpose of retryPolicy: OnError, allowing workflows to recover from temporary resource unavailability automatically.
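Until the classification question is settled, a possible workaround is to match on the failure message with retryStrategy.expression. This is a minimal sketch that has not been verified against this failure. It assumes Argo Workflows v3.5 or later (where lastRetry.message is documented as available) and assumes the kubelet rejection text actually ends up in the node message that the expression sees. The limit is a placeholder.

```yaml
# Sketch under the assumptions stated above; not a confirmed fix.
retryStrategy:
  limit: "3"                  # assumed retry limit
  retryPolicy: Always         # let the expression below decide what is retried
  expression: >-
    lastRetry.status == "Error" or
    lastRetry.message matches "Allocate failed due to no healthy devices present"
```

The expression retries every Error-phase node, plus Failed-phase nodes whose message contains the GPU allocation text, so ordinary business-logic failures would still fail permanently. Because it depends on string matching, it would need updating if the kubelet message wording changes.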