I've been looking into using preemptible nodes on GKE with Argo Workflows, and I'm not sure how best to handle retries. I would like to retry step failures due to preemption, but not failures due to user error (e.g. typos in my code). I also want to retry errors (the Error node phase). But now that GKE uses graceful node shutdown for preemptions, preempted pods get the Failed phase in Argo, which puts them in the same category as user error. It seems like I have a few options, but none of them seems quite right:
Use the Always retry policy, which retries everything. This means retrying user error when I'd prefer not to (first sketch after this list).
Use the OnTransientError retry policy and set TRANSIENT_ERROR_PATTERN to match the message given to preempted pods: "Node is shutting, evicting pods" (second sketch below). This relies on an Argo feature that might go away, and which I can't configure at the step or workflow level. And if I understand correctly, it doesn't let me retry non-transient errors, which ideally I would also be able to do.
Add a signal handler to my code that catches SIGTERM, then returns an exit code that I can check in the retry strategy's expression (third sketch below). This seems kind of complicated and requires modifying all of my tasks, but it could work.
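To make the trade-offs concrete, here's roughly what each option looks like for me. All of the images, limits, and names below are placeholders, not real config from my setup. Option 1:

```yaml
# Option 1: retry everything with the Always policy, including user error.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemptible-always-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"            # placeholder
        retryPolicy: Always   # retries Failed and Error alike, typos included
      container:
        image: my-task:latest # placeholder
        command: [python, task.py]
```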
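Option 2 takes two pieces: the pattern has to go on the workflow controller itself (which is why I can't scope it to a step or workflow), plus OnTransientError on the template. This assumes the controller runs in the argo namespace:

```yaml
# Option 2, part 1: controller-wide env var.
# Only the fields being changed are shown; apply to the existing Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: TRANSIENT_ERROR_PATTERN
              # regex matched against the failure message
              value: "Node is shutting, evicting pods"
---
# Option 2, part 2: a template that retries only transient errors
# (the built-in set plus anything matching the pattern above).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemptible-transient-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"                      # placeholder
        retryPolicy: OnTransientError
      container:
        image: my-task:latest           # placeholder
        command: [python, task.py]
```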
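Option 3 is roughly this: trap SIGTERM in a wrapper, exit with a code I reserve for preemption (143 here, my own arbitrary choice), and check lastRetry.exitCode in the expression:

```yaml
# Option 3: catch SIGTERM in the task and surface it as a distinct exit code.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemptible-sigterm-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"            # placeholder
        retryPolicy: Always   # combined with the expression, so the expression decides
        # Retry on preemption (exit code 143 from the trap below) or on Error,
        # but not on other Failed outcomes such as a typo exiting 1.
        expression: "asInt(lastRetry.exitCode) == 143 || lastRetry.status == 'Error'"
      script:
        image: python:3.12    # placeholder
        command: [bash]
        source: |
          # Run the real task in the background so bash can react to SIGTERM
          # immediately, then exit with a code the expression can recognize.
          trap 'echo "node shutting down; treating this run as preempted"; exit 143' TERM
          python /app/task.py &   # hypothetical task entrypoint
          wait $!
```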
I wonder if an easier and more reliable option would be to surface the pod's message to the template context passed to expression. Then I could check for preempted steps without modifying application code or using unstable environment flags. What do you think @sarabala1979 @alexec? Would that feature make sense, or is there a better way to retry preempted steps? (Hypothetical sketch of what I have in mind below.)
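For what it's worth, if the message were exposed, I imagine the retry strategy could be as simple as something like this (entirely hypothetical, since no such lastRetry.message field exists in the expression context today):

```yaml
# Hypothetical: assumes a lastRetry.message field that does not exist yet;
# the quoted string is the message preempted pods currently get.
retryStrategy:
  limit: "3"   # placeholder
  expression: "lastRetry.status == 'Error' || lastRetry.message matches 'Node is shutting'"
```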