I've been looking into using preemptible nodes on GKE with Argo Workflows, and I'm not sure how best to handle retries. I would like to retry step failures due to preemption, but not failures due to user error (e.g. typos in my code). I also want to retry errors (the Error node phase). But now that GKE uses graceful node shutdown for preemptions, preempted pods get the Failed phase in Argo, which puts them in the same category as user error. It seems like I have a few options, but none of them seems quite right:
Use the Always retry policy, which retries everything. This means retrying user error when I'd prefer not to (first sketch after this list).
Use the OnTransientError retry policy and set TRANSIENT_ERROR_PATTERN to match the message given to preempted pods: "Node is shutting, evicting pods" (second sketch below). This relies on an Argo feature that might go away, and which I can't configure at the step or workflow level. And if I understand correctly, it doesn't let me retry non-transient errors, which ideally I would also be able to do.
Add a signal handler to my code that catches SIGTERM, then returns an exit code that I can check in the retry strategy's expression (third sketch below). This seems kind of complicated and requires modifying all of my tasks, but it could work.
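To make the trade-offs concrete, here's roughly what each option looks like for me. All of the images, limits, and names below are placeholders, not real config from my setup. Option 1:

```yaml
# Option 1: retry everything with the Always policy, including user error.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemptible-always-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"            # placeholder
        retryPolicy: Always   # retries Failed and Error alike, typos included
      container:
        image: my-task:latest # placeholder
        command: [python, task.py]
```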
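Option 2 takes two pieces: the pattern has to go on the workflow controller itself (which is why I can't scope it to a step or workflow), plus OnTransientError on the template. This assumes the controller runs in the argo namespace:

```yaml
# Option 2, part 1: controller-wide env var.
# Only the fields being changed are shown; apply to the existing Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: TRANSIENT_ERROR_PATTERN
              # regex matched against the failure message
              value: "Node is shutting, evicting pods"
---
# Option 2, part 2: a template that retries only transient errors
# (the built-in set plus anything matching the pattern above).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemptible-transient-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"                      # placeholder
        retryPolicy: OnTransientError
      container:
        image: my-task:latest           # placeholder
        command: [python, task.py]
```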
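Option 3 is roughly this: trap SIGTERM in a wrapper, exit with a code I reserve for preemption (143 here, my own arbitrary choice), and check lastRetry.exitCode in the expression:

```yaml
# Option 3: catch SIGTERM in the task and surface it as a distinct exit code.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemptible-sigterm-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"            # placeholder
        retryPolicy: Always   # combined with the expression, so the expression decides
        # Retry on preemption (exit code 143 from the trap below) or on Error,
        # but not on other Failed outcomes such as a typo exiting 1.
        expression: "asInt(lastRetry.exitCode) == 143 || lastRetry.status == 'Error'"
      script:
        image: python:3.12    # placeholder
        command: [bash]
        source: |
          # Run the real task in the background so bash can react to SIGTERM
          # immediately, then exit with a code the expression can recognize.
          trap 'echo "node shutting down; treating this run as preempted"; exit 143' TERM
          python /app/task.py &   # hypothetical task entrypoint
          wait $!
```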
I wonder if an easier and more reliable option would be to surface the pod's message to the template context passed to expression. Then I could check for preempted steps without modifying application code or using unstable environment flags. What do you think @sarabala1979 @alexec? Would that feature make sense, or is there a better way to retry preempted steps? (Hypothetical sketch of what I have in mind below.)
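For what it's worth, if the message were exposed, I imagine the retry strategy could be as simple as something like this (entirely hypothetical, since no such lastRetry.message field exists in the expression context today):

```yaml
# Hypothetical: assumes a lastRetry.message field that does not exist yet;
# the quoted string is the message preempted pods currently get.
retryStrategy:
  limit: "3"   # placeholder
  expression: "lastRetry.status == 'Error' || lastRetry.message matches 'Node is shutting'"
```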