Handling worker preemption #847
Replies: 3 comments
-
I remember @bgentry's talked about being interested in doing this one before, but to my knowledge neither of us is actively working on it at the moment. The right answer is probably not to expect it too soon, though we'll likely get to it at some point in the coming months.
-
The foundations for more reliable liveness detection are now in River Pro, though they're not used for that purpose yet. More to the point, though, I wonder if this is really needed for your use case? When preempting a VM, does your infrastructure provider offer any mechanism to signal your OS/container/process about the impending shutdown? I know this exists for both AWS and GCP, for example, and either gives you plenty of time (if you're watching the appropriate input) for your app to cancel any ongoing work and avoid needing to rely on the rescuer. If you can detect the impending shutdown, you can call `Stop` (or `StopAndCancel`) on the client. Some of this is illustrated in the graceful shutdown example in the docs.
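A minimal sketch of that flow, assuming the preemption notice arrives as SIGTERM (the 15s/10s budgets are illustrative, picked to fit inside a ~30s notice window):

```go
package main

import (
	"context"
	"os/signal"
	"syscall"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// waitAndShutdown blocks until a termination signal arrives (e.g. the
// SIGTERM a cloud provider sends ahead of preemption), then stops the
// client in two phases, both capped to fit inside the notice window.
func waitAndShutdown(client *river.Client[pgx.Tx]) error {
	sigCtx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()
	<-sigCtx.Done() // preemption notice received

	// Phase 1, soft stop: stop fetching new work and give running jobs a
	// chance to finish on their own.
	softCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	if err := client.Stop(softCtx); err == nil {
		return nil
	}

	// Phase 2, hard stop: cancel the contexts of jobs still running so
	// workers can checkpoint or snooze and return before the VM goes away.
	hardCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return client.StopAndCancel(hardCtx)
}
```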
-
Thanks @bgentry @brandur for the insight. It's true that major cloud providers signal preemption ~30s before the shutdown happens, which should be enough to snooze all jobs as long as the machine isn't heavily overloaded. That said, preemption isn't the only concern here: k8s pods may be killed without the shutdown grace period being respected, which results in the same issue as the one described in this post.

I've implemented a simple heartbeat mechanism in my app by adding a new table with the job ID plus the latest heartbeat, and comparing the latest heartbeat against the current time for running jobs. If the latest heartbeat is too old, I snooze the job from the client. It's a bit hacky, but as long as the main executor interfaces remain stable it should do the job. Let me know if you think there's a better way with the current tooling, thank you!
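For reference, here's roughly what mine looks like (a sketch only: the `job_heartbeat` table and both helpers are names from my setup, not River APIs; requeueing the stale IDs from the client is left out):

```go
package main

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Hypothetical schema, refreshed by each running job:
//
//   CREATE TABLE job_heartbeat (
//       job_id       BIGINT PRIMARY KEY,
//       heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT now()
//   );

const (
	heartbeatInterval = 15 * time.Second
	staleAfter        = 60 * time.Second
)

// beat upserts a heartbeat row for a job until ctx is cancelled; run it in
// a goroutine from Work() and cancel it when the job returns.
func beat(ctx context.Context, db *pgxpool.Pool, jobID int64) {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = db.Exec(ctx, `
				INSERT INTO job_heartbeat (job_id, heartbeat_at) VALUES ($1, now())
				ON CONFLICT (job_id) DO UPDATE SET heartbeat_at = now()`, jobID)
		}
	}
}

// staleJobIDs lists running jobs whose heartbeat has gone quiet; a periodic
// janitor can then requeue them from the client.
func staleJobIDs(ctx context.Context, db *pgxpool.Pool) ([]int64, error) {
	rows, err := db.Query(ctx, `
		SELECT h.job_id
		FROM job_heartbeat h
		JOIN river_job j ON j.id = h.job_id
		WHERE j.state = 'running'
		  AND h.heartbeat_at < now() - make_interval(secs => $1)`,
		staleAfter.Seconds())
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```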
-
Hi all,
I'm interested in testing riverqueue in my systems, but I just realized that it may not fit my deployment model. Some of my workers run on preemptible VMs/nodes and may not be able to finish their assigned jobs. In case of preemption, I'd have to rely on the `rescuer`, which may take too long to realize that a job has to be rescheduled: my jobs can take a long while to finish, so it's hard to distinguish a genuine timeout from a job that's simply still working.

I think we could use the `Snooze` functionality on context cancellation plus a signal handler to re-enqueue the task, but that seems brittle (i.e. it's not clear whether the workers have enough time to notice the cancellation and update the DB to snooze the jobs upon preemption). Also, if the node has health issues, there will certainly be no time to use `Snooze`. Roughly what I have in mind is sketched below.

Is there any heartbeat mechanism to assess whether the workers are alive? asynq has a "lease" concept, for example, and requeues jobs when a worker's lease is not renewed on time.
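Something like this, where `LongArgs`, `LongWorker`, and `nextChunk` are just placeholders:

```go
package main

import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

type LongArgs struct{}

func (LongArgs) Kind() string { return "long_running" }

type LongWorker struct {
	river.WorkerDefaults[LongArgs]
}

// Work processes in small steps and, if the context is cancelled (e.g. by a
// SIGTERM handler during preemption), snoozes the job back onto the queue
// instead of letting it fail.
func (w *LongWorker) Work(ctx context.Context, job *river.Job[LongArgs]) error {
	for {
		select {
		case <-ctx.Done():
			return river.JobSnooze(10 * time.Second)
		default:
		}
		done, err := nextChunk(ctx) // hypothetical unit of work
		if err != nil || done {
			return err
		}
	}
}

// nextChunk stands in for one resumable slice of the real job.
func nextChunk(ctx context.Context) (bool, error) { return true, nil }
```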
Thanks!