Handling worker preemption #847
Replies: 3 comments
-
I remember @bgentry's talked about being interested in doing this one before, but to my knowledge neither of us is actively working on it at the moment. The right answer is probably not to expect it too soon, though we'll likely get to it at some point in the coming months.
-
The foundations for more reliable liveness detection are now in River Pro, though they're not used for that purpose yet. More to the point, though, I wonder if this is really needed for your use case? When preempting a VM, does your infrastructure provider offer any mechanism to signal your OS/container/process about the impending shutdown? I know this exists for both AWS and GCP, for example, and either gives you plenty of time (if you're watching the appropriate input) for your app to cancel any ongoing work and avoid needing to rely on the rescuer. If you can detect the impending shutdown, you can call `Stop` (or `StopAndCancel`) on the client. Some of this is illustrated in the graceful shutdown example in the docs.
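A minimal sketch of that flow, assuming the preemption notice arrives as SIGTERM (the 15s/10s budgets are illustrative, picked to fit inside a ~30s notice window):

```go
package main

import (
	"context"
	"os/signal"
	"syscall"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// waitAndShutdown blocks until a termination signal arrives (e.g. the
// SIGTERM a cloud provider sends ahead of preemption), then stops the
// client in two phases, both capped to fit inside the notice window.
func waitAndShutdown(client *river.Client[pgx.Tx]) error {
	sigCtx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()
	<-sigCtx.Done() // preemption notice received

	// Phase 1, soft stop: stop fetching new work and give running jobs a
	// chance to finish on their own.
	softCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	if err := client.Stop(softCtx); err == nil {
		return nil
	}

	// Phase 2, hard stop: cancel the contexts of jobs still running so
	// workers can checkpoint or snooze and return before the VM goes away.
	hardCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return client.StopAndCancel(hardCtx)
}
```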
-
Thanks @bgentry @brandur for the insight. It's true that major cloud providers signal preemption ~30s before the shutdown happens, which should be enough to snooze all jobs as long as the machine isn't heavily overloaded. That said, preemption isn't the only concern here: k8s pods may be killed without the shutdown grace period being respected, which results in the same issue as the one described in this post.

I've implemented a simple heartbeat mechanism in my app by adding a new table with the job ID plus the latest heartbeat, and comparing the latest heartbeat against the current time for running jobs. If the latest heartbeat is too old, I snooze the job from the client. It's a bit hacky, but as long as the main executor interfaces remain stable it should do the job. Let me know if you think there's a better way with the current tooling, thank you!
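For reference, here's roughly what mine looks like (a sketch only: the `job_heartbeat` table and both helpers are names from my setup, not River APIs; requeueing the stale IDs from the client is left out):

```go
package main

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Hypothetical schema, refreshed by each running job:
//
//   CREATE TABLE job_heartbeat (
//       job_id       BIGINT PRIMARY KEY,
//       heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT now()
//   );

const (
	heartbeatInterval = 15 * time.Second
	staleAfter        = 60 * time.Second
)

// beat upserts a heartbeat row for a job until ctx is cancelled; run it in
// a goroutine from Work() and cancel it when the job returns.
func beat(ctx context.Context, db *pgxpool.Pool, jobID int64) {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = db.Exec(ctx, `
				INSERT INTO job_heartbeat (job_id, heartbeat_at) VALUES ($1, now())
				ON CONFLICT (job_id) DO UPDATE SET heartbeat_at = now()`, jobID)
		}
	}
}

// staleJobIDs lists running jobs whose heartbeat has gone quiet; a periodic
// janitor can then requeue them from the client.
func staleJobIDs(ctx context.Context, db *pgxpool.Pool) ([]int64, error) {
	rows, err := db.Query(ctx, `
		SELECT h.job_id
		FROM job_heartbeat h
		JOIN river_job j ON j.id = h.job_id
		WHERE j.state = 'running'
		  AND h.heartbeat_at < now() - make_interval(secs => $1)`,
		staleAfter.Seconds())
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```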
-
Hi all,
I'm interested in testing riverqueue in my systems, but I just realized that it may not fit my deployment model. Some of my workers run on preemptible VMs/nodes and may not be able to finish their assigned jobs. In case of preemption, I'd have to rely on the `rescuer`, which may take too long to realize that a job has to be rescheduled: my jobs can take a long while to finish, so it's hard to distinguish a genuine timeout from a job that's simply still working.

I think we could use the `Snooze` functionality on context cancellation plus a signal handler to re-enqueue the task, but that seems brittle (i.e. it's not clear whether the workers have enough time to notice the cancellation and update the DB to snooze the jobs upon preemption). Also, if the node has health issues, there will certainly be no time to use `Snooze`. Roughly what I have in mind is sketched below.

Is there any heartbeat mechanism to assess whether the workers are alive? asynq has a "lease" concept, for example, and requeues jobs when a worker's lease is not renewed on time.
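Something like this, where `LongArgs`, `LongWorker`, and `nextChunk` are just placeholders:

```go
package main

import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

type LongArgs struct{}

func (LongArgs) Kind() string { return "long_running" }

type LongWorker struct {
	river.WorkerDefaults[LongArgs]
}

// Work processes in small steps and, if the context is cancelled (e.g. by a
// SIGTERM handler during preemption), snoozes the job back onto the queue
// instead of letting it fail.
func (w *LongWorker) Work(ctx context.Context, job *river.Job[LongArgs]) error {
	for {
		select {
		case <-ctx.Done():
			return river.JobSnooze(10 * time.Second)
		default:
		}
		done, err := nextChunk(ctx) // hypothetical unit of work
		if err != nil || done {
			return err
		}
	}
}

// nextChunk stands in for one resumable slice of the real job.
func nextChunk(ctx context.Context) (bool, error) { return true, nil }
```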
Thanks!