Skip to content

Conversation

@cre8ivejp
Copy link
Member

@cre8ivejp cre8ivejp commented Oct 29, 2025

Fix #2178

I noticed that after stopping the health check, it returns 503, but after a few seconds, it starts returning 200 again, causing issues making the pod ready again to receive new requests.

Things done

  • Fix race condition when stopping health check during shutdown
  • Changed to fail readiness probe, right after getting the sigterm
  • Added unit tests

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Comment on lines +108 to +113
// Don't run checks if already stopped
// This prevents the race condition where Stop() sets status to Unhealthy
// but then check() overrides it back to Healthy
if hc.isStopped() {
return
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was the Run() running in another goroutine.
So, we ensure that it will skip the check after stopping the http server.

@cre8ivejp cre8ivejp marked this pull request as ready for review October 29, 2025 09:32
@Ubisoft-potato
Copy link
Collaborator

https://github.com/bucketeer-io/bucketeer/blob/chore-improve-shutdown/pkg/subscriber/cmd/server/server.go#L364-L370

We should add sleep propagationDelay in subscriber service, too.

// Wait for K8s endpoint propagation
// This prevents "context deadline exceeded" errors during high traffic.
time.Sleep(propagationDelay)

@cre8ivejp
Copy link
Member Author

https://github.com/bucketeer-io/bucketeer/blob/chore-improve-shutdown/pkg/subscriber/cmd/server/server.go#L364-L370

We should add sleep propagationDelay in subscriber service, too.

// Wait for K8s endpoint propagation
// This prevents "context deadline exceeded" errors during high traffic.
time.Sleep(propagationDelay)

@Ubisoft-potato, the reason I didn't use a delay is because the subscriber doesn't receive http or grpc requests. It only pulls messages from pubsub.

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@Ubisoft-potato
Copy link
Collaborator

https://github.com/bucketeer-io/bucketeer/blob/chore-improve-shutdown/pkg/subscriber/cmd/server/server.go#L364-L370
We should add sleep propagationDelay in subscriber service, too.

// Wait for K8s endpoint propagation
// This prevents "context deadline exceeded" errors during high traffic.
time.Sleep(propagationDelay)

@Ubisoft-potato, the reason I didn't use a delay is because the subscriber doesn't receive http or grpc requests. It only pulls messages from pubsub.

Understand! Then we also dont need wait propagationDelay in the deployment.yaml
Thanks for explaination!

Copy link
Collaborator

@Ubisoft-potato Ubisoft-potato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

@cre8ivejp cre8ivejp merged commit d289118 into main Oct 30, 2025
11 checks passed
@cre8ivejp cre8ivejp deleted the chore-improve-shutdown branch October 30, 2025 04:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: race condition when stopping health check during shutdown

3 participants