
Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113


Open
l0wl3vel opened this issue Feb 14, 2025 · 6 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@l0wl3vel

l0wl3vel commented Feb 14, 2025

What steps did you take and what happened:

While investigating an issue in open-policy-agent/gatekeeper regarding unreliable webhooks we found that the default signal handler from controller-runtime leads to unreliable webhook handling on controller pod shutdown. The detailed investigation can be found in the gatekeeper repo: open-policy-agent/gatekeeper#3776

Because service endpoint handling in Kubernetes is asynchronous, new connections are still routed to a terminating pod for a short period after it has been signaled to shut down. Without a grace period that lets the endpoint updates propagate to the Kubernetes nodes, these new connections fail.

High availability of the webhook handler does not alleviate this behavior.

What did you expect to happen:

Shutting down a highly available webhook handler does not lead to failing webhooks.

Mitigations:

We also found that adding a preStop hook that waits a few seconds to the gatekeeper-controller-manager prevents failing requests.
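For reference, a mitigation of this shape is a short sleep in the pod spec's preStop hook. This is a sketch, not gatekeeper's actual manifest; the container name and the 5-second value are illustrative:

```yaml
containers:
  - name: manager
    lifecycle:
      preStop:
        exec:
          # Keep serving while endpoint removal propagates to the nodes;
          # SIGTERM is only delivered after this command returns.
          command: ["sleep", "5"]
```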

I do not think this is the right mitigation, though. It would require every operator with webhooks to add this hook to its distributed manifests. Fixing the issue in the signal handler would have a bigger impact, since consumers would only need to bump their controller-runtime dependency.

Related abandoned PRs: #2601 #2607

@alvaroaleman
Member

A binary using signals.SetupSignalHandler may or may not contain any sort of webhook, so that doesn't seem like a good place.

@l0wl3vel
Author

l0wl3vel commented Feb 17, 2025

It is not the ideal place, agreed. It was the only place in gatekeeper where we could implement graceful shutdown to work around the controller-runtime behavior.

I looked into the PRs and like the approach. What would it take to revive the discussion around #2601?

It seems like the consensus is that it is a useful addition. The only open point of discussion was whether the default timeout should be zero or something else.

@alvaroaleman
Member

I looked into the PRs and like the approach. What would it take to revive the discussion around #2601?

It looks like it just ended up getting abandoned by the author

@l0wl3vel
Author

Would it be possible that I take over the PR?

@alvaroaleman
Member

Would it be possible that I take over the PR?

Sure

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2025