
Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113

Open
@l0wl3vel


What steps did you take and what happened:

While investigating an issue in open-policy-agent/gatekeeper regarding unreliable webhooks, we found that the default signal handler from controller-runtime leads to unreliable webhook handling when the controller pod shuts down. The detailed investigation can be found in the gatekeeper repo: open-policy-agent/gatekeeper#3776

Because service endpoint handling in K8s is asynchronous, new connections are still being established to the terminating pod for a short period after its termination has begun. These connections fail because there is no grace period during which the pod keeps serving while the endpoint updates propagate to the K8s nodes.

High availability of the webhook handler does not alleviate this behavior.

What did you expect to happen:

Shutting down a highly available webhook handler should not lead to failing webhooks.

Mitigations:

We also found that adding a preStop hook to the gatekeeper-controller-manager, configured to wait a few seconds, prevents failing requests.

I do not think this is the right mitigation, though. It would require every operator that ships webhooks to add this hook to its distributed manifests. Fixing the issue in the signal handler would have a broader impact, because downstream projects would only need to bump their controller-runtime dependency.

Related abandoned PRs: #2601 #2607

Labels: lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed)
