KubeadmControlPlane controller hangs during cert expiry check in unstable network conditions #12460

Description

@LingyanCao

What steps did you take and what happened?

We're using Cluster API to provision Kubernetes clusters on edge hardware, where connectivity between the management cluster (in the cloud) and the target clusters (on the edge) goes through a network with intermittent connectivity and limits on long-lived requests (e.g., NAT or HTTPS proxying).

We’re encountering an issue where the KubeadmControlPlane controller hangs for hours to days during reconciliation, specifically at the certificate expiry check step for the kube-apiserver.

Eventually, the controller logs an error like:

```
Reconciler error: failed to reconcile certificate expiry for Machine/...: unable to get certificate expiry for kube-apiserver on Node/...: unable to dial to kube-apiserver: error upgrading connection: error sending request: Post "https://.../portforward?...": unexpected EOF
```

In verbose logs, reconciliation gets stuck at:

```
Reconciling certificate expiry
```
This behavior is consistently reproducible in our environment and blocks reconciliation for other clusters as well.

What did you expect to happen?

The controller should:
- Handle failure or timeout of port-forward operations gracefully
- Avoid blocking reconciliation indefinitely due to network instability

Cluster API version

1.9.5

Kubernetes version

1.30.6

Anything else you would like to add?

Relevant code for the kube-apiserver port-forwarding during certificate expiry check:
https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/proxy/dial.go#L87

In our environment, I also observed similar port-forward failures when accessing etcd pods. However, the etcd code path wraps the port-forward logic in a context with a 15-second timeout:
https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/etcd/etcd.go#L85

As a result, even when etcd port-forwarding fails, the reconciliation completes and requeues gracefully. In contrast, the kube-apiserver path appears to lack such timeout handling, causing the controller to hang indefinitely when port-forwarding is blocked or dropped.

A similar timeout strategy for kube-apiserver access could improve resilience in environments with unstable or proxied networking.
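
For illustration, here is a minimal sketch of the kind of bound I have in mind, assuming a hypothetical helper getCertExpiry that stands in for the port-forward and TLS handshake in dial.go; the function names and the 15-second value are placeholders borrowed from the etcd path, not the actual Cluster API code:

```go
// Minimal sketch only: reconcileCertificateExpiry, getCertExpiry, and the
// 15-second constant are hypothetical stand-ins, not the actual Cluster API
// functions or values.
package main

import (
	"context"
	"fmt"
	"time"
)

// Mirrors the bound used on the etcd path; the exact value is an assumption.
const apiServerDialTimeout = 15 * time.Second

func reconcileCertificateExpiry(ctx context.Context, nodeName string) error {
	// Bound the port-forward/dial so a stalled or dropped connection cannot
	// block the reconcile loop indefinitely.
	dialCtx, cancel := context.WithTimeout(ctx, apiServerDialTimeout)
	defer cancel()

	expiry, err := getCertExpiry(dialCtx, nodeName)
	if err != nil {
		// Return the error so the controller requeues instead of hanging.
		return fmt.Errorf("unable to get certificate expiry for kube-apiserver on Node/%s: %w", nodeName, err)
	}
	fmt.Printf("kube-apiserver certificate on %s expires at %s\n", nodeName, expiry)
	return nil
}

// getCertExpiry stands in for the real port-forward + TLS handshake logic;
// here it only demonstrates respecting context cancellation.
func getCertExpiry(ctx context.Context, nodeName string) (time.Time, error) {
	select {
	case <-ctx.Done():
		return time.Time{}, ctx.Err()
	case <-time.After(100 * time.Millisecond): // simulate a fast, healthy dial
		return time.Now().Add(365 * 24 * time.Hour), nil
	}
}

func main() {
	if err := reconcileCertificateExpiry(context.Background(), "edge-node-0"); err != nil {
		fmt.Println("reconcile error:", err)
	}
}
```

Cancelling via the context would let the controller return an error and requeue with backoff, which is how the etcd path already behaves when port-forwarding fails.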

Label(s) to be applied

/kind bug
/area control-plane
/area provider/control-plane-kubeadm
