Description
What steps did you take and what happened?
We're using Cluster API to provision Kubernetes clusters on edge hardware, where traffic between the management cluster (in the cloud) and the target clusters (on the edge) traverses a network with intermittent connectivity and limits on long-lived requests (e.g., NAT or HTTPS proxying).
We’re encountering an issue where the KubeadmControlPlane controller hangs for hours to days during reconciliation, specifically at the certificate expiry check step for the kube-apiserver.
Eventually, the controller logs an error like:
Reconciler error: failed to reconcile certificate expiry for Machine/...: unable to get certificate expiry for kube-apiserver on Node/...: unable to dial to kube-apiserver: error upgrading connection: error sending request: Post "https://.../portforward?...": unexpected EOF
In verbose logs, reconciliation gets stuck at:
Reconciling certificate expiry
This behavior is consistently reproducible in our environment and blocks reconciliation for other clusters as well, presumably because the hung reconcile ties up one of the controller's concurrent workers.
What did you expect to happen?
The controller should:
- Handle failure or timeout of port-forward operations gracefully
- Avoid blocking reconciliation indefinitely due to network instability (a sketch of what I mean follows below)
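To illustrate the second point, here is a hedged sketch using generic controller-runtime patterns; it is not the actual KubeadmControlPlane controller code, and the reconciler and helper names are made up. The idea is that if the expiry probe is bounded by a deadline (and respects context cancellation), a hung port-forward turns into a returned error, controller-runtime requeues the object with backoff, and the worker is free to reconcile other clusters.

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler is an illustrative stand-in, not the real
// KubeadmControlPlane reconciler.
type exampleReconciler struct {
	// checkCertificateExpiry stands in for the certificate expiry probe
	// that currently dials kube-apiserver over a port-forward.
	checkCertificateExpiry func(ctx context.Context, req ctrl.Request) error
}

// Reconcile bounds the expiry check with a deadline. Assuming the probe
// honors ctx cancellation, a hung port-forward makes the check return an
// error once the deadline fires, and controller-runtime requeues the
// object with backoff instead of keeping a worker blocked indefinitely.
func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	checkCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
	defer cancel()

	if err := r.checkCertificateExpiry(checkCtx, req); err != nil {
		// Returning the error triggers a requeue with exponential backoff.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```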
Cluster API version
1.9.5
Kubernetes version
1.30.6
Anything else you would like to add?
Relevant code for the kube-apiserver port-forwarding during certificate expiry check:
https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/proxy/dial.go#L87
In our environment, I also observed similar port-forward failures when accessing etcd pods. However, the etcd code path wraps the port-forward logic in a context with a 15-second timeout:
https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/etcd/etcd.go#L85
As a result, even when etcd port-forwarding fails, the reconciliation completes and requeues gracefully. In contrast, the kube-apiserver path appears to lack such timeout handling, causing the controller to hang indefinitely when port-forwarding is blocked or dropped.
A similar timeout strategy for kube-apiserver access could improve resilience in environments with unstable or proxied networking.
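Concretely, this is the kind of bound I have in mind. The following is a minimal, self-contained sketch using only the standard library; the real code in dial.go goes through client-go's SPDY port-forward dialer rather than a plain TCP dial, so the function name, the stand-in dialer, and the 15-second value (borrowed from the etcd path) are all illustrative.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialAPIServerWithTimeout is an illustrative stand-in for the
// port-forward dial in controlplane/kubeadm/internal/proxy/dial.go.
// Deriving a child context with a deadline means a dial that would
// otherwise block on a dropped or proxied connection fails after
// `timeout` instead of hanging the reconcile.
func dialAPIServerWithTimeout(parent context.Context, addr string, timeout time.Duration) (net.Conn, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	var d net.Dialer // stand-in for the SPDY/port-forward dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("unable to dial to kube-apiserver at %s: %w", addr, err)
	}
	return conn, nil
}

func main() {
	// A non-routable test address (TEST-NET-3) typically blocks until the
	// 2-second deadline fires, demonstrating the bounded failure.
	_, err := dialAPIServerWithTimeout(context.Background(), "203.0.113.1:6443", 2*time.Second)
	fmt.Println("dial result:", err)
}
```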
Label(s) to be applied
/kind bug
/area control-plane
/area provider/control-plane-kubeadm