Description
What steps did you take and what happened?
We're using Cluster API to provision Kubernetes clusters on edge hardware, where traffic between the management cluster (in the cloud) and the target clusters (on the edge) traverses a network with intermittent connectivity and limits on long-lived requests (e.g., NAT or HTTPS proxying).
We’re encountering an issue where the KubeadmControlPlane controller hangs for hours to days during reconciliation, specifically at the certificate expiry check step for the kube-apiserver.
Eventually, the controller logs an error like:
Reconciler error: failed to reconcile certificate expiry for Machine/...: unable to get certificate expiry for kube-apiserver on Node/...: unable to dial to kube-apiserver: error upgrading connection: error sending request: Post "https://.../portforward?...": unexpected EOF
In verbose logs, reconciliation gets stuck at:
Reconciling certificate expiry
This behavior is consistently reproducible in our environment and blocks reconciliation for other clusters as well, presumably because the hung reconcile ties up one of the controller's concurrent workers.
What did you expect to happen?
The controller should:
- Handle failure or timeout of port-forward operations gracefully
- Avoid blocking reconciliation indefinitely due to network instability (a sketch of what I mean follows below)
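To illustrate the second point, here is a hedged sketch using generic controller-runtime patterns; it is not the actual KubeadmControlPlane controller code, and the reconciler and helper names are made up. The idea is that if the expiry probe is bounded by a deadline (and respects context cancellation), a hung port-forward turns into a returned error, controller-runtime requeues the object with backoff, and the worker is free to reconcile other clusters.

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler is an illustrative stand-in, not the real
// KubeadmControlPlane reconciler.
type exampleReconciler struct {
	// checkCertificateExpiry stands in for the certificate expiry probe
	// that currently dials kube-apiserver over a port-forward.
	checkCertificateExpiry func(ctx context.Context, req ctrl.Request) error
}

// Reconcile bounds the expiry check with a deadline. Assuming the probe
// honors ctx cancellation, a hung port-forward makes the check return an
// error once the deadline fires, and controller-runtime requeues the
// object with backoff instead of keeping a worker blocked indefinitely.
func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	checkCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
	defer cancel()

	if err := r.checkCertificateExpiry(checkCtx, req); err != nil {
		// Returning the error triggers a requeue with exponential backoff.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```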
Cluster API version
1.9.5
Kubernetes version
1.30.6
Anything else you would like to add?
Relevant code for the kube-apiserver port-forwarding during certificate expiry check:
https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/proxy/dial.go#L87
In our environment, I also observed similar port-forward failures when accessing etcd pods. However, the etcd code path wraps the port-forward logic in a context with a 15-second timeout:
https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/etcd/etcd.go#L85
As a result, even when etcd port-forwarding fails, the reconciliation completes and requeues gracefully. In contrast, the kube-apiserver path appears to lack such timeout handling, causing the controller to hang indefinitely when port-forwarding is blocked or dropped.
A similar timeout strategy for kube-apiserver access could improve resilience in environments with unstable or proxied networking.
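Concretely, this is the kind of bound I have in mind. The following is a minimal, self-contained sketch using only the standard library; the real code in dial.go goes through client-go's SPDY port-forward dialer rather than a plain TCP dial, so the function name, the stand-in dialer, and the 15-second value (borrowed from the etcd path) are all illustrative.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialAPIServerWithTimeout is an illustrative stand-in for the
// port-forward dial in controlplane/kubeadm/internal/proxy/dial.go.
// Deriving a child context with a deadline means a dial that would
// otherwise block on a dropped or proxied connection fails after
// `timeout` instead of hanging the reconcile.
func dialAPIServerWithTimeout(parent context.Context, addr string, timeout time.Duration) (net.Conn, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	var d net.Dialer // stand-in for the SPDY/port-forward dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("unable to dial to kube-apiserver at %s: %w", addr, err)
	}
	return conn, nil
}

func main() {
	// A non-routable test address (TEST-NET-3) typically blocks until the
	// 2-second deadline fires, demonstrating the bounded failure.
	_, err := dialAPIServerWithTimeout(context.Background(), "203.0.113.1:6443", 2*time.Second)
	fmt.Println("dial result:", err)
}
```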
Label(s) to be applied
/kind bug
/area control-plane
/area provider/control-plane-kubeadm