Description
/kind feature
Describe the solution you'd like
Currently the OpenStack load balancer monitor for the API server created by CAPO has hardcoded settings, as seen in cluster-api-provider-openstack/pkg/cloud/services/loadbalancer/loadbalancer.go (lines 223 to 230 at 944b265):
```go
monitorCreateOpts := monitors.CreateOpts{
	Name:       monitorName,
	PoolID:     poolID,
	Type:       "TCP",
	Delay:      30,
	Timeout:    5,
	MaxRetries: 3,
}
```
In other words, Delay: 30, Timeout: 5, MaxRetries: 3 means that after 105 (= 3*(30+5)) seconds of downtime, API server pool members will be marked as down, and once a member is back up again, another MaxRetries (3) successful checks are needed before it is added back into the pool, i.e. 105 more seconds.

This in turn means that if all API server members become unavailable at the same time for 1.5 minutes, the total downtime will be at least 3.5 minutes (2*3*(30+5) = 210 seconds).
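A quick back-of-the-envelope check of that arithmetic (plain Go, nothing CAPO-specific):

```go
package main

import "fmt"

func main() {
	delay, timeout, maxRetries := 30, 5, 3 // the hardcoded monitor settings

	// Worst-case seconds for the monitor to flip a member's state;
	// max-retries applies to both the down and the up transition.
	perTransition := maxRetries * (delay + timeout)
	fmt.Println("mark down (or up): ", perTransition, "s")   // 105
	fmt.Println("full down+up cycle:", 2*perTransition, "s") // 210 = 3.5 min
}
```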
This may sound weird, but it is the official behavior of max-retries to apply both to taking members down and to bringing them back up, as per https://docs.openstack.org/octavia/queens/user/guides/basic-cookbook.html#heath-monitor-options (quoted below):

> max-retries: Number of subsequent health checks a given back-end server must fail before it is considered down, or that a failed back-end server must pass to be considered up again.
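For illustration, one possible shape for making these settings configurable (a sketch only; the MonitorSpec type, its field names, and the defaulting behavior are my assumptions, not an agreed-upon API):

```go
package loadbalancer

import (
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/monitors"
)

// MonitorSpec is a hypothetical API addition that would let users
// override the monitor settings per cluster.
type MonitorSpec struct {
	Delay      int `json:"delay,omitempty"`      // seconds between health checks
	Timeout    int `json:"timeout,omitempty"`    // seconds before a single check times out
	MaxRetries int `json:"maxRetries,omitempty"` // checks needed to flip a member down or up
}

// monitorCreateOpts keeps today's hardcoded values as defaults and
// applies any user-supplied overrides.
func monitorCreateOpts(monitorName, poolID string, spec *MonitorSpec) monitors.CreateOpts {
	opts := monitors.CreateOpts{
		Name:       monitorName,
		PoolID:     poolID,
		Type:       "TCP",
		Delay:      30,
		Timeout:    5,
		MaxRetries: 3,
	}
	if spec != nil {
		if spec.Delay > 0 {
			opts.Delay = spec.Delay
		}
		if spec.Timeout > 0 {
			opts.Timeout = spec.Timeout
		}
		if spec.MaxRetries > 0 {
			opts.MaxRetries = spec.MaxRetries
		}
	}
	return opts
}
```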
Anything else you would like to add:
- We had a brief outage related to this when all our control-plane nodes were accidentally running on the same OpenStack Nova/hypervisor host, which had network issues/downtime.
- We'll soon try out the hard/soft anti-affinity policies (see the sketch below), which will decrease the risk of this kind of failure, but faster recovery for API server LB pool members might still help.
- I haven't yet looked at whether/how other CAPI providers solve this.
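For reference, creating such an anti-affinity server group directly with gophercloud might look roughly like this (a sketch under the assumption of gophercloud's compute v2 servergroups extension; the group name is made up, and soft policies need Nova API microversion 2.15+):

```go
package main

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/servergroups"
)

func main() {
	// Assumes the usual OS_* environment variables for authentication.
	ao, err := openstack.AuthOptionsFromEnv()
	if err != nil {
		panic(err)
	}
	provider, err := openstack.AuthenticatedClient(ao)
	if err != nil {
		panic(err)
	}
	compute, err := openstack.NewComputeV2(provider, gophercloud.EndpointOpts{})
	if err != nil {
		panic(err)
	}
	// soft-anti-affinity requires Nova API microversion 2.15 or later.
	compute.Microversion = "2.15"

	sg, err := servergroups.Create(compute, servergroups.CreateOpts{
		Name:     "k8s-control-plane", // hypothetical name
		Policies: []string{"soft-anti-affinity"},
	}).Extract()
	if err != nil {
		panic(err)
	}
	fmt.Println("created server group:", sg.ID)
}
```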