Configurable api server loadbalancer monitor #1221

@MPV

/kind feature

Describe the solution you'd like

Currently the OpenStack loadbalancer monitor for the API server created by CAPO has hardcoded settings, as seen in:

monitorCreateOpts := monitors.CreateOpts{
	Name:       monitorName,
	PoolID:     poolID,
	Type:       "TCP",
	Delay:      30,
	Timeout:    5,
	MaxRetries: 3,
}

In other words, Delay: 30, Timeout: 5, MaxRetries: 3 means:

After 105 seconds (= 3*(30+5)) of downtime, API server pool members are marked as down. Once a member is back up, it must pass another 3 checks (MaxRetries) before it is added back to the pool, i.e. 105 more seconds.

In turn, if all API server members become unavailable at the same time for 1.5 minutes, the total downtime is at least 3.5 minutes (= 2*3*(30+5) seconds).

This may sound weird, but max-retries is documented to apply both to taking members down and to bringing them back up, as per https://docs.openstack.org/octavia/queens/user/guides/basic-cookbook.html#heath-monitor-options (quoted below):

max-retries: Number of subsequent health checks a given back-end server must fail before it is considered down, or that a failed back-end server must pass to be considered up again.
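One possible shape for the requested configurability, as a rough sketch only: optional overrides that fall back to today's hardcoded values when unset. MonitorConfig, buildMonitorOpts and orDefault are hypothetical names, and monitorCreateOpts is a local stand-in that mirrors the fields of the snippet above so the example runs without gophercloud. How CAPO would actually expose these settings is left open.

```go
package main

import "fmt"

// MonitorConfig is a hypothetical set of user-supplied overrides.
// A zero value for a field means "not set, keep the current default".
type MonitorConfig struct {
	Delay      int // seconds between health checks
	Timeout    int // seconds allowed for a single check
	MaxRetries int // consecutive checks needed to change member state
}

// monitorCreateOpts is a local stand-in mirroring the fields used in the
// snippet above; it is NOT the real gophercloud type.
type monitorCreateOpts struct {
	Name       string
	PoolID     string
	Type       string
	Delay      int
	Timeout    int
	MaxRetries int
}

// orDefault returns v when it is a positive override, otherwise def.
func orDefault(v, def int) int {
	if v > 0 {
		return v
	}
	return def
}

// buildMonitorOpts applies overrides on top of the currently hardcoded
// values (Delay 30, Timeout 5, MaxRetries 3). A nil cfg keeps all defaults.
func buildMonitorOpts(name, poolID string, cfg *MonitorConfig) monitorCreateOpts {
	if cfg == nil {
		cfg = &MonitorConfig{}
	}
	return monitorCreateOpts{
		Name:       name,
		PoolID:     poolID,
		Type:       "TCP",
		Delay:      orDefault(cfg.Delay, 30),
		Timeout:    orDefault(cfg.Timeout, 5),
		MaxRetries: orDefault(cfg.MaxRetries, 3),
	}
}

func main() {
	// No overrides: behaves like the current hardcoded settings.
	fmt.Printf("%+v\n", buildMonitorOpts("monitor", "pool-id", nil))
	// Partial override: a faster check interval and fewer retries.
	fmt.Printf("%+v\n", buildMonitorOpts("monitor", "pool-id", &MonitorConfig{Delay: 10, MaxRetries: 2}))
}
```

Falling back to the existing values keeps behavior unchanged for clusters that do not set anything, which is the main design point of the sketch.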

Anything else you would like to add:

  1. We had a brief outage related to this when all our control-plane nodes were accidentally running on the same OpenStack Nova/hypervisor host, which had network issues/downtime.
    • We'll soon try out the hard/soft anti-affinity policies, which will decrease the risk of this kind of failure, but faster recovery overall for API server LB pool members might still help.
  2. I haven't yet looked at if/how other CAPI providers solve this.

Metadata

Labels

  • good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
