Description
As a result of investigating issues found in multi-datacenter bootstrap, we concluded that the Operator does not currently follow topology change / rolling restart procedures: it should always ensure that all nodes in the cluster see all nodes as UP before progressing. See scylladb/scylladb#25410 (comment).
Although scylladb/scylladb@28c0a27 alleviates the issue on the happy path, it has become very prevalent in multi-datacenter E2E tests because the DCs undergo rolling restarts throughout the bootstrap of the entire cluster.
Acceptance criteria
- A mechanism ensuring that all nodes in the cluster see all nodes as UP exists as a prerequisite for the following operations (see the sketch below):
  - a topology change
  - each step of a rolling restart
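
A minimal sketch of what such a precondition check could look like on the Operator side, assuming the Scylla REST API's GET /gossiper/endpoint/live/ returns a JSON array of live endpoint addresses (the package, helper name, and wiring below are hypothetical):

```go
// Hypothetical sketch, not the Operator's actual code: before a topology
// change or each rolling-restart step, verify that every node's gossiper
// reports every node in the cluster as live.
package preflight

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// AllNodesSeeAllUp returns nil only if every node in nodes reports every
// node in nodes as live. apiPort is the Scylla REST API port (10000 by default).
func AllNodesSeeAllUp(ctx context.Context, client *http.Client, nodes []string, apiPort int) error {
	for _, observer := range nodes {
		url := fmt.Sprintf("http://%s:%d/gossiper/endpoint/live/", observer, apiPort)
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err != nil {
			return fmt.Errorf("can't query live endpoints on %s: %w", observer, err)
		}
		// Assumption: the endpoint returns a JSON array of live endpoint addresses.
		var live []string
		err = json.NewDecoder(resp.Body).Decode(&live)
		resp.Body.Close()
		if err != nil {
			return fmt.Errorf("can't decode live endpoints from %s: %w", observer, err)
		}
		liveSet := make(map[string]struct{}, len(live))
		for _, e := range live {
			liveSet[e] = struct{}{}
		}
		for _, expected := range nodes {
			if _, ok := liveSet[expected]; !ok {
				return fmt.Errorf("node %s does not see node %s as UP", observer, expected)
			}
		}
	}
	return nil
}
```

The Operator would call such a check before issuing a topology change and before proceeding to the next node of a rolling restart, and requeue until it passes.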
Notes
For more context, see:
- Frequent failures in multi-datacenter bootstrap: the topology coordinator rejected request to join the cluster: request canceled because some required nodes are dead scylladb#25410
- RFE: Add /readyz and /healthz probes scylladb#8275
- Add endpoint to REST API to indicate whether this node is fully up scylladb#16763
- During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster #1077
A predicted implementation issue is that Kubernetes uses the same readiness probe to determine whether:
a) the node is ready to serve traffic,
b) the node can be brought down / added; i.e., the entire cluster is ready for a topology change / rolling restart if all nodes are ready.
It seems like the two concepts should be kept separate, and we should try to implement that separation. We currently implement the readiness probe to answer only a), by checking whether the node 1. considers itself UP (via GET /gossiper/endpoint/live/) and 2. has its local CQL port open (via GET /storage_service/native_transport). Point b) is not addressed by the readiness probe, and addressing it there would result in availability issues.
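
For illustration, a sketch of the probe behaviour described above, under the assumption that GET /gossiper/endpoint/live/ returns a JSON array of live endpoint addresses and GET /storage_service/native_transport returns a JSON boolean (this is not the Operator's actual probe code):

```go
// Illustration only, not the Operator's actual probe code: a readiness check
// that answers a) alone, using the node's own Scylla REST API.
package probe

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// getJSON performs a GET request and decodes the JSON response into out.
func getJSON(ctx context.Context, client *http.Client, url string, out interface{}) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d from %s", resp.StatusCode, url)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

// nodeReady reports whether this node should pass the readiness probe:
// 1. it considers itself UP, and 2. its native transport (CQL) is running.
func nodeReady(ctx context.Context, client *http.Client, selfAddr string, apiPort int) (bool, error) {
	base := fmt.Sprintf("http://localhost:%d", apiPort)

	// 1. The node considers itself UP: assumed here to mean that its own
	// address appears in its live endpoint list.
	var live []string
	if err := getJSON(ctx, client, base+"/gossiper/endpoint/live/", &live); err != nil {
		return false, err
	}
	seesSelfUp := false
	for _, e := range live {
		if e == selfAddr {
			seesSelfUp = true
			break
		}
	}

	// 2. The local CQL (native transport) port is open.
	var cqlRunning bool
	if err := getJSON(ctx, client, base+"/storage_service/native_transport", &cqlRunning); err != nil {
		return false, err
	}

	return seesSelfUp && cqlRunning, nil
}
```

Keeping the probe limited to a) means a node is not marked unready merely because some other node is down, which avoids the availability issues mentioned above; the cluster-wide b) check would then live in the Operator's reconciliation instead.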
There is nothing preventing the probe from talking to other nodes. However, if we ask the entire cluster for the status of a node, for every single node, at a set frequency, it will cause a lot of noise. This may become a huge problem in clusters with high numbers of nodes, or in multi-datacenter clusters where cross-datacenter node-to-node communication may be expensive.
/kind bug
/priority important-soon