Description
As a result of investigating issues found in multi-datacenter bootstrap, we concluded that the Operator does not currently follow topology change / rolling restart procedures: it should always ensure that all nodes in the cluster see all nodes as UP before progressing. See scylladb/scylladb#25410 (comment).
Although scylladb/scylladb@28c0a27 alleviates the issue on the happy path, it has become very prevalent in multi-datacenter E2E tests because the DCs undergo rolling restarts throughout the bootstrap of the entire cluster.
Acceptance criteria
- A mechanism ensuring that all nodes in the cluster see all nodes as UP exists as a prerequisite for the following operations (see the sketch below):
  - a topology change
  - each step of a rolling restart
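
A minimal sketch of what such a precondition check could look like on the Operator side, assuming the Scylla REST API's GET /gossiper/endpoint/live/ returns a JSON array of live endpoint addresses (the package, helper name, and wiring below are hypothetical):

```go
// Hypothetical sketch, not the Operator's actual code: before a topology
// change or each rolling-restart step, verify that every node's gossiper
// reports every node in the cluster as live.
package preflight

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// AllNodesSeeAllUp returns nil only if every node in nodes reports every
// node in nodes as live. apiPort is the Scylla REST API port (10000 by default).
func AllNodesSeeAllUp(ctx context.Context, client *http.Client, nodes []string, apiPort int) error {
	for _, observer := range nodes {
		url := fmt.Sprintf("http://%s:%d/gossiper/endpoint/live/", observer, apiPort)
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err != nil {
			return fmt.Errorf("can't query live endpoints on %s: %w", observer, err)
		}
		// Assumption: the endpoint returns a JSON array of live endpoint addresses.
		var live []string
		err = json.NewDecoder(resp.Body).Decode(&live)
		resp.Body.Close()
		if err != nil {
			return fmt.Errorf("can't decode live endpoints from %s: %w", observer, err)
		}
		liveSet := make(map[string]struct{}, len(live))
		for _, e := range live {
			liveSet[e] = struct{}{}
		}
		for _, expected := range nodes {
			if _, ok := liveSet[expected]; !ok {
				return fmt.Errorf("node %s does not see node %s as UP", observer, expected)
			}
		}
	}
	return nil
}
```

The Operator would call such a check before issuing a topology change and before proceeding to the next node of a rolling restart, and requeue until it passes.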
Notes
For more context, see:
- Frequent failures in multi-datacenter bootstrap: the topology coordinator rejected request to join the cluster: request canceled because some required nodes are dead scylladb#25410
- RFE: Add /readyz and /healthz probes scylladb#8275
- Add endpoint to REST API to indicate whether this node is fully up scylladb#16763
- During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster #1077
A predicted implementation issue is that Kubernetes uses the same readiness probe to determine whether:
a) the node is ready to serve traffic,
b) the node can be brought down / added; i.e., the entire cluster is ready for a topology change / rolling restart if all nodes are ready.
It seems like the two concepts should be kept separate, and we should try to implement that separation. We currently implement the readiness probe to answer only a), by checking whether the node 1. considers itself UP (via GET /gossiper/endpoint/live/) and 2. has its local CQL port open (via GET /storage_service/native_transport). Point b) is not addressed by the readiness probe, and addressing it there would result in availability issues.
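
For illustration, a sketch of the probe behaviour described above, under the assumption that GET /gossiper/endpoint/live/ returns a JSON array of live endpoint addresses and GET /storage_service/native_transport returns a JSON boolean (this is not the Operator's actual probe code):

```go
// Illustration only, not the Operator's actual probe code: a readiness check
// that answers a) alone, using the node's own Scylla REST API.
package probe

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// getJSON performs a GET request and decodes the JSON response into out.
func getJSON(ctx context.Context, client *http.Client, url string, out interface{}) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d from %s", resp.StatusCode, url)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

// nodeReady reports whether this node should pass the readiness probe:
// 1. it considers itself UP, and 2. its native transport (CQL) is running.
func nodeReady(ctx context.Context, client *http.Client, selfAddr string, apiPort int) (bool, error) {
	base := fmt.Sprintf("http://localhost:%d", apiPort)

	// 1. The node considers itself UP: assumed here to mean that its own
	// address appears in its live endpoint list.
	var live []string
	if err := getJSON(ctx, client, base+"/gossiper/endpoint/live/", &live); err != nil {
		return false, err
	}
	seesSelfUp := false
	for _, e := range live {
		if e == selfAddr {
			seesSelfUp = true
			break
		}
	}

	// 2. The local CQL (native transport) port is open.
	var cqlRunning bool
	if err := getJSON(ctx, client, base+"/storage_service/native_transport", &cqlRunning); err != nil {
		return false, err
	}

	return seesSelfUp && cqlRunning, nil
}
```

Keeping the probe limited to a) means a node is not marked unready merely because some other node is down, which avoids the availability issues mentioned above; the cluster-wide b) check would then live in the Operator's reconciliation instead.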
There is nothing preventing the probe from talking to other nodes. However, if we ask the entire cluster for the status of a node, for every single node, at a set frequency, it will cause a lot of noise. This may become a huge problem in clusters with high numbers of nodes, or in multi-datacenter clusters where cross-datacenter node-to-node communication may be expensive.
/kind bug
/priority important-soon