Prevent graceful shutdown on split brain #2277
Conversation
Skipping CI for Draft Pull Request.
Force-pushed 0ca720f to 81914bb
Force-pushed 69899e3 to 2c38db6
health, err := podMeta.client.Container().HealthStatus()
if err != nil {
	ctx.Requeue(fmt.Errorf("unable to retrieve cluster health status for pod '%s': %w", pod.Name, err))
In this case, health is the new endpoint that only verifies if the process is alive?
It's currently calling /health/status and using the cluster_health field, which is determined by https://github.com/infinispan/infinispan/blob/3caa2cdf0dc191603067ab32353f072a966125a7/core/src/main/java/org/infinispan/health/impl/ClusterHealthImpl.java#L33-L41.

I think this should continue to use this endpoint, as that way we ensure we don't proceed with an upgrade if there are one or more DEGRADED caches. So maybe we shouldn't deprecate container/health/status after all?
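
A minimal sketch of how the cluster_health and per-cache fields could be consumed, assuming the documented shape of the /rest/v2/container/health payload; the package, type, and field names below are illustrative and are not the operator's actual client code:

package health

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// CacheHealth mirrors one entry of the cache_health array returned by the server.
type CacheHealth struct {
	Status    string `json:"status"`
	CacheName string `json:"cache_name"`
}

// ClusterHealth mirrors the cluster_health object whose health_status field is
// computed by ClusterHealthImpl.
type ClusterHealth struct {
	HealthStatus string `json:"health_status"`
}

// ContainerHealth is the full health payload: the aggregated cluster status
// plus per-cache statuses.
type ContainerHealth struct {
	ClusterHealth ClusterHealth `json:"cluster_health"`
	CacheHealth   []CacheHealth `json:"cache_health"`
}

// fetchHealth retrieves the full health report so callers can inspect both the
// overall cluster status and the status of individual caches.
func fetchHealth(baseURL string, client *http.Client) (*ContainerHealth, error) {
	rsp, err := client.Get(baseURL + "/rest/v2/container/health")
	if err != nil {
		return nil, fmt.Errorf("unable to retrieve container health: %w", err)
	}
	defer rsp.Body.Close()

	health := &ContainerHealth{}
	if err := json.NewDecoder(rsp.Body).Decode(health); err != nil {
		return nil, fmt.Errorf("unable to decode health payload: %w", err)
	}
	return health, nil
}

Decoding the full payload, rather than only the aggregated status string, is what makes it possible to report exactly which caches are DEGRADED.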
@Crumby Setting the cache to AVAILABLE explicitly seems to be working, from https://github.com/infinispan/infinispan-operator/actions/runs/15118659859/job/42495774879?pr=2277:

2025-05-19 17:14:03 INFO upgrade/common.go:283 Setting org.infinispan.LOCKS cache to AVAILABLE for Operand Upstream=14.0.9, Downstream=<nil>, Image=quay.io/infinispan/server:14.0.9.Final, CVE=false, Deprecated=false

I realised that I was passing the wrong operand though, which is fixed in the latest commit.
Force-pushed 6fa42fd to 3021317
@@ -141,18 +142,80 @@ func GracefulShutdown(i *ispnv1.Infinispan, ctx pipeline.Context) {
	return
}

var rebalanceDisabled bool
type PodMeta struct {
Should we name it ServerMeta? I think PodMeta is often used in relation to k8s Pod metadata, and I'd expect that kind of information in here. It may confuse us in the future.
// If any of the caches are not marked as HEALTHY we must prevent a GracefulShutdown to prevent
// the cluster from entering an unexpected state
if health != ispnApi.HealthStatusHealthy {
	ctx.Requeue(fmt.Errorf("unable to proceed with GracefulShutdown as the cluster health is '%s'", health))
1. Can we point out which caches are causing issues here, so the user knows where to look?
2. I think we should propagate the state to the Infinispan Status and generate warning events for users to see. The user may not have access to the Operator logs. (The same applies to the ClusterSize status.)
I have updated the code to fire a warning event when the upgrade is blocked.

Regarding 1 (pointing out which caches are causing issues): that should be possible if we retrieve the individual cache status from the /rest/v2/container/health endpoint.

Regarding 2 (propagating the state to the Infinispan Status and generating warning events): I believe that's beyond the scope of this PR, so I have created #2282 to track it.
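
A rough sketch, not the PR's implementation, of how both points could be combined: name the unhealthy caches in the requeue error and emit a Kubernetes Warning event via a record.EventRecorder. It reuses the ContainerHealth types from the sketch above, and the "GracefulShutdownBlocked" reason string is invented for the example:

package health

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// degradedCaches returns "name=STATUS" pairs for every cache that is not HEALTHY.
func degradedCaches(health *ContainerHealth) []string {
	var degraded []string
	for _, c := range health.CacheHealth {
		if c.Status != "HEALTHY" {
			degraded = append(degraded, fmt.Sprintf("%s=%s", c.CacheName, c.Status))
		}
	}
	return degraded
}

// warnIfUnhealthy fires a Warning event listing the offending caches, so the
// information is visible via `kubectl describe` without access to the Operator
// logs, and returns an error suitable for requeueing the reconciliation.
func warnIfUnhealthy(obj runtime.Object, health *ContainerHealth, recorder record.EventRecorder) error {
	degraded := degradedCaches(health)
	if len(degraded) == 0 {
		return nil
	}
	msg := fmt.Sprintf("unable to proceed with GracefulShutdown, unhealthy caches: %s", strings.Join(degraded, ", "))
	recorder.Event(obj, corev1.EventTypeWarning, "GracefulShutdownBlocked", msg)
	return fmt.Errorf("%s", msg)
}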
@ryanemerson Tests are completely missing here. I wonder whether we could create two:
Force-pushed 8c67f22 to b98c3b8
Both of these would be good to have. The problem is that, until we have #2282, there's no straightforward way to determine the Operator state in the tests, as we execute the operator locally in its own goroutine.

It's possible to manually set a cache to DEGRADED via the REST API: https://infinispan.org/docs/stable/titles/rest/rest.html#rest_v2_caches_set_availability
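
A hedged test-helper sketch of that REST call, assuming the documented set-availability action on /rest/v2/caches/{cacheName}; the package and helper names, and the error handling, are illustrative rather than taken from the operator's test suite:

package testutils

import (
	"fmt"
	"net/http"
)

// setCacheAvailability forces a cache into DEGRADED (or back to AVAILABLE) so a
// test can verify that GracefulShutdown is blocked while a cache is unhealthy.
func setCacheAvailability(client *http.Client, baseURL, cache, availability string) error {
	url := fmt.Sprintf("%s/rest/v2/caches/%s?action=set-availability&availability=%s", baseURL, cache, availability)
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	rsp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer rsp.Body.Close()
	if rsp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %d setting availability of cache '%s'", rsp.StatusCode, cache)
	}
	return nil
}

// Example usage in a test:
//   err := setCacheAvailability(httpClient, serverURL, "org.infinispan.LOCKS", "DEGRADED")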
Force-pushed b98c3b8 to 8f88b29
Writing the last comment made me re-think the problem 🙂 We already have the …

I have also added a test for the latter case.
Force-pushed 8f88b29 to fcc02ff
	continue
}
ctx.Requeue(fmt.Errorf("unable to create Infinispan client for cluster being upgraded: %w", err))
ctx.Requeue(fmt.Errorf("unable to create Infinispan client to determine if split-brain is present: %w", err))
Should the message here be more generic? As far as I understand the code, we are trying to get the client to retrieve various information, not just to detect split brain.
Updated to "unable to create Infinispan client to determine if cluster is healthy: %w".
Force-pushed fcc02ff to e3eeb88
…or any cache is not in a HEALTHY state
- Mark org.infinispan.LOCKS cache as AVAILABLE automatically in the upgrade testsuite to work around known issues such as: https://issues.redhat.com/browse/ISPN-15191

… container info on startup

avoid known clustering issues causing flakiness. Known issues:
- https://issues.redhat.com/browse/ISPN-14782
- https://issues.redhat.com/browse/ISPN-14793
- https://issues.redhat.com/browse/ISPN-15151
- https://issues.redhat.com/browse/ISPN-15191
- https://issues.redhat.com/browse/ISPN-15357
Force-pushed e3eeb88 to fe163b8
Thanks for reviewing @Crumby
Closes #2273