
Prevent graceful shutdown on split brain #2277


Conversation

ryanemerson
Contributor

Closes #2273


openshift-ci bot commented May 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch from 0ca720f to 81914bb on May 19, 2025 09:45
ryanemerson marked this pull request as ready for review on May 19, 2025 10:49
ryanemerson requested a review from Crumby on May 19, 2025 10:49
ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch 2 times, most recently from 69899e3 to 2c38db6, on May 19, 2025 16:48

health, err := podMeta.client.Container().HealthStatus()
if err != nil {
ctx.Requeue(fmt.Errorf("unable to retrieve cluster health status for pod '%s': %w", pod.Name, err))
Member


In this case, is health the new endpoint that only verifies whether the process is alive?

Contributor Author

ryanemerson commented May 19, 2025


It's currently calling /health/status and using the cluster_health field which is determined by https://github.com/infinispan/infinispan/blob/3caa2cdf0dc191603067ab32353f072a966125a7/core/src/main/java/org/infinispan/health/impl/ClusterHealthImpl.java#L33-L41.

I think this should keep using this endpoint, as that way we ensure we don't proceed with an upgrade if there are one or more DEGRADED caches.

So maybe we shouldn't deprecate container/health/status after all?
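
For reference, the check discussed here can be reproduced against a running server with a plain REST call. The sketch below is illustrative only and is not the operator's client code: the URL, the missing authentication and the JSON field names are assumptions based on the Infinispan REST documentation for the container health endpoint.

// Minimal sketch: query the container health endpoint and inspect the overall
// cluster health. Endpoint path, port and JSON field names are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type clusterHealth struct {
	HealthStatus string   `json:"health_status"`
	NodeNames    []string `json:"node_names"`
}

type containerHealth struct {
	ClusterHealth clusterHealth `json:"cluster_health"`
}

func main() {
	// Hypothetical local endpoint; a real deployment needs credentials and TLS.
	resp, err := http.Get("http://localhost:11222/rest/v2/container/health")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var h containerHealth
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		panic(err)
	}

	// The cluster health is reported as DEGRADED as soon as any cache is
	// DEGRADED, so a single check is enough to block a GracefulShutdown.
	if h.ClusterHealth.HealthStatus != "HEALTHY" {
		fmt.Printf("refusing to shut down: cluster health is %s\n", h.ClusterHealth.HealthStatus)
		return
	}
	fmt.Println("cluster is HEALTHY, safe to proceed")
}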

@ryanemerson
Contributor Author

@Crumby Setting the cache to AVAILABLE explicitly seems to be working, from https://github.com/infinispan/infinispan-operator/actions/runs/15118659859/job/42495774879?pr=2277:

2025-05-19 17:14:03	INFO	upgrade/common.go:283	Setting org.infinispan.LOCKS cache to AVAILABLE for Operand Upstream=14.0.9, Downstream=<nil>, Image=quay.io/infinispan/server:14.0.9.Final, CVE=false, Deprecated=false

I realised that I was passing the wrong operand though, which is fixed in the latest commit.

ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch from 6fa42fd to 3021317 on May 19, 2025 20:00
@@ -141,18 +142,80 @@ func GracefulShutdown(i *ispnv1.Infinispan, ctx pipeline.Context) {
return
}

var rebalanceDisabled bool
type PodMeta struct {
Collaborator


Should we name it ServerMeta? I think PodMeta is often used in relation to k8s Pod metadata, so I'd expect that kind of information here. It may confuse us in the future.

ryanemerson changed the title from "2273/prevent graceful shutdown on split brain" to "Prevent graceful shutdown on split brain" on May 20, 2025
// If any of the caches are not marked as HEALTHY we must prevent a GracefulShutdown to prevent
// the cluster from entering an unexpected state
if health != ispnApi.HealthStatusHealthy {
ctx.Requeue(fmt.Errorf("unable to proceed with GracefulShutdown as the cluster health is '%s'", health))
Collaborator


  1. Can we point out which caches are causing issues here, so the user knows where to look?
  2. I think we should propagate the state to the Infinispan Status and generate warning events for the users to see. The user may not have access to Operator logs. (The same applies to the ClusterSize status)

Contributor Author


I have updated the code to fire a warning event when the upgrade is blocked.

Can we point out which caches are causing issues here, so the user knows where to look?

That should be possible if we retrieve the individual cache status from the /rest/v2/container/health endpoint.

2. I think we should propagate the state to the Infinispan Status and generate warning events for the users to see. The user may not have access to Operator logs. (The same applies to the ClusterSize status)

I believe that's beyond the scope of this PR; I have created #2282 to track it.
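
As a follow-up to the first point, listing the offending caches could look roughly like the sketch below. This is only an illustration: the cache_health field name and the per-entry fields are assumptions taken from the Infinispan REST documentation for /rest/v2/container/health, not from the operator's client API.

// Sketch: collect the names of all caches that are not HEALTHY so that the
// requeue error (and the warning event) can tell the user where to look.
// Field names are assumptions based on the container health payload.
package health

type CacheHealth struct {
	Status    string `json:"status"`
	CacheName string `json:"cache_name"`
}

type ContainerHealth struct {
	CacheHealth []CacheHealth `json:"cache_health"`
}

func UnhealthyCaches(h ContainerHealth) []string {
	var names []string
	for _, c := range h.CacheHealth {
		if c.Status != "HEALTHY" {
			names = append(names, c.CacheName)
		}
	}
	return names
}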

@Crumby
Collaborator

Crumby commented May 20, 2025

@ryanemerson Tests are completely missing here; I wonder whether we could create two:

  • One for split-brain scenario by introducing NetworkPolicy
  • One for degraded cache. This can be harder to achieve, but some time ago it was possible to introduce a degraded cache to the cluster by creating a cache with an invalid configuration

ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch 2 times, most recently from 8c67f22 to b98c3b8, on May 20, 2025 10:44
@ryanemerson
Contributor Author

  • One for split-brain scenario by introducing NetworkPolicy

  • One for degraded cache. This can be harder to achieve, but some time ago it was possible to introduce a degraded cache to the cluster by creating a cache with an invalid configuration

Both of these would be good to have. The problem is that, until we have #2282, there's no straightforward way to determine the Operator state in the tests, as we execute the operator locally in its own goroutine.

This can be harder to achieve, but some time ago it was possible to introduce a degraded cache to the cluster by creating a cache with an invalid configuration

It's possible to manually set a cache to DEGRADED via the REST API: https://infinispan.org/docs/stable/titles/rest/rest.html#rest_v2_caches_set_availability
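
For a test, forcing a cache into DEGRADED mode via that REST operation could be scripted roughly as below. This is a sketch only: the server address, the cache name and the lack of authentication are placeholders, and the query parameters follow the linked documentation.

// Sketch: flip a cache to DEGRADED through the REST set-availability action.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	cache := "test-cache" // hypothetical cache name
	url := fmt.Sprintf(
		"http://localhost:11222/rest/v2/caches/%s?action=set-availability&availability=DEGRADED",
		cache,
	)
	// The operation takes no request body; a real test would add credentials.
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("set-availability returned", resp.Status)
}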

ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch from b98c3b8 to 8f88b29 on May 20, 2025 11:07
@ryanemerson
Contributor Author

Writing the last comment made me re-think the problem 🙂

We already have the Stopping condition, which is meant to reflect the status of a cluster scale-down. So I have made it so that this condition stays False, with an appropriate error message, if the cluster can't scale down due to either a split-brain or a DEGRADED cache.

I have also added a test for the latter case.
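
The idea described above, kept as a generic sketch: hold a Stopping-style condition at False with an explanatory message while shutdown is blocked. The operator defines its own condition types on the Infinispan CR, so the metav1.Condition helpers below are used purely for illustration; the condition type and reason strings are assumptions.

// Sketch: record why the cluster cannot scale down yet (split-brain or a
// DEGRADED cache) so the reason is visible in the resource status.
package conditions

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func MarkShutdownBlocked(conds *[]metav1.Condition, health string) {
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:    "Stopping",
		Status:  metav1.ConditionFalse,
		Reason:  "GracefulShutdownBlocked",
		Message: fmt.Sprintf("unable to proceed with GracefulShutdown as the cluster health is '%s'", health),
	})
}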

ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch from 8f88b29 to fcc02ff on May 20, 2025 11:38
continue
}
ctx.Requeue(fmt.Errorf("unable to create Infinispan client for cluster being upgraded: %w", err))
ctx.Requeue(fmt.Errorf("unable to create Infinispan client to determine if split-brain is present: %w", err))
Collaborator


Should the message here be more generic? As far as I understand the code, we are creating the client to retrieve various information, not just the split-brain status.

Contributor Author


Updated to "unable to create Infinispan client to determine if cluster is healthy: %w"

ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch from fcc02ff to e3eeb88 on May 20, 2025 13:57
ryanemerson force-pushed the 2273/prevent_graceful_shutdown_on_split_brain branch from e3eeb88 to fe163b8 on May 20, 2025 14:02
ryanemerson merged commit b8d84ca into infinispan:main on May 20, 2025
10 checks passed
@ryanemerson
Contributor Author

Thanks for reviewing @Crumby

Development

Successfully merging this pull request may close these issues.

Prevent a GracefulShutdown if view is != #replicas or any cache is not in a HEALTHY state
3 participants