-
That the Pod is killed after the termination grace period is expected; that is how it works. Different nodes might take different amounts of time to shut down, which is ultimately why it is configurable. But I think a shutdown that takes 15 minutes indicates some kind of issue that you would need to investigate.

One of the things that can cause such long shutdowns is a shutdown that starts during recovery after an unclean shutdown, as Kafka will not shut down before completing the recovery. This can create a cycle in which one unclean shutdown causes more unclean shutdowns because of the recovery. Your logs do not suggest this is the case, as they do not contain any of the recovery messages, but they are also not complete, so I cannot say for sure that this is not the issue here.

Given that you use KRaft and Kafka 3.7.0, I would also suggest considering an upgrade, as KRaft in 3.7 still has many missing features and issues. There is always a chance this is some bug that might be fixed in a newer version.
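One way to look for the recovery symptom described above is to grep the broker startup logs for recovery messages. This is only a sketch: the pod and namespace names are examples, and the exact log wording varies across Kafka versions:

```sh
# After an unclean shutdown, Kafka logs recovery activity while starting up,
# e.g. lines about a missing clean shutdown file or segment recovery.
# Pod and namespace names below are examples.
kubectl logs my-cluster-kafka-0 -n kafka | grep -iE "recover|clean shutdown"
```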
-
I have a similar scenario here on GKE (Google Cloud Platform). I started with Strimzi 0.38 / Kafka 3.6 with ZooKeeper, and the brokers were taking a long time to restart (more than 30 minutes). This results in a very long rolling restart, even with `terminationGracePeriodSeconds` configured to 300 seconds (see the sketch below). I also noticed that while one broker was still starting (not ready), another broker was already shutting down, which in my case can lead to topic creation failures. Any initiatives to resolve this scenario?

Environment:
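For reference, the grace period here is set through the pod template in the Kafka custom resource. This is a minimal sketch, and the cluster name is a placeholder:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster          # placeholder name
spec:
  kafka:
    template:
      pod:
        # How long Kubernetes waits for a graceful broker shutdown
        # before sending SIGKILL (the default is 30 seconds).
        terminationGracePeriodSeconds: 300
```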
-
Hi!
I found an unexpected behavior, and it is not clear to me whether it is Strimzi or Kafka related. I'm testing graceful shutdowns of Kafka brokers, as they were taking too long to complete in my Kafka deployment (~15 minutes per broker). After some investigation I found that the brokers' graceful shutdown was being forcefully aborted by K8s, as it was taking longer than the default `terminationGracePeriodSeconds`. I increased `terminationGracePeriodSeconds` to 5 minutes and saw that a graceful broker restart dropped to 2-3 minutes (using `kubectl delete <broker>`).

Now, I did a rolling update of the Kafka brokers, but I'm still seeing each broker take ~15 minutes to restart. In this case, I've tested triggering rolling updates both by changing some Kafka broker configurations and by adding the `strimzi.io/manual-rolling-update` annotation.
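For completeness, this is roughly how I trigger the manual roll; the resource and namespace names are examples, and depending on the Strimzi version the annotation goes on the StrimziPodSet or the StatefulSet:

```sh
# Ask the Cluster Operator to roll all brokers in this set
# on its next reconciliation.
kubectl annotate strimzipodset my-cluster-kafka \
  strimzi.io/manual-rolling-update=true -n kafka
```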
Any idea why a Kafka broker restart via `kubectl delete` takes 2-3 minutes, whereas a restart managed by Strimzi still takes ~15 minutes?

Some details about my setup (let me know if more information is needed):
Kafka brokers have the following configuration overrides (everything else is the default from Strimzi):
Observations of what happens when Kafka brokers take 15 minutes to restart during a rolling restart managed by Strimzi: the brokers exceed `terminationGracePeriodSeconds` and get a SIGKILL, which explains the longer recovery time during start.
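In case it helps, a quick way to confirm the SIGKILL is to check the container's last terminated state; the pod and namespace names are examples:

```sh
# Exit code 137 (= 128 + SIGKILL) means the broker container was killed
# after exceeding terminationGracePeriodSeconds; 0 means a clean exit.
kubectl get pod my-cluster-kafka-0 -n kafka \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

Thanks.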