Replies: 2 comments
-
If this were the case, it would show up only on startup. If it happens while the cluster is running, it sounds more like a networking issue, judging from this part of the log?
-
Ah, it turns out my Zookeeper pods were just running out of memory. Easy fix!
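For anyone hitting the same thing, the fix was along these lines in the Kafka custom resource; the values below are illustrative, not the exact ones we used:

```yaml
# Fragment of the Kafka custom resource (kafka.strimzi.io/v1beta2);
# only the zookeeper section is shown, values are illustrative.
spec:
  zookeeper:
    replicas: 3
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 1Gi        # raise this if the pods are being OOMKilled
    jvmOptions:
      "-Xms": 512m         # keep the heap comfortably below the container limit
      "-Xmx": 512m
```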
-
We are evaluating Strimzi for running Kafka. Overall it is looking good, but we have hit a critical issue that we have not been able to figure out. Over the lifetime of each of our clusters, Zookeeper periodically gets into a state where every pod is in CrashLoopBackOff. Eventually the pods start running again, but they inevitably end up back in this state.
The logs are fairly consistent and seem to show the Zookeeper pods being unable to connect to each other. This ends in a caught NullPointerException before the process exits with exit code 0. My best guess is that the liveness checks are killing pods before all of them are up and able to communicate. I am working on testing this, but I won't know the outcome until it has been running long enough for the issue to recur.
We are running Strimzi 0.28.0 with Kafka 3.1.0 on Kubernetes 1.21.2.
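For reference, this is roughly the change I am testing: relaxing the Zookeeper probe timings in the Kafka custom resource. The values are illustrative, not a recommendation:

```yaml
# Fragment of the Kafka custom resource (kafka.strimzi.io/v1beta2);
# only the zookeeper section is shown, values are illustrative.
spec:
  zookeeper:
    replicas: 3
    livenessProbe:
      initialDelaySeconds: 60   # raised from the default to give the quorum time to form
      timeoutSeconds: 10
      failureThreshold: 6
    readinessProbe:
      initialDelaySeconds: 60
      timeoutSeconds: 10
```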
Here are the relevant logs, with stack traces trimmed out after the first line.