Unexpected higher latencies during rolling upgrade #8190

helenapoleri · 2023-03-03T13:46:24Z

Describe the bug
We have a setup with 3 Kafka nodes. We have a system A producing records to Kafka, which are consumed by a system B. System B does some processing and produces response records to be consumed by system A.

We are seeing unexpectedly high end-to-end latencies when performing a rolling upgrade with Strimzi. We are using the default configurations.

End-to-end latency during the rolling upgrade (measured at system A):

We have tried doing a manual restart of the Kafka pods (by killing the Kafka process PID (with a SIGTERM) for each pod and waiting for the latencies to stabilize) and we are not seeing the same behaviour.

End-to-end latency during the manual restarts (measured at system A):

While it might be expected that during a rolling upgrade we see a spike in latencies, we were not expecting to see such a big difference between the manual restarts and the rolling upgrade.

To Reproduce
We are reproducing by just triggering a rolling upgrade (with no changes).

Expected behavior
We are expecting at least to have similar latencies to when we perform the restarts manually, but we also don't know whether this is expected behaviour using Kafka.

Environment (please complete the following information):

Strimzi version: 0.31.1
Installation method: Helm chart (via Flux)
Kubernetes cluster: 1.23+
Infrastructure: Amazon EKS

YAML files and logs

Kafka cluster:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka
spec:
  kafka:
    version: 3.2.3
    replicas: 3
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.2"
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      kafkaContainer:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all
    resources:
      requests:
        cpu: 5000m
        memory: 14336Mi
      limits:
        cpu: 5000m
        memory: 14336Mi
    storage:
      type: jbod
      volumes:
      - id: 0
        type: persistent-claim
        size: 20Gi
        deleteClaim: false
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
  zookeeper:
    replicas: 3
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      zookeeperContainer:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all
    resources:
      requests:
        cpu: 500m
        memory: 500Mi
      limits:
        cpu: 500m
        memory: 500Mi
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  kafkaExporter:
    groupRegex: ".*"
    topicRegex: ".*"
    resources:
      requests:
        cpu: 200m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 100Mi
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      container:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all

Additional context

Kafka cluster has 3 nodes as you can see above. Our topic replication factor is 3 and minISR is 2. All of our topics only have one partition.
We are using Kafka default configurations for our producers. For our use case we do not need auto commits, so we have disabled that for our consumers. Otherwise, we are using the default configurations for our consumers, too.
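
To make the replication settings above concrete, here is a minimal sketch (plain Python, no Kafka client involved) of the availability arithmetic for a single partition with replication factor 3 and min.insync.replicas 2 when the producer uses acks=all. The function name is hypothetical, and the sketch assumes every replica on a live broker is in sync, which is a simplification.

```python
def can_accept_writes(replication_factor: int, min_insync_replicas: int, brokers_down: int) -> bool:
    """Return True if an acks=all produce request can succeed.

    Simplifying assumption: every replica on a live broker is in sync,
    so the ISR size equals the number of live replicas.
    """
    in_sync = max(replication_factor - brokers_down, 0)
    return in_sync >= min_insync_replicas


# Walk through 0..3 brokers being down for the setup in this post.
for down in range(4):
    print(down, can_accept_writes(3, 2, down))
```

Under that assumption the partition stays writable with one broker down, and stops accepting acks=all writes as soon as a second replica is unavailable or out of sync.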

scholzj · 2023-03-03T13:50:17Z

scholzj
Mar 3, 2023
Maintainer

I don't think anybody can discuss this without more details:

* Your end-to-end latency has no meaning without an understanding of what it means and how you measure it.
* You need to provide the configurations ... for topics, for producers, for consumers.
* You need to provide the state of the environment: what is in-sync, what is not in-sync, etc.

Without that, there is not much to discuss. Every broker restart creates disruption: leaderships change, client reconnects happen, replicas have to catch up. So there will always be a visible impact.

4 replies

helenapoleri Mar 3, 2023
Author

I don't think anybody can discuss this without more details:
* Your end-to-end latency has no meaning without an understanding of what it means and how you measure it.

Here are some updated visualizations for the end-to-end latency, since the ones in the original post were taking into account some irrelevant logic.

This is when doing the rolling upgrade (in seconds):

And when performing the restarts manually (in seconds):

The end-to-end-latency includes:

The time that the producer in system A takes to produce records for a topic X;
The time that the system B consumer takes to consume those records from topic X;
The time that system B takes to process those records;
The time that the producer in system B takes to produce records for a topic Y;
The time that the system A consumer takes to consume those records from topic Y.
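
The five components above can be written as a sum of per-stage durations. This is a minimal sketch, assuming a timestamp is recorded at every hand-off; the class and field names are hypothetical and not taken from either system.

```python
from dataclasses import dataclass


@dataclass
class RoundTrip:
    # All timestamps in seconds; field names are hypothetical.
    a_send_start: float   # system A calls send() for topic X
    a_send_acked: float   # the record on topic X is acknowledged
    b_received: float     # system B's consumer receives the record
    b_processed: float    # system B finishes processing
    b_send_acked: float   # the response on topic Y is acknowledged
    a_received: float     # system A's consumer receives the response

    def stages(self) -> dict:
        """Duration of each of the five stages listed above."""
        return {
            "A produce to X": self.a_send_acked - self.a_send_start,
            "B consume from X": self.b_received - self.a_send_acked,
            "B processing": self.b_processed - self.b_received,
            "B produce to Y": self.b_send_acked - self.b_processed,
            "A consume from Y": self.a_received - self.b_send_acked,
        }

    def end_to_end(self) -> float:
        """Total latency as seen from system A."""
        return self.a_received - self.a_send_start
```

Breaking the total down this way shows which stage grows during a roll. Note that stages spanning two hosts are only as accurate as the clock synchronisation between them.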

I understand that this end-to-end latency also includes the time system B takes to do its processing and produce the response for system A. However, that should not be relevant to the question of why there is such a discrepancy in latencies between a rolling upgrade performed by Strimzi and manual restarts.

We also have some extra visualizations for another metric that measures just the latency for system A to produce the records to topic X.

This is when doing the rolling upgrade (in seconds):

And when performing the restarts manually (in seconds):

* You need to provide the configurations ... for topics, for producers, for consumers

We have two topics. The first one (I9jK2aHZI5MXmWCMi4IjlFkoksz7SZg1FJl0ZhDqhIOjF1tpo3vFKgvkvZ6ha60M.input.in) is the one used by system A to produce requests for system B and the second one (I9jK2aHZI5MXmWCMi4IjlFkoksz7SZg1FJl0ZhDqhIOjF1tpo3vFKgvkvZ6ha60M.out) is the one used by system B to produce responses to system A.

I9jK2aHZI5MXmWCMi4IjlFkoksz7SZg1FJl0ZhDqhIOjF1tpo3vFKgvkvZ6ha60M.input.in:

  compression.type=producer sensitive=false synonyms={DEFAULT_CONFIG:compression.type=producer}
  leader.replication.throttled.replicas= sensitive=false synonyms={}
  message.downconversion.enable=true sensitive=false synonyms={DEFAULT_CONFIG:log.message.downconversion.enable=true}
  min.insync.replicas=2 sensitive=false synonyms={STATIC_BROKER_CONFIG:min.insync.replicas=2, DEFAULT_CONFIG:min.insync.replicas=1}
  segment.jitter.ms=0 sensitive=false synonyms={}
  cleanup.policy=delete sensitive=false synonyms={DEFAULT_CONFIG:log.cleanup.policy=delete}
  flush.ms=9223372036854775807 sensitive=false synonyms={}
  follower.replication.throttled.replicas= sensitive=false synonyms={}
  segment.bytes=1073741824 sensitive=false synonyms={DEFAULT_CONFIG:log.segment.bytes=1073741824}
  retention.ms=172800000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:retention.ms=172800000}
  flush.messages=9223372036854775807 sensitive=false synonyms={DEFAULT_CONFIG:log.flush.interval.messages=9223372036854775807}
  message.format.version=3.0-IV1 sensitive=false synonyms={STATIC_BROKER_CONFIG:log.message.format.version=3.2, DEFAULT_CONFIG:log.message.format.version=3.0-IV1}
  max.compaction.lag.ms=9223372036854775807 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.max.compaction.lag.ms=9223372036854775807}
  file.delete.delay.ms=60000 sensitive=false synonyms={DEFAULT_CONFIG:log.segment.delete.delay.ms=60000}
  max.message.bytes=1048588 sensitive=false synonyms={DEFAULT_CONFIG:message.max.bytes=1048588}
  min.compaction.lag.ms=0 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.min.compaction.lag.ms=0}
  message.timestamp.type=CreateTime sensitive=false synonyms={DEFAULT_CONFIG:log.message.timestamp.type=CreateTime}
  preallocate=false sensitive=false synonyms={DEFAULT_CONFIG:log.preallocate=false}
  min.cleanable.dirty.ratio=0.5 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.min.cleanable.ratio=0.5}
  index.interval.bytes=4096 sensitive=false synonyms={DEFAULT_CONFIG:log.index.interval.bytes=4096}
  unclean.leader.election.enable=false sensitive=false synonyms={DEFAULT_CONFIG:unclean.leader.election.enable=false}
  retention.bytes=-1 sensitive=false synonyms={DEFAULT_CONFIG:log.retention.bytes=-1}
  delete.retention.ms=86400000 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.delete.retention.ms=86400000}
  segment.ms=604800000 sensitive=false synonyms={}
  message.timestamp.difference.max.ms=9223372036854775807 sensitive=false synonyms={DEFAULT_CONFIG:log.message.timestamp.difference.max.ms=9223372036854775807}
  segment.index.bytes=10485760 sensitive=false synonyms={DEFAULT_CONFIG:log.index.size.max.bytes=10485760}

I9jK2aHZI5MXmWCMi4IjlFkoksz7SZg1FJl0ZhDqhIOjF1tpo3vFKgvkvZ6ha60M.out:

  compression.type=producer sensitive=false synonyms={DEFAULT_CONFIG:compression.type=producer}
  leader.replication.throttled.replicas= sensitive=false synonyms={}
  message.downconversion.enable=true sensitive=false synonyms={DEFAULT_CONFIG:log.message.downconversion.enable=true}
  min.insync.replicas=2 sensitive=false synonyms={STATIC_BROKER_CONFIG:min.insync.replicas=2, DEFAULT_CONFIG:min.insync.replicas=1}
  segment.jitter.ms=0 sensitive=false synonyms={}
  cleanup.policy=delete sensitive=false synonyms={DEFAULT_CONFIG:log.cleanup.policy=delete}
  flush.ms=9223372036854775807 sensitive=false synonyms={}
  follower.replication.throttled.replicas= sensitive=false synonyms={}
  segment.bytes=1073741824 sensitive=false synonyms={DEFAULT_CONFIG:log.segment.bytes=1073741824}
  retention.ms=600000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:retention.ms=600000}
  flush.messages=9223372036854775807 sensitive=false synonyms={DEFAULT_CONFIG:log.flush.interval.messages=9223372036854775807}
  message.format.version=3.0-IV1 sensitive=false synonyms={STATIC_BROKER_CONFIG:log.message.format.version=3.2, DEFAULT_CONFIG:log.message.format.version=3.0-IV1}
  max.compaction.lag.ms=9223372036854775807 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.max.compaction.lag.ms=9223372036854775807}
  file.delete.delay.ms=60000 sensitive=false synonyms={DEFAULT_CONFIG:log.segment.delete.delay.ms=60000}
  max.message.bytes=1048588 sensitive=false synonyms={DEFAULT_CONFIG:message.max.bytes=1048588}
  min.compaction.lag.ms=0 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.min.compaction.lag.ms=0}
  message.timestamp.type=CreateTime sensitive=false synonyms={DEFAULT_CONFIG:log.message.timestamp.type=CreateTime}
  preallocate=false sensitive=false synonyms={DEFAULT_CONFIG:log.preallocate=false}
  min.cleanable.dirty.ratio=0.5 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.min.cleanable.ratio=0.5}
  index.interval.bytes=4096 sensitive=false synonyms={DEFAULT_CONFIG:log.index.interval.bytes=4096}
  unclean.leader.election.enable=false sensitive=false synonyms={DEFAULT_CONFIG:unclean.leader.election.enable=false}
  retention.bytes=-1 sensitive=false synonyms={DEFAULT_CONFIG:log.retention.bytes=-1}
  delete.retention.ms=86400000 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.delete.retention.ms=86400000}
  segment.ms=604800000 sensitive=false synonyms={}
  message.timestamp.difference.max.ms=9223372036854775807 sensitive=false synonyms={DEFAULT_CONFIG:log.message.timestamp.difference.max.ms=9223372036854775807}
  segment.index.bytes=10485760 sensitive=false synonyms={DEFAULT_CONFIG:log.index.size.max.bytes=10485760}

We have only one producer and one consumer on each side. System A has the producer for the first topic and consumer for the second topic. System B has the consumer for the first topic and producer for the second topic. As I said above, we are using the default configurations for both the consumers and producers, except that we have auto commit disabled for both the consumers. We are not logging the configs on system A, but they are similar to System B producer and consumer.

System B consumer:


allow.auto.create.topics = false
auto.commit.interval.ms = 5000
auto.include.jmx.reporter = true
auto.offset.reset = latest
bootstrap.servers = [kafka-kafka-bootstrap.default:9092]
check.crcs = true
client.dns.lookup = use_all_dns_ips
client.id = consumer-job-I9jK2aHZI5MXmWCMi4IjlFkoksz7SZg1FJl0ZhDqhIOjF1tpo3vFKgvkvZ6ha60M-replica-0-consumer-5
client.rack = 
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = job-I9jK2aHZI5MXmWCMi4IjlFkoksz7SZg1FJl0ZhDqhIOjF1tpo3vFKgvkvZ6ha60M-replica-0-consumer
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
internal.throw.on.fetch.stable.offset.unsupported = false
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor, class org.apache.kafka.clients.consumer.CooperativeStickyAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.connect.timeout.ms = null
sasl.login.read.timeout.ms = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.login.retry.backoff.max.ms = 10000
sasl.login.retry.backoff.ms = 100
sasl.mechanism = GSSAPI
sasl.oauthbearer.clock.skew.seconds = 30
sasl.oauthbearer.expected.audience = null
sasl.oauthbearer.expected.issuer = null
sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
sasl.oauthbearer.jwks.endpoint.url = null
sasl.oauthbearer.scope.claim.name = scope
sasl.oauthbearer.sub.claim.name = sub
sasl.oauthbearer.token.endpoint.url = null
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 45000
socket.connection.setup.timeout.max.ms = 30000
socket.connection.setup.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.certificate.chain = null
ssl.keystore.key = null
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.certificates = null
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class com.server.api.serializers.kafka.EventDeserializer

System B producer:

acks = -1
auto.include.jmx.reporter = true
batch.size = 16384
bootstrap.servers = [kafka-kafka-bootstrap.default:9092]
buffer.memory = 33554432
client.dns.lookup = use_all_dns_ips
client.id = producer-5
compression.type = none
connections.max.idle.ms = 540000
delivery.timeout.ms = 120000
enable.idempotence = true
interceptor.classes = []
key.serializer = class org.apache.kafka.common.serialization.StringSerializer
linger.ms = 0
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metadata.max.idle.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.adaptive.partitioning.enable = true
partitioner.availability.timeout.ms = 0
partitioner.class = null
partitioner.ignore.keys = false
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 2147483647
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.connect.timeout.ms = null
sasl.login.read.timeout.ms = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.login.retry.backoff.max.ms = 10000
sasl.login.retry.backoff.ms = 100
sasl.mechanism = GSSAPI
sasl.oauthbearer.clock.skew.seconds = 30
sasl.oauthbearer.expected.audience = null
sasl.oauthbearer.expected.issuer = null
sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
sasl.oauthbearer.jwks.endpoint.url = null
sasl.oauthbearer.scope.claim.name = scope
sasl.oauthbearer.sub.claim.name = sub
sasl.oauthbearer.token.endpoint.url = null
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
socket.connection.setup.timeout.max.ms = 30000
socket.connection.setup.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.certificate.chain = null
ssl.keystore.key = null
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.certificates = null
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = null
value.serializer = class com.server.api.serializers.kafka.ProfileStatesSerializer
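
A few of the producer settings in the dump above bound how long a single send can stall while a partition leader is unavailable. The sketch below parses `key = value` lines in the format shown and extracts them. The helper names are hypothetical, and the sketch only reads the numbers; it does not claim they explain the observed latencies.

```python
def parse_client_config(dump: str) -> dict:
    """Parse 'key = value' lines as logged by Kafka clients at startup."""
    config = {}
    for line in dump.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            config[key.strip()] = value.strip()
    return config


def send_stall_bounds(config: dict) -> dict:
    """Extract the timeouts that cap how long one record can be in flight."""
    request_timeout = int(config["request.timeout.ms"])
    delivery_timeout = int(config["delivery.timeout.ms"])
    return {
        "request_timeout_s": request_timeout / 1000,
        "delivery_timeout_s": delivery_timeout / 1000,
        # Rough count that ignores retry.backoff.ms between attempts.
        "max_full_request_timeouts": delivery_timeout // request_timeout,
    }


# Excerpt of the System B producer dump above.
sample = """\
acks = -1
delivery.timeout.ms = 120000
linger.ms = 0
request.timeout.ms = 30000
retry.backoff.ms = 100
"""
print(send_stall_bounds(parse_client_config(sample)))
```

With these values, one unanswered request can hold a record for up to 30 seconds, and the producer reports failure at the latest 120 seconds after `send()` returns.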

* You need to provide state of the environment, what is in-sync, what is not in-sync etc.

Before the rolling upgrade:

After the rolling upgrade:

Without that, there is not much to discuss. Every broker restart creates disruption: leaderships change, client reconnects happen, replicas have to catch up. So there will always be a visible impact.

We understand that is the case, but we have performed multiple tests with both approaches and we have always observed a great latency discrepancy (minutes, when doing the rolling upgrade vs seconds, when doing the restart). So the impact might be expected, but it still does not explain the big difference when using the different approaches.

scholzj Mar 3, 2023
Maintainer

TBH, all that Strimzi does is check the partitions to be in-sync and delete the pod. So the only difference would be in which broker you restart when. The impact of restarting the leader differs from restarting a follower - but if all you have is 2 topics with a single partition each, that is a pretty non-standard setup.
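
The behaviour described here (wait until partitions are in sync, then delete the pod) can be sketched as a simple loop. This is only an illustration of that description, not Strimzi's actual code; `all_in_sync` and `delete_pod` are hypothetical callables that the caller supplies.

```python
import time


def roll(broker_ids, all_in_sync, delete_pod, poll_interval_s=1.0, sleep=time.sleep):
    """Restart brokers one at a time, waiting for in-sync state before each.

    all_in_sync: hypothetical callable returning True when no partition
                 is under-replicated.
    delete_pod:  hypothetical callable that deletes the pod for a broker ID.
    """
    for broker in broker_ids:
        # Do not take a broker down while replicas are still catching up.
        while not all_in_sync():
            sleep(poll_interval_s)
        delete_pod(broker)
```

Passing the checks and the delete action in as callables keeps the loop testable without a cluster; a manual restart procedure can be compared against the same loop by swapping in a different `delete_pod`.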

We have tried doing a manual restart of the Kafka pods (by killing the Kafka process PID (with a SIGTERM) for each pod and waiting for the latencies to stabilize) and we are not seeing the same behaviour.

If this means you exec into the pod and kill it from the command line, that is quite nasty. But it would make the difference in that the Pod is not recreated. That said, it is not something the operator should do.

helenapoleri Mar 7, 2023
Author

TBH, all that Strimzi does is check the partitions to be in-sync and delete the pod. So the only difference would be in which broker you restart when. The impact of restarting the leader differs from restarting a follower - but if all you have is 2 topics with a single partition each, that is a pretty non-standard setup.

For the manual tests, we were careful to restart the controller pod last (which I believe is the recommended way of doing rolling restarts). We also made sure that we were performing graceful shutdowns (by checking the Kafka logs). That said, should the impact of restarting the leader still differ from restarting a follower?

If this means you exec into the pod and kill it from the command line, that is quite nasty. But it would make the difference in that the Pod is not recreated. That said, it is not something the operator should do.

We know that this is not the ideal scenario (nor what the operator should do); it was just our way of checking whether we could take Strimzi out of the equation. After some digging and looking at the readiness probe logic (which to my understanding only waits for the broker to be in the RUNNING state), we think our problem might be related to issue #3749.

scholzj Mar 7, 2023
Maintainer

For the manual tests, we were careful to restart the controller pod last (which I believe is the recommended way of doing rolling restarts). We also made sure that we were performing graceful shutdowns (by checking the Kafka logs). That said, should the impact of restarting the leader still differ from restarting a follower?

The operator also restarts the controller last. So that is not different. Restarting a partition leader will always differ from restarting a follower, as the producers are connected to the leader. So they will be disconnected, will need to find the new leader, connect to it and start sending messages. But the general expectation is that you will have hundreds or thousands of partitions on each node. So unlike the controller, which is per cluster, you cannot optimize the rolling restart by rolling the partition leader last.
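
The ordering argument above can be illustrated with two small functions. This is a sketch of the rule as described in this comment, not the operator's implementation; the function names, broker IDs and topic names are hypothetical.

```python
def rolling_order(broker_ids, controller_id):
    """Return broker IDs in restart order, with the controller last."""
    others = sorted(b for b in broker_ids if b != controller_id)
    return others + [controller_id]


def brokers_without_leadership(broker_ids, partition_leaders):
    """Brokers that lead no partition and can restart without moving a leader."""
    leading = set(partition_leaders.values())
    return sorted(b for b in broker_ids if b not in leading)


print(rolling_order([0, 1, 2], controller_id=1))
# Two single-partition topics: at most two brokers lead anything.
print(brokers_without_leadership([0, 1, 2], {"X": 0, "Y": 0}))
# Many partitions spread over the cluster: every broker is a leader.
print(brokers_without_leadership([0, 1, 2], {"p0": 0, "p1": 1, "p2": 2}))
```

There is exactly one controller, so "controller last" is always well defined. Once leadership is spread across all brokers, no broker is free of leaders, so there is no equivalent "leader last" order.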

We know that this is not the ideal scenario (nor what the operator should do); it was just our way of checking whether we could take Strimzi out of the equation. After some digging and looking at the readiness probe logic (which to my understanding only waits for the broker to be in the RUNNING state), we think our problem might be related to issue #3749.

Issue #3749 has nothing to do with this. It is just about when the pod shows as Ready, which has nothing to do with how you said you measure the latency. I think you should check whether the latency differs when you roll the pod properly with kubectl delete. The pod will be down for longer than when you just kill the process (but unlike killing the process, this actually rolls the pod). So this can IMHO make some difference.
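
For the suggested test, the commands could be generated as below. This is a sketch that only builds the command strings and does not run them. It assumes pods are named `<cluster>-kafka-<id>`, that the cluster is called `kafka` as in the YAML above, and that it runs in the `default` namespace (inferred from the bootstrap address); the function name and the broker order are hypothetical.

```python
import shlex


def delete_pod_commands(cluster, namespace, broker_ids):
    """Build (but do not run) one `kubectl delete pod` command per broker."""
    return [
        shlex.join(["kubectl", "delete", "pod", f"{cluster}-kafka-{b}", "-n", namespace])
        for b in broker_ids
    ]


# Example order with broker 1 as the controller, restarted last.
for cmd in delete_pod_commands("kafka", "default", [0, 2, 1]):
    print(cmd)
```

Running the commands one at a time, and waiting for latencies to stabilize between them as in the manual SIGTERM test, would make the two experiments comparable.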
