This repository was archived by the owner on Nov 29, 2024. It is now read-only.

Bug: Missing EXPECTED_CLUSTER_SIZE leads to massive load on brokers #221

@pantaoran

Description


We observed that when EXPECTED_CLUSTER_SIZE is not set (or is explicitly set to -1), measured produce latencies degrade severely.
It seems that before (or during?) every request, Canary tries to micro-manage the replicas and their leaders for the canary topic on the Kafka cluster. This takes a lot of time and processing, and results in extremely slow responses to the produce requests.

Average latencies as reported when EXPECTED_CLUSTER_SIZE is set correctly: 3-5ms
Average latencies as reported when EXPECTED_CLUSTER_SIZE is NOT set: 1000-2000ms

Whatever Canary does on the cluster in this mode slows everything down dramatically.
It also leads to an explosion in logs. With the correct setting, my otherwise idle brokers (2-broker cluster, no clients running except Canary) logged around 8 lines per minute. When the cluster size setting is missing, they logged around 500 lines per minute (the Canary reconcile interval was 10 seconds, the default).
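
To put the log numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes that all of the additional log lines are caused by the reconcile loop, which I have not verified; the function name is just for illustration.

```python
# Figures reported above.
RECONCILE_INTERVAL_S = 10            # Canary reconcile interval
LINES_PER_MIN_WITH_SETTING = 8       # EXPECTED_CLUSTER_SIZE set correctly
LINES_PER_MIN_WITHOUT_SETTING = 500  # EXPECTED_CLUSTER_SIZE missing


def extra_lines_per_reconcile(baseline: float, observed: float, interval_s: float) -> float:
    """Extra broker log lines per reconcile, assuming every extra line
    is caused by the reconcile loop (an unverified assumption)."""
    reconciles_per_min = 60 / interval_s
    return (observed - baseline) / reconciles_per_min


if __name__ == "__main__":
    extra = extra_lines_per_reconcile(
        LINES_PER_MIN_WITH_SETTING,
        LINES_PER_MIN_WITHOUT_SETTING,
        RECONCILE_INTERVAL_S,
    )
    print(f"~{extra:.0f} extra log lines per reconcile")
```

Under that assumption, each reconcile accounts for roughly 82 extra broker log lines.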

I don't know what Canary does in detail or why, but it feels like a bug to me.

The description in the README says that I should expect more partition reassignments of the topic while the Kafka cluster is starting up and the brokers are coming up one by one. What I actually observe is that partitions are reassigned on every reconciliation (every 10 seconds), leading to redundant work on the brokers, which causes high produce latencies and increased log volume.
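
To illustrate the behavior I would have expected, here is a minimal sketch of a reconcile step that only requests a reassignment when the current assignment differs from the desired one. This is not Canary's actual code (I have not read its implementation); `desired_assignment`, `reconcile`, and the `alter` callback are hypothetical names, and the round-robin layout is only an assumption for the example.

```python
from typing import Callable, Dict, List

# partition id -> ordered list of replica broker ids (first one is the preferred leader)
Assignment = Dict[int, List[int]]


def desired_assignment(broker_ids: List[int], partitions: int) -> Assignment:
    """Hypothetical layout: rotate the broker list so that leaders are spread out."""
    n = len(broker_ids)
    return {p: [broker_ids[(p + i) % n] for i in range(n)] for p in range(partitions)}


def reconcile(current: Assignment, desired: Assignment,
              alter: Callable[[Assignment], None]) -> Assignment:
    """Call alter() only for partitions whose replicas actually differ."""
    changed = {p: replicas for p, replicas in desired.items()
               if current.get(p) != replicas}
    if changed:
        alter(changed)  # stand-in for the real reassignment request to the cluster
    return changed


if __name__ == "__main__":
    requests: List[Assignment] = []
    desired = desired_assignment([0, 1], partitions=2)

    # First pass: the topic is not laid out as desired yet, so one request is sent.
    current: Assignment = {0: [0], 1: [0]}
    reconcile(current, desired, requests.append)
    current = dict(desired)

    # Later passes: nothing changed, so no request should reach the brokers.
    for _ in range(5):
        reconcile(current, desired, requests.append)

    print(f"reassignment requests sent: {len(requests)}")
```

With a guard like this, a stable 2-broker cluster would see one reassignment request in total instead of one every 10 seconds.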
