Strimzi cruise control metric #11728

kspmmitt · 2025-08-08T07:19:00Z

kspmmitt
Aug 8, 2025

Hello,

I am using the Strimzi Kafka operator with Cruise Control.

I have questions about the two metrics below. Could you please explain them?

kafka_cruisecontrol_KafkaCruiseControlServlet_REBALANCE_request_rate_Count - this metric shows a much higher value than the number of rebalances actually performed. I am doing manual rebalancing.
kafka_cruisecontrol_AnomalyDetector_balancedness_score_Value - self-healing is disabled and no anomaly detector goals are set, yet this metric always shows a value of 100.

Cruise Control logs:

AnomalyDetectorState: {selfHealingEnabled:[], selfHealingDisabled:[DISK_FAILURE, BROKER_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT], selfHealingEnabledRatio:{DISK_FAILURE=0.0, BROKER_FAILURE=0.0, GOAL_VIOLATION=0.0, METRIC_ANOMALY=0.0, TOPIC_ANOMALY=0.0, MAINTENANCE_EVENT=0.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], recentTopicAnomalies:[], recentMaintenanceEvents:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=0.00 milliseconds}, ongoingSelfHealingAnomaly:None, balancednessScore:100.000}


kyguy · 2025-08-08T15:11:31Z

kyguy
Aug 8, 2025
Collaborator

kafka_cruisecontrol_KafkaCruiseControlServlet_REBALANCE_request_rate_Count - this metric shows a much higher value than the number of rebalances actually performed. I am doing manual rebalancing.

This is the number of times the REBALANCE endpoint is requested. It is not only incremented when a rebalance is executed but also when the proposal, status, or result of that rebalance is being checked. So even if a single KafkaRebalance resource was created for a partition rebalance, the Strimzi Operator will hit this endpoint several times throughout the lifecycle of the rebalancing process.
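As a rough illustration of why the count runs ahead of the number of executed rebalances, here is a toy model in Python. It is not Cruise Control or Strimzi code: the class, the use of a `dryrun` flag to tell proposal requests from execution requests, and the number of polls per rebalance are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class RebalanceEndpointModel:
    """Toy stand-in for the REBALANCE endpoint (hypothetical, for illustration)."""
    request_count: int = 0  # models ..._REBALANCE_request_rate_Count
    executions: int = 0     # rebalances that were actually started

    def handle(self, dryrun: bool) -> None:
        # Every request is counted, whatever its purpose.
        self.request_count += 1
        if not dryrun:
            self.executions += 1


def simulate_one_rebalance(endpoint: RebalanceEndpointModel, proposal_polls: int) -> None:
    # Assumed lifecycle: several requests that only ask for or check the
    # proposal, followed by one request that starts the execution.
    for _ in range(proposal_polls):
        endpoint.handle(dryrun=True)
    endpoint.handle(dryrun=False)


endpoint = RebalanceEndpointModel()
simulate_one_rebalance(endpoint, proposal_polls=4)
print(endpoint.request_count, endpoint.executions)  # 5 1
```

In this model a single rebalance produces five counted requests, so the request counter and the execution count diverge even with only one KafkaRebalance resource.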

kafka_cruisecontrol_AnomalyDetector_balancedness_score_Value - self-healing is disabled and no anomaly detector goals are set, yet this metric always shows a value of 100.

The default value is 100, and the score is only decreased by anomaly detection goals that are violated. Since no anomaly detection goals are listed, none can be violated, so the balancedness score stays at 100.
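A minimal Python sketch of that logic, assuming the score is simply the maximum minus a per-goal penalty for each violated goal. The penalty values below are made up for illustration, and the real weighting in Cruise Control may differ.

```python
MAX_BALANCEDNESS_SCORE = 100.0


def balancedness_score(penalty_by_goal: dict, violated_goals: set) -> float:
    """Start from the maximum and subtract the penalty of each violated goal.

    penalty_by_goal holds hypothetical penalties for the configured anomaly
    detection goals; goals that are not configured cannot lower the score.
    """
    penalty = sum(p for goal, p in penalty_by_goal.items() if goal in violated_goals)
    return MAX_BALANCEDNESS_SCORE - penalty


# No anomaly detection goals configured: nothing can be violated.
print(balancedness_score({}, set()))  # 100.0

# Hypothetical configuration with one violated goal.
penalties = {"RackAwareGoal": 40.0, "DiskCapacityGoal": 35.0, "ReplicaDistributionGoal": 25.0}
print(balancedness_score(penalties, {"ReplicaDistributionGoal"}))  # 75.0
```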

0 replies

kspmmitt · 2025-08-08T15:20:45Z

kspmmitt
Aug 8, 2025
Author

Then this metric cannot be used to monitor how frequently rebalances are happening. This will confuse users. Is there any other metric that is more suitable and gives a better view of the rebalancing rate?

7 replies

kspmmitt Aug 11, 2025
Author

Hello @ppatierno and @kyguy

I saw the example Grafana dashboards provided in the Strimzi repository - https://github.com/strimzi/strimzi-kafka-operator/blob/0.39.0/examples/metrics/grafana-dashboards/strimzi-cruise-control.json

It shows the same rebalance rate, but it looks like this is not the correct metric to use, as explained above?

Also, can I use the metric below to get some sense of the cluster load from Cruise Control's point of view? I am not able to judge its value (what is high and what is low?).

kafka_cruisecontrol_LoadMonitor_metadata_factor_Number

Also, is the metric below a boolean? Is it 1 if there is an unfixable proposal and 0 otherwise?

kafka_cruisecontrol_GoalOptimizer_has_unfixable_proposal_optimization_Value

kyguy Aug 11, 2025
Collaborator

It shows the same rebalance rate, but it looks like this is not the correct metric to use, as explained above?

That Grafana dashboard exposes native Cruise Control metrics. I assume the metric you are referring to is the "Rebalance Request Rate", which is the kafka_cruisecontrol_KafkaCruiseControlServlet_REBALANCE_request_rate_Count metric mentioned in the original post. The dashboard doesn't expose the Strimzi fields mentioned in this thread. To access those, use the method described in the docs here [1].

Also, can I use the metric below to get some sense of the cluster load from Cruise Control's point of view? I am not able to judge its value (what is high and what is low?).

For getting a sense of the cluster load, I would focus on the metrics provided by the /kafkacruisecontrol/load endpoint [2] instead of the Cruise Control "sensors" [3].
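For reference, a small Python sketch that builds the request URL for that endpoint. The host name is a placeholder, and the port, the scheme, and the `json=true` query parameter are assumptions about a typical deployment that should be checked against the REST API wiki [2] and your own setup. Depending on how the Cruise Control API is secured, a real request may also need TLS settings and credentials, which are left out here.

```python
from urllib.parse import urlencode, urlunsplit


def load_endpoint_url(host: str, port: int = 9090, scheme: str = "https") -> str:
    """Build the URL for the cluster load endpoint (host, port, scheme are assumptions)."""
    query = urlencode({"json": "true"})  # ask for a JSON response
    return urlunsplit((scheme, f"{host}:{port}", "/kafkacruisecontrol/load", query, ""))


# "my-cluster-cruise-control" is a placeholder service name.
print(load_endpoint_url("my-cluster-cruise-control"))
# https://my-cluster-cruise-control:9090/kafkacruisecontrol/load?json=true
```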

kafka_cruisecontrol_LoadMonitor_metadata_factor_Number

This is to help understand the impact of the current replica distribution on metadata growth/coordination between brokers.

kafka_cruisecontrol_GoalOptimizer_has_unfixable_proposal_optimization_Value

This is to flag whether the cluster requires manual intervention to fix one of the goal violations.

[1] https://strimzi.io/docs/operators/latest/deploying#proc-tracking-cluster-rebalance-str
[2] https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-current-cluster-load
[3] https://github.com/linkedin/cruise-control/wiki/Sensors

kspmmitt Aug 12, 2025
Author

I was just saying that the default Grafana dashboard provided by Strimzi (https://github.com/strimzi/strimzi-kafka-operator/blob/0.39.0/examples/metrics/grafana-dashboards/strimzi-cruise-control.json) exposes the metric kafka_cruisecontrol_KafkaCruiseControlServlet_REBALANCE_request_rate_Count, but it is not the right metric to expose. It is not giving real rebalance execution count, so I don't know why it is exposed, or when and how to use it, since the rate it reports covers more than executions.

So I was just highlighting that this probably should not be exposed at all. Or have I misunderstood something?

kyguy Aug 12, 2025
Collaborator

but it is not the right metric to expose,

What would be the right metric to expose here?

Note that this dashboard example only exposes metrics that are provided by Cruise Control "sensors".

it is not giving real rebalance execution count

From what I understand from the upstream Cruise Control wiki, this metric isn't supposed to be the number of rebalances executed. It is supposed to be the average number of HTTP requests to Cruise Control's "REBALANCE" endpoint.

Answer selected by kspmmitt
