Building Availability SLO for Kafka Cluster Utilizing Strimzi-Canary Metrics

I am building a Service Level Objective (SLO) for our Kafka cluster's availability based on Strimzi-Canary metrics. The aim is to have two separate SLO resources: one covering production and one covering consumption.

For the production SLI (Service Level Indicator), I plan to use strimzi_canary_records_produced as the total event count and strimzi_canary_records_produced_failed as the failed event count.
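To make the production SLI concrete, here is a minimal sketch of the arithmetic, assuming the two counters have already been read as increases over the SLO window. The function names, the choice to treat an empty window as fully available, and the 99.9% target in the usage lines are illustrative assumptions, not something the canary prescribes.

```python
def availability_sli(total_events: float, failed_events: float) -> float:
    """Return the fraction of good events, (total - failed) / total.

    Assumption: a window with no events counts as fully available.
    """
    if total_events <= 0:
        return 1.0
    good_events = max(total_events - failed_events, 0.0)
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Return the share of the error budget still unspent (negative if overspent)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        # A 100% target leaves no budget to spend.
        return 0.0
    return 1.0 - (1.0 - sli) / budget


# Hypothetical window: 10,000 records produced, 5 of them failed.
sli = availability_sli(total_events=10_000, failed_events=5)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"production SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

Counting good events as total minus failed keeps the SLI in the usual "good / valid" form, which most SLO tooling expects.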

For the consumption SLI, however, there doesn't seem to be a direct equivalent of the 'failed' metric that exists for production. The closest metric I can find is consumer_error_total.
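One possible shape for the consumption side, sketched here as PromQL strings built in Python, is to use consumer_error_total as the bad-event counter and a consumed-records counter as the total. The name strimzi_canary_records_consumed is an assumption made by analogy with the production metric and should be checked against what the canary actually exports. Note also that consumer errors are counted per error rather than per record, so this ratio is only an approximation of a per-record failure rate.

```python
def error_ratio_query(failed_metric: str, total_metric: str, window: str = "5m") -> str:
    """Build a PromQL expression for failed/total as a ratio of per-second rates."""
    return (
        f"sum(rate({failed_metric}[{window}]))"
        f" / sum(rate({total_metric}[{window}]))"
    )


# Production: both metric names are taken from the text above.
production_query = error_ratio_query(
    "strimzi_canary_records_produced_failed",
    "strimzi_canary_records_produced",
)

# Consumption: the total metric name below is an ASSUMPTION, not confirmed.
consumption_query = error_ratio_query(
    "consumer_error_total",
    "strimzi_canary_records_consumed",
)

print(production_query)
print(consumption_query)
```

Building both queries from one helper keeps the two SLOs structurally identical, so they can share the same target and alerting rules if that turns out to be appropriate.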

I would love to hear your thoughts on this approach and any suggestions for defining the consumption SLO. Are there more suitable metrics or methods I should consider?

