This repository was archived by the owner on Nov 29, 2024. It is now read-only.

Building Availability SLO for Kafka Cluster Utilizing Strimzi-Canary Metrics #219

@OuesFa

I am building a Service Level Objective (SLO) for our Kafka cluster's availability using Strimzi-Canary metrics. The aim is to define two separate SLOs: one for production and one for consumption.

For the production SLI (Service Level Indicator), the plan is to use strimzi_canary_records_produced as the count of total events and strimzi_canary_records_produced_failed as the count of failed events.
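To make the production SLI concrete, here is a minimal Python sketch of the arithmetic, not a Strimzi-Canary API. It assumes that `produced_delta` and `failed_delta` are the increases of strimzi_canary_records_produced and strimzi_canary_records_produced_failed over the same time window, and that the produced counter includes failed attempts. The function names and the 99% target are hypothetical.

```python
def production_sli(produced_delta: float, failed_delta: float) -> float:
    """Fraction of produce attempts that succeeded in the window.

    Assumes produced_delta counts all attempts, including failed ones.
    """
    if produced_delta <= 0:
        # No traffic in the window: treated here as "no evidence of failure".
        # This is a policy choice (assumption), not something the canary defines.
        return 1.0
    good = max(produced_delta - failed_delta, 0.0)
    return good / produced_delta


def error_budget_remaining(sli: float, target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, <0 = overspent)."""
    allowed_failure = 1.0 - target
    if allowed_failure <= 0:
        raise ValueError("target must be below 1.0 to leave an error budget")
    return 1.0 - (1.0 - sli) / allowed_failure


if __name__ == "__main__":
    # Hypothetical window: 1000 attempts, 5 failures, 99% availability target.
    sli = production_sli(produced_delta=1000, failed_delta=5)
    print(f"SLI: {sli:.4f}")
    print(f"Budget remaining: {error_budget_remaining(sli, target=0.99):.2f}")
```

If the produced counter turns out to count only successful sends, the denominator would instead need to be the sum of the two counters.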

For the consumption SLI, however, there doesn't seem to be a direct equivalent of the 'failed' metric that exists for production. The closest metric I can find is consumer_error_total.
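One possible reading, sketched below in Python, is to treat consumer_error_total as the "bad events" counter and build the total from successes plus errors. This rests on two unverified assumptions: that a counter of successfully consumed records is available (called `consumed_delta` here, a hypothetical name), and that each consumer error corresponds to roughly one failed consume attempt. If errors are counted per poll rather than per record, the ratio would not be a true per-record availability.

```python
def consumption_sli(consumed_delta: float, error_delta: float) -> float:
    """Fraction of consume attempts that succeeded in the window.

    consumed_delta: increase of a (hypothetical) successful-consume counter.
    error_delta: increase of consumer_error_total over the same window.
    """
    total = consumed_delta + error_delta
    if total <= 0:
        # No consume activity observed: treated as "no evidence of failure".
        # This is an assumed policy, not defined by the canary.
        return 1.0
    return consumed_delta / total


if __name__ == "__main__":
    # Hypothetical window: 990 records consumed, 10 consumer errors.
    print(f"Consumption SLI: {consumption_sli(990, 10):.4f}")
```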

I would love to hear your thoughts on this approach and any suggestions on how to define the consumption SLO effectively. Is there a more suitable method, or are there other metrics that I should consider?
