Distribution metrics #1023

@humanagent

Problem

Current monitoring relies on average metrics (avg:xmtp.sdk.duration, avg:xmtp.sdk.delivery). Averages mask performance outliers and are too coarse for SLA monitoring:

  • Average response times hide tail latencies
  • Cannot detect issues affecting small user percentages
  • Difficult to set meaningful percentile-based SLA thresholds
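To make the problem concrete, here is a minimal sketch comparing an average with nearest-rank percentiles on a made-up sample. The sample values and the helper names `percentile` and `mean` are assumptions for illustration only; Datadog computes its percentiles server-side with its own method.

```typescript
// Nearest-rank percentile over a sample of latencies in milliseconds.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("empty sample");
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing so integer inputs stay exact.
  const rank = Math.max(Math.ceil((p * sorted.length) / 100), 1);
  return sorted[rank - 1];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical sample: 95 fast requests (100 ms) and 5 slow ones (2000 ms).
const durations: number[] = [
  ...Array<number>(95).fill(100),
  ...Array<number>(5).fill(2000),
];

const summary = {
  avg: mean(durations), // 195
  p50: percentile(durations, 50), // 100
  p95: percentile(durations, 95), // 100
  p99: percentile(durations, 99), // 2000
};
console.log(summary);
```

In this sample the average (195 ms) looks healthy even though 5% of requests take 2 seconds; only p99 exposes them.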

Solution

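Replace the average metrics with Datadog distribution metrics and query percentiles (p50, p95, p99) to capture the full performance picture. Distribution metrics differ from Agent-side histograms: they are aggregated server-side across all hosts, and percentile aggregations have to be enabled for each metric before percentile queries return data.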

# Current: avg:xmtp.sdk.duration{...}
# New: p50:xmtp.sdk.duration{...}, p95:xmtp.sdk.duration{...}, p99:xmtp.sdk.duration{...}

Update the sendMetric() function, dashboard widgets, and alerting thresholds to use percentile data.
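One possible shape for the sendMetric() change is sketched below. Everything here is an assumption: the `MetricsClient` interface, the `sendMetric` signature, and the `RecordingClient` stand-in are hypothetical and not taken from the existing code. A real Datadog client that exposes a distribution-style call would need to be adapted to this interface.

```typescript
// Hypothetical minimal client interface, not the actual client used in the repo.
interface MetricsClient {
  distribution(name: string, value: number, tags?: string[]): void;
}

type MetricName = "duration" | "delivery";

// Hypothetical signature; the real sendMetric() may take different arguments.
function sendMetric(
  client: MetricsClient,
  name: MetricName,
  value: number,
  tags: Record<string, string> = {},
): void {
  // Skip NaN/Infinity so bad samples never reach the metrics backend.
  if (!Number.isFinite(value)) return;
  const tagList = Object.keys(tags).map((key) => `${key}:${tags[key]}`);
  // Submit the raw value as a distribution point instead of a pre-averaged gauge.
  client.distribution(`xmtp.sdk.${name}`, value, tagList);
}

// In-memory stand-in for local testing; it only records what would be sent.
class RecordingClient implements MetricsClient {
  readonly points: { name: string; value: number; tags: string[] }[] = [];
  distribution(name: string, value: number, tags: string[] = []): void {
    this.points.push({ name, value, tags });
  }
}

const client = new RecordingClient();
sendMetric(client, "duration", 182, { region: "us-east" });
sendMetric(client, "delivery", Number.NaN); // dropped
console.log(client.points);
```

Sending every raw sample, rather than a locally computed average, is what lets the backend derive percentiles over the full population.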

Benefits

  • Better outlier detection: Spot performance issues affecting small user percentages
  • More meaningful SLAs: Set targets like "p95 response time < 500ms"
  • Early warning: Detect gradual performance degradation in tail latencies
  • Regional insights: Compare distribution shapes across regions
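The SLA and regional points can be illustrated with a small sketch that groups samples by region and checks each region's p95 against a 500 ms target. The sample data, region names, and the `slaByRegion` helper are hypothetical; in practice the comparison would live in a Datadog monitor or dashboard query rather than in application code.

```typescript
type Sample = { region: string; durationMs: number };
type RegionReport = { p95: number; ok: boolean };

// Nearest-rank p95 over a non-empty list of durations.
function p95(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((95 * sorted.length) / 100), 1);
  return sorted[rank - 1];
}

// Group samples by region and compare each region's p95 to the threshold.
function slaByRegion(samples: Sample[], thresholdMs: number): Map<string, RegionReport> {
  const grouped = new Map<string, number[]>();
  for (const sample of samples) {
    const list = grouped.get(sample.region) ?? [];
    list.push(sample.durationMs);
    grouped.set(sample.region, list);
  }
  const report = new Map<string, RegionReport>();
  grouped.forEach((values, region) => {
    const value = p95(values);
    report.set(region, { p95: value, ok: value < thresholdMs });
  });
  return report;
}

// Hypothetical data: one healthy region, one with a slow 10% tail.
const samples: Sample[] = [
  ...Array.from({ length: 20 }, () => ({ region: "us-east", durationMs: 120 })),
  ...Array.from({ length: 18 }, () => ({ region: "eu-west", durationMs: 150 })),
  ...Array.from({ length: 2 }, () => ({ region: "eu-west", durationMs: 900 })),
];

const report = slaByRegion(samples, 500);
report.forEach((entry, region) => console.log(region, entry));
```

Here eu-west breaches the target because its slowest 10% of requests take 900 ms, even though its average (225 ms) is well under 500 ms.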

Metadata

Labels: enhancement (New feature or request)