Open
Labels
enhancement (New feature or request)
Description
Problem
Current monitoring uses average metrics (avg:xmtp.sdk.duration, avg:xmtp.sdk.delivery), which mask performance outliers and don't provide granular insights for SLA monitoring.
- Average response times hide tail latencies
- Cannot detect issues affecting small user percentages
- Difficult to set meaningful percentile-based SLA thresholds
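To make the masking effect concrete, here is a small sketch with made-up numbers (the sample values and the helper names `percentile` and `mean` are illustrative assumptions, not taken from the codebase). It uses a nearest-rank percentile, which is not necessarily how Datadog computes its own percentiles.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point surprises in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Hypothetical data: 97 fast requests (100 ms) and 3 slow ones (3000 ms).
const durationsMs: number[] = [
  ...Array.from({ length: 97 }, () => 100),
  ...Array.from({ length: 3 }, () => 3000),
];

const summary = {
  avg: mean(durationsMs),
  p50: percentile(durationsMs, 50),
  p95: percentile(durationsMs, 95),
  p99: percentile(durationsMs, 99),
};

console.log(summary);
```

In this invented sample the average is 187 ms and p95 is 100 ms, so both look healthy, while p99 is 3000 ms: the 3% of slow requests only become visible at the tail percentile.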
Solution
Replace average metrics with distribution metrics using Datadog histograms and percentiles (p50, p95, p99) to capture the full performance picture.
# Current: avg:xmtp.sdk.duration{...}
# New: p50:xmtp.sdk.duration{...}, p95:xmtp.sdk.duration{...}, p99:xmtp.sdk.duration{...}
Update the sendMetric() function, dashboard widgets, and alerting thresholds to use percentile data.
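One possible shape for the sendMetric() change is sketched below. The real signature of sendMetric() is not shown in this issue, so everything here is an assumption: the `MetricsClient` interface, the `RecordingClient` stand-in, the parameter list, and the `region` tag are all hypothetical. The idea is only that each sample is submitted as a distribution point rather than a pre-averaged gauge, so percentiles can be computed on the Datadog side.

```typescript
// Hypothetical minimal client interface; a real Datadog client would need a
// thin wrapper to fit it.
interface MetricsClient {
  distribution(name: string, value: number, tags: string[]): void;
}

// In-memory stand-in so the sketch runs without network access or API keys.
class RecordingClient implements MetricsClient {
  readonly points: { name: string; value: number; tags: string[] }[] = [];

  distribution(name: string, value: number, tags: string[]): void {
    this.points.push({ name, value, tags });
  }
}

// Assumed signature: the actual sendMetric() in the codebase may differ.
function sendMetric(
  client: MetricsClient,
  metricName: string,
  value: number,
  tags: Record<string, string> = {},
): void {
  // Drop NaN and Infinity so a bad measurement cannot skew the distribution.
  if (!Number.isFinite(value)) return;
  // Convert { key: "value" } into Datadog-style "key:value" tag strings.
  const tagList = Object.entries(tags).map(([k, v]) => `${k}:${v}`);
  client.distribution(`xmtp.sdk.${metricName}`, value, tagList);
}

const client = new RecordingClient();
sendMetric(client, "duration", 231, { region: "us-east" });
sendMetric(client, "delivery", Number.NaN); // ignored: not a finite number
console.log(client.points);
```

Passing the client in as a parameter keeps the function testable without a live Datadog connection. Note that percentile aggregations on a distribution metric typically have to be enabled in Datadog before queries such as p95:xmtp.sdk.duration{...} return data.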
Benefits
- Better outlier detection: Spot performance issues affecting small user percentages
- More meaningful SLAs: Set targets like "p95 response time < 500ms"
- Early warning: Detect gradual performance degradation in tail latencies
- Regional insights: Compare distribution shapes across regions