This repository was archived by the owner on Sep 4, 2024. It is now read-only.

Reduce False Positive alarms #420

Merged
merged 5 commits into from
May 13, 2024

Collaborator

@codyborn codyborn commented May 10, 2024

This change is part of an effort to reduce noise in our operations alerting. Had it been deployed 2 weeks ago, we would have had 10x fewer incidents from URA.

URA is the biggest contributor to our on-call incidents (based on a sample of the last 25 incidents).
image

10/12 URA alarms were UnifiedRoutingAPI-SEV2-LatencyP99.

Most of these were caused by a latency spike in the Routing API (a dependency of URA). There is usually no action we can take, so these alarms only add to the noise.

This PR proposes two changes:

  1. Creates new conditional alerts that are both sensitive and specific. They trigger on a short time frame, but do not trigger when the Routing API is also experiencing high latency.
  2. Makes the existing simple latency alerts less sensitive (period increased from 5 minutes to 20 minutes).
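
The suppression logic in item 1 can be sketched in plain TypeScript. This is only an illustration of the decision rule, not the CloudWatch configuration from this PR: the function name, the `LatencySample` type, and the use of milliseconds are assumptions, and the threshold values are borrowed from the numbers mentioned in the review thread.

```typescript
// Hypothetical thresholds (assumed to be milliseconds); illustrative only.
const URA_LATENCY_THRESHOLD_MS = 7000;
const ROUTING_API_LATENCY_THRESHOLD_MS = 4000;

// Hypothetical shape for one evaluation window of latency data.
interface LatencySample {
  uraP99Ms: number;        // URA latency for the window
  routingApiP99Ms: number; // Routing API (dependency) latency for the same window
}

// Fire only when URA is slow AND the Routing API is not also slow.
// If the dependency is slow too, the URA alarm is suppressed because
// there is usually nothing actionable on the URA side.
function conditionalAlarmFires(sample: LatencySample): boolean {
  const uraHigh = sample.uraP99Ms > URA_LATENCY_THRESHOLD_MS;
  const routingApiHigh = sample.routingApiP99Ms > ROUTING_API_LATENCY_THRESHOLD_MS;
  return uraHigh && !routingApiHigh;
}

// URA slow, dependency healthy: the alarm fires.
console.log(conditionalAlarmFires({ uraP99Ms: 8000, routingApiP99Ms: 1000 }));
// URA slow, dependency also slow: the alarm is suppressed.
console.log(conditionalAlarmFires({ uraP99Ms: 8000, routingApiP99Ms: 5000 }));
```

The trade-off is that a genuine URA regression that coincides with a Routing API spike would not page through this alert, which is presumably why the simple latency alerts are kept, with a longer period, as a backstop.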

This is what the new conditional alerts look like when deployed:
[Screenshot 2024-05-10 at 4:43:51 PM]

Collaborator

@zhongeric zhongeric left a comment

Amazing!! This is much needed

label: 'Latency Alarm',
usingMetrics: {
  ura_high_latency: new aws_cloudwatch.MathExpression({
    expression: "IF(overall_latency > 7000, 1, 0)",
Collaborator

nit: would be nice to put these numbers into constants (URA_LATENCY_SEV3 = 7000, ROUTING_API_LATENCY_SEV3 = 4000, etc.) so we can change them all in one place

Collaborator Author

Great callout! Moved these magic numbers to the constants file.
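
A minimal sketch of what the suggested refactor might look like. The constant names and values come from the review comment above; the actual contents of the constants file are not shown in this thread, and the `highLatencyExpression` helper is hypothetical, introduced only to show the thresholds being interpolated into the metric math string instead of hard-coded.

```typescript
// Names and values taken from the reviewer's suggestion; the real
// constants file may differ.
const URA_LATENCY_SEV3 = 7000;
const ROUTING_API_LATENCY_SEV3 = 4000;

// Hypothetical helper: builds an expression string of the same form as
// the one in the diff, yielding 1 when the metric exceeds the threshold
// and 0 otherwise.
function highLatencyExpression(metricId: string, threshold: number): string {
  return `IF(${metricId} > ${threshold}, 1, 0)`;
}

// Reproduces the literal expression from the reviewed snippet.
const uraExpression = highLatencyExpression("overall_latency", URA_LATENCY_SEV3);
console.log(uraExpression); // IF(overall_latency > 7000, 1, 0)
```

With the thresholds defined once, a change to a severity level only needs to be made in one place, which is the point of the reviewer's nit.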

@codyborn codyborn merged commit de066cc into main May 13, 2024
6 checks passed
@codyborn codyborn deleted the reduce_fp_alarms branch May 13, 2024 12:24
2 participants