Skip to content

✨ Performance Alerting #2081

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dtfranz
Copy link
Contributor

@dtfranz dtfranz commented Jul 8, 2025

Description

Introduces an early-warning series of prometheus alerts to attempt to catch issues with performance at an early stage in development.

As the e2e tests run, the installed prometheus instance is scraping metrics from catalogd and operator-controller, and will fire alerts based on rules introduced in this PR. Since we're running these tests on the github runners which do not have consistent performance, our alerts must be based on platform-independent metrics and are therefore limited. Any other ideas for metrics to check on this PR are appreciated!

Once the e2e tests finish, prometheus is queried for active alerts. Any alerts found in pending state will result in a warning being set on the e2e workflow. Any alerts in firing state will give an error. These errors do not (at the moment) fail the run, but are visible when the workflow details are viewed.

For instance:

Prometheus Alert Pending
operator-controller-memory-growth: operator-controller pod memory usage growing at a high rate for 5 minutes: 72.86kB/sec

I am still in the process of tuning the alerts, so at the moment I am not making this a required check.

Potential Enhancements:

  • Additional alerts, if any
  • Fine-tune the alerts and fail runs when they fire
  • Remove yaml from script and organize into an additional kustomization component
  • Output metrics as a mermaid XY plot in the workflow summary

Closes #1904
Closes #1905

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@dtfranz dtfranz requested a review from a team as a code owner July 8, 2025 14:50
@openshift-ci openshift-ci bot requested review from perdasilva and trgeiger July 8, 2025 14:50
Copy link

openshift-ci bot commented Jul 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dtfranz dtfranz force-pushed the metrics-alerting branch from 47d8c3b to 97a268f Compare July 8, 2025 14:51
Introduces an early-warning series of prometheus alerts to attempt to catch issues with performance at an early stage in development.

Signed-off-by: Daniel Franz <dfranz@redhat.com>
@dtfranz dtfranz force-pushed the metrics-alerting branch from 97a268f to cb81424 Compare July 8, 2025 14:53
Copy link

netlify bot commented Jul 8, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit cb81424
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/686d353570003f000886140e
😎 Deploy Preview https://deploy-preview-2081--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Comment on lines +288 to +290
e2e-metrics: #EXHELP Request metrics from prometheus; select only actively firing alerts; place in ARTIFACT_PATH if set
ALERTS_FILE_PATH=$(if $(ARTIFACT_PATH),$(ARTIFACT_PATH),.)/alerts.out
curl -X GET http://localhost:30900/api/v1/alerts | jq 'if (.data.alerts | length) > 0 then .data.alerts.[] else empty end' > $(ALERTS_FILE_PATH)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be complaining because ALERTS_FILE_PATH is indented, so it's not considered a make variable. So you need braces {} (and some more sh magic) or unindent the line and use :=

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Select detection criteria for CI failure of e2e metrics job Create/Modify upstream CI job
2 participants