Commit 0aff185

Merge pull request #29 from qonto/group-alerts-for-long-running-queries-to-avoid-alert-spam
feat: raise alert on long running queries per user instead of single pid
2 parents: 438aa6f + 3716026

7 files changed: 57 additions, 52 deletions

charts/prometheus-postgresql-alerts/prometheus_tests/PostgreSQLLongRunningQuery.yml renamed to charts/prometheus-postgresql-alerts/prometheus_tests/PostgreSQLLongRunningQueries.yml

Lines changed: 5 additions & 6 deletions
@@ -5,22 +5,21 @@ evaluation_interval: 1m

 tests:

-  - name: PostgreSQLLongRunningQuery
+  - name: PostgreSQLLongRunningQueries
     interval: 1m
     input_series:
       - series: 'pg_active_backend_duration_minutes{target="db1",datname="unittest",usename="test",pid="1234"}'
         values: 40+1x10
     alert_rule_test:
-      - alertname: PostgreSQLLongRunningQuery
+      - alertname: PostgreSQLLongRunningQueries
         eval_time: 1m
         exp_alerts:
           - exp_labels:
               target: db1
               datname: unittest
               usename: test
               severity: warning
-              pid: 1234
             exp_annotations:
-              summary: "Long running query on unittest of db1"
-              description: "test is running a long query on unittest of db1 with pid 1234"
-              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLLongRunningQuery"
+              summary: "Long running queries on unittest of db1 initiated by test"
+              description: "test is running long queries on unittest of db1"
+              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLLongRunningQueries"
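
The test above feeds the rule a series written in promtool's expanding notation and expects the alert at `eval_time: 1m`. As a rough illustration of why that timing works, here is a minimal Python sketch that expands `40+1x10` and replays the `> 30` threshold and `for: 1m` hold from the rule in `values.yaml`. The helper names `expand_series` and `first_firing_minute` are hypothetical, and the sketch assumes one sample per minute; it is a simplified model, not Prometheus's actual evaluation code.

```python
import re


def expand_series(notation: str) -> list[float]:
    """Expand the 'a+bxn' / 'a-bxn' shorthand into n+1 samples."""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)", notation)
    if match is None:
        raise ValueError(f"unsupported notation: {notation!r}")
    start, sign, step, count = match.groups()
    delta = float(step) if sign == "+" else -float(step)
    return [float(start) + delta * i for i in range(int(count) + 1)]


def first_firing_minute(samples, threshold=30, hold_minutes=1):
    """Return the first minute at which the alert would fire, or None.

    Assumes one sample per minute. The alert becomes pending when the value
    first exceeds the threshold and fires once it has stayed above the
    threshold for at least hold_minutes.
    """
    pending_since = None
    for minute, value in enumerate(samples):
        if value > threshold:
            if pending_since is None:
                pending_since = minute
            if minute - pending_since >= hold_minutes:
                return minute
        else:
            pending_since = None  # condition cleared, reset the hold timer
    return None


samples = expand_series("40+1x10")
print(samples[0], samples[-1], len(samples))
print(first_firing_minute(samples))
```

Under these assumptions the series starts at 40 minutes, already above the threshold, so the alert is pending at minute 0 and firing at minute 1, which is the instant the test checks.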

charts/prometheus-postgresql-alerts/values.yaml

Lines changed: 4 additions & 4 deletions
@@ -77,14 +77,14 @@ rules:
       summary: "Physical replication slot is inactive"
       description: "{{ $labels.slot_name }} on {{ $labels.target }} is inactive"

-  PostgreSQLLongRunningQuery:
-    expr: max by (target, datname, usename, pid) (pg_active_backend_duration_minutes{usename!=""}) > 30
+  PostgreSQLLongRunningQueries:
+    expr: max by (target, datname, usename) (pg_active_backend_duration_minutes{usename!=""}) > 30
     for: 1m
     labels:
       severity: warning
     annotations:
-      summary: "Long running query on {{ $labels.datname }} of {{ $labels.target }}"
-      description: "{{ $labels.usename }} is running a long query on {{ $labels.datname }} of {{ $labels.target }} with pid {{ $labels.pid }}"
+      summary: "Long running queries on {{ $labels.datname }} of {{ $labels.target }} initiated by {{ $labels.usename }}"
+      description: "{{ $labels.usename }} is running long queries on {{ $labels.datname }} of {{ $labels.target }}"
     pintComments:
       - disable promql/series
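
Dropping `pid` from the `max by (...)` clause is what turns one alert per backend into one alert per user. The following Python sketch imitates that grouping on made-up samples to show the effect. The sample values, the extra PIDs, and the `max_by` helper are assumptions for illustration only; Prometheus evaluates the real expression itself.

```python
# Hypothetical samples of pg_active_backend_duration_minutes: (labels, value).
samples = [
    ({"target": "db1", "datname": "unittest", "usename": "test", "pid": "1234"}, 45.0),
    ({"target": "db1", "datname": "unittest", "usename": "test", "pid": "5678"}, 62.0),
    ({"target": "db1", "datname": "unittest", "usename": "other", "pid": "4321"}, 12.0),
    ({"target": "db1", "datname": "unittest", "usename": "", "pid": "99"}, 90.0),
]


def max_by(samples, group_labels, threshold=30):
    """Imitate `max by (group_labels) (metric{usename!=""}) > threshold`."""
    groups = {}
    for labels, value in samples:
        if labels["usename"] == "":
            continue  # mirrors the usename!="" matcher
        key = tuple(labels[name] for name in group_labels)
        groups[key] = max(value, groups.get(key, value))
    # Keep only the groups whose maximum exceeds the threshold.
    return {key: value for key, value in groups.items() if value > threshold}


per_pid = max_by(samples, ("target", "datname", "usename", "pid"))
per_user = max_by(samples, ("target", "datname", "usename"))

print(len(per_pid))   # one result per long-running backend
print(per_user)       # one result per user, carrying the longest duration
```

With the old grouping the two backends owned by `test` would each raise an alert. With the new grouping they collapse into a single result whose value is the longest duration, and the `pid` label is no longer available to the annotations.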

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+---
+title: Long running queries
+---
+
+# PostgreSQLLongRunningQueries
+
+## Meaning
+
+Alert is triggered when SQL queries run for an extended period.
+
+## Impact
+
+- Block WAL file rotation
+
+- Could block vacuum operations
+
+- Could block other queries due to locks
+
+- Could lead to replication lag on replica
+
+## Diagnosis
+
+1. Open `PostgreSQL server live` dashboard
+
+1. Click on the queries to get details
+
+## Mitigation
+
+1. Identify the PIDs of the long running queries
+
+{{< details title="SQL" open=false >}}
+{{% sql "../postgresql/sql/list-long-running-transactions.sql" %}}
+{{< /details >}}
+
+1. Cancel the queries
+
+{{% sql "sql/cancel_backend.sql" %}}
+
+1. If queries do not get cancelled, kill them
+
+{{% sql "sql/terminate_backend.sql" %}}
+
+## Additional resources
+
+n/a

content/runbooks/postgresql/PostgreSQLLongRunningQuery.md

Lines changed: 0 additions & 39 deletions
This file was deleted.

content/runbooks/postgresql/SQLExporterScrapingLimit.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ The monitoring system is degraded. SQL exporter does not collect SQL metrics, al
 1. Identify and kill heavy queries

 <details>
-<summary>How terminate a query?</summary>
+<summary>How to terminate queries?</summary>

 {{% sql "sql/terminate_backend.sql" %}}

content/runbooks/postgresql/sql/cancel_backend.sql

Lines changed: 1 addition & 1 deletion
@@ -11,4 +11,4 @@ SELECT
 state_change,
 query
 FROM pg_stat_activity
-WHERE pid = <replace_with_pid>;
+WHERE pid in ('<replace_with_pids>');

content/runbooks/postgresql/sql/terminate_backend.sql

Lines changed: 1 addition & 1 deletion
@@ -11,4 +11,4 @@ SELECT
 state_change,
 query
 FROM pg_stat_activity
-WHERE pid = <replace_with_pids>;
+WHERE pid in ('<replace_with_pids>');
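
Both SQL snippets now take a list of PIDs through the `'<replace_with_pids>'` placeholder. Since `pid` is an integer column in `pg_stat_activity`, pasting several PIDs inside the single pair of quotes would produce one string literal rather than a list of integers, which PostgreSQL would most likely reject. The sketch below shows one way to fill the placeholder, assuming the quotes are replaced together with it. `format_pid_list` and `fill_placeholder` are hypothetical helpers, not part of the repository.

```python
def format_pid_list(pids):
    """Render PIDs as comma-separated integer literals for an IN clause."""
    cleaned = []
    for pid in pids:
        value = int(pid)  # raises ValueError for anything that is not an integer
        if value <= 0:
            raise ValueError(f"invalid PID: {pid!r}")
        cleaned.append(str(value))
    if not cleaned:
        raise ValueError("at least one PID is required")
    return ", ".join(cleaned)


def fill_placeholder(sql, pids):
    """Replace the quoted placeholder, quotes included, with the PID list."""
    return sql.replace("'<replace_with_pids>'", format_pid_list(pids))


where_clause = "WHERE pid in ('<replace_with_pids>');"
print(fill_placeholder(where_clause, [1234, "5678"]))
```

Converting every entry with `int()` before it reaches the statement also keeps stray text out of the query, which matters when the PIDs are copied by hand from a dashboard.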
