Commit 3716026

feat: raise alert on long running queries per user instead of single pid
This commit changes the existing long running query alert so that it is raised per user rather than per pid. When a query holds a lock that blocks several other queries, the per-pid rule raises one alert for each blocked query, which spams our alert channels and creates noise. Aggregating at the user level makes the alerts easier to read and makes it quicker to see which apps and services are impacted.
1 parent 1328088 commit 3716026

7 files changed: 57 additions and 52 deletions


charts/prometheus-postgresql-alerts/prometheus_tests/PostgreSQLLongRunningQuery.yml renamed to charts/prometheus-postgresql-alerts/prometheus_tests/PostgreSQLLongRunningQueries.yml

Lines changed: 5 additions & 6 deletions
@@ -5,22 +5,21 @@ evaluation_interval: 1m

 tests:

-  - name: PostgreSQLLongRunningQuery
+  - name: PostgreSQLLongRunningQueries
     interval: 1m
     input_series:
       - series: 'pg_active_backend_duration_minutes{target="db1",datname="unittest",usename="test",pid="1234"}'
         values: 40+1x10
     alert_rule_test:
-      - alertname: PostgreSQLLongRunningQuery
+      - alertname: PostgreSQLLongRunningQueries
         eval_time: 1m
         exp_alerts:
           - exp_labels:
               target: db1
               datname: unittest
               usename: test
               severity: warning
-              pid: 1234
             exp_annotations:
-              summary: "Long running query on unittest of db1"
-              description: "test is running a long query on unittest of db1 with pid 1234"
-              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLLongRunningQuery"
+              summary: "Long running queries on unittest of db1 initiated by test"
+              description: "test is running long queries on unittest of db1"
+              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLLongRunningQueries"
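
The test above feeds the rule a series written in promtool's expanding notation (`40+1x10`) and expects the alert at `eval_time: 1m`. The following is a minimal Python sketch of why that holds, not promtool's implementation: `expand_series` and `firing_at` are hypothetical helpers, and the `for` handling is a simplified model that assumes one sample per 1m evaluation interval, a threshold of 30, and `for: 1m` as in the rule.

```python
import re


def expand_series(notation: str) -> list[float]:
    """Expand the 'a+bxn' / 'a-bxn' shorthand into n+1 samples (a, a+b, ..., a+n*b)."""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)", notation)
    if match is None:
        raise ValueError(f"unsupported notation: {notation!r}")
    start, sign, step, count = match.groups()
    delta = float(step) if sign == "+" else -float(step)
    return [float(start) + i * delta for i in range(int(count) + 1)]


def firing_at(samples: list[float], threshold: float, for_steps: int, eval_step: int) -> bool:
    """Simplified model of `for`: the condition must hold at eval_step and at the
    `for_steps` evaluations immediately before it."""
    if eval_step < for_steps or eval_step >= len(samples):
        return False
    window = samples[eval_step - for_steps : eval_step + 1]
    return all(value > threshold for value in window)


samples = expand_series("40+1x10")  # 11 samples, from 40 up to 50 minutes
# Step 0 is t=0m, step 1 is t=1m (the eval_time used by the test).
print(samples[:3], len(samples))
print(firing_at(samples, threshold=30, for_steps=1, eval_step=1))
```

Under this model the alert is only pending at t=0m and fires at t=1m, which is consistent with the `eval_time` the test asserts on.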

charts/prometheus-postgresql-alerts/values.yaml

Lines changed: 4 additions & 4 deletions
@@ -77,14 +77,14 @@ rules:
       summary: "Physical replication slot is inactive"
       description: "{{ $labels.slot_name }} on {{ $labels.target }} is inactive"

-  PostgreSQLLongRunningQuery:
-    expr: max by (target, datname, usename, pid) (pg_active_backend_duration_minutes{usename!=""}) > 30
+  PostgreSQLLongRunningQueries:
+    expr: max by (target, datname, usename) (pg_active_backend_duration_minutes{usename!=""}) > 30
     for: 1m
     labels:
       severity: warning
     annotations:
-      summary: "Long running query on {{ $labels.datname }} of {{ $labels.target }}"
-      description: "{{ $labels.usename }} is running a long query on {{ $labels.datname }} of {{ $labels.target }} with pid {{ $labels.pid }}"
+      summary: "Long running queries on {{ $labels.datname }} of {{ $labels.target }} initiated by {{ $labels.usename }}"
+      description: "{{ $labels.usename }} is running long queries on {{ $labels.datname }} of {{ $labels.target }}"
     pintComments:
       - disable promql/series
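
To illustrate the effect of dropping `pid` from the `max by (...)` clause, here is a small Python sketch that mimics the aggregation on made-up samples. It is an illustration only: `max_by` and `breaching_groups` are hypothetical helpers, and the sample data (three backends of one user, one short query, one backend with an empty `usename`) is invented for the example.

```python
from collections import defaultdict

THRESHOLD_MINUTES = 30  # matches the "> 30" comparison in the rule expression


def max_by(samples, keys):
    """Mimic PromQL `max by (keys) (...)`: keep the largest value per label group."""
    grouped = defaultdict(lambda: float("-inf"))
    for labels, value in samples:
        if labels.get("usename", "") == "":  # mirrors the usename!="" matcher
            continue
        group = tuple(labels[key] for key in keys)
        grouped[group] = max(grouped[group], value)
    return dict(grouped)


def breaching_groups(samples, keys):
    """Return the groups whose aggregated duration exceeds the threshold."""
    return {g: v for g, v in max_by(samples, keys).items() if v > THRESHOLD_MINUTES}


# Hypothetical data: one lock holder and two blocked queries from the same user.
samples = [
    ({"target": "db1", "datname": "unittest", "usename": "test", "pid": "1234"}, 45),
    ({"target": "db1", "datname": "unittest", "usename": "test", "pid": "1235"}, 44),
    ({"target": "db1", "datname": "unittest", "usename": "test", "pid": "1236"}, 43),
    ({"target": "db1", "datname": "unittest", "usename": "other", "pid": "2000"}, 5),
    ({"target": "db1", "datname": "unittest", "usename": "", "pid": "3000"}, 90),
]

per_pid = breaching_groups(samples, ("target", "datname", "usename", "pid"))
per_user = breaching_groups(samples, ("target", "datname", "usename"))
print(len(per_pid), "alerts when grouping by pid")
print(len(per_user), "alert when grouping by user:", per_user)
```

With this data the per-pid grouping yields three alerts while the per-user grouping yields one, carrying the longest duration among that user's backends.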

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+---
+title: Long running queries
+---
+
+# PostgreSQLLongRunningQueries
+
+## Meaning
+
+Alert is triggered when SQL queries run for an extended period.
+
+## Impact
+
+- Block WAL file rotation
+
+- Could block vacuum operations
+
+- Could block other queries due to locks
+
+- Could lead to replication lag on replica
+
+## Diagnosis
+
+1. Open `PostgreSQL server live` dashboard
+
+1. Click on the queries to get details
+
+## Mitigation
+
+1. Identify the PIDs of the long running queries
+
+{{< details title="SQL" open=false >}}
+{{% sql "../postgresql/sql/list-long-running-transactions.sql" %}}
+{{< /details >}}
+
+1. Cancel the queries
+
+{{% sql "sql/cancel_backend.sql" %}}
+
+1. If queries do not get cancelled, kill them
+
+{{% sql "sql/terminate_backend.sql" %}}
+
+## Additional resources
+
+n/a

content/runbooks/postgresql/PostgreSQLLongRunningQuery.md

Lines changed: 0 additions & 39 deletions
This file was deleted.

content/runbooks/postgresql/SQLExporterScrapingLimit.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ The monitoring system is degraded. SQL exporter does not collect SQL metrics, al
 1. Identify and kill heavy queries

 <details>
-<summary>How terminate a query?</summary>
+<summary>How to terminate queries?</summary>

 {{% sql "sql/terminate_backend.sql" %}}

content/runbooks/postgresql/sql/cancel_backend.sql

Lines changed: 1 addition & 1 deletion
@@ -11,4 +11,4 @@ SELECT
     state_change,
     query
 FROM pg_stat_activity
-WHERE pid = <replace_with_pid>;
+WHERE pid in ('<replace_with_pids>');

content/runbooks/postgresql/sql/terminate_backend.sql

Lines changed: 1 addition & 1 deletion
@@ -11,4 +11,4 @@ SELECT
     state_change,
     query
 FROM pg_stat_activity
-WHERE pid = <replace_with_pid>;
+WHERE pid in ('<replace_with_pids>');
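
Both SQL snippets now take a list of pids instead of a single one. As a rough sketch of how an operator might fill in that placeholder, the Python helper below renders validated pids as a comma-separated list. It assumes the whole quoted placeholder `'<replace_with_pids>'` is replaced by unquoted integers. That is an assumption about intended usage, since a single quoted string holding several pids would most likely fail to compare against the integer `pid` column. `render_pid_list` and `where_clause` are hypothetical names, not part of the repository.

```python
def render_pid_list(pids):
    """Return pids as 'p1, p2, ...' after checking each is a positive integer."""
    cleaned = []
    for pid in pids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(pid, bool) or not isinstance(pid, int) or pid <= 0:
            raise ValueError(f"invalid pid: {pid!r}")
        if pid not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(pid)
    if not cleaned:
        raise ValueError("at least one pid is required")
    return ", ".join(str(pid) for pid in cleaned)


def where_clause(pids):
    """Build the final line of the runbook query for the given pids."""
    return f"WHERE pid IN ({render_pid_list(pids)});"


print(where_clause([1234, 5678, 1234]))
```

Restricting the input to integers keeps arbitrary text out of the statement. A client library with bound parameters would be the more robust choice when the query is issued from code rather than pasted into a console.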
