// Module included in the following assemblies:
//
-// * monitoring/installing-monitoring-stack.adoc
+// * monitoring/configuring-the-monitoring-stack.adoc

[id="alerting-rules-{context}"]
-== Alerting rules
+= Alerting rules

-{product-title} Cluster Monitoring ships with the following alerting rules configured by default. Currently you cannot add custom alerting rules.
+{product-title} Cluster Monitoring ships with a set of pre-defined alerting rules by default.

-Some alerting rules have identical names. This is intentional. They are alerting about the same event with different thresholds, with different severity, or both. With the inhibition rules, the lower severity is inhibited when the higher severity is firing.
+Note the following:

-For more details on the alerting rules, see the link:https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml[configuration file].
+* The default alerting rules are used specifically for the {product-title} cluster and nothing else. For example, you get alerts for a persistent volume in the cluster, but you do not get them for a persistent volume in your custom namespace.
+* Currently you cannot add custom alerting rules.
+* Some alerting rules have identical names. This is intentional. They send alerts about the same event with different thresholds, different severities, or both, as shown in the first sketch after this list.
+* With the inhibition rules, the lower severity is inhibited when the higher severity is firing, as shown in the second sketch after this list.

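For illustration only, the following minimal sketch shows how two rules can share an alert name while differing only in threshold and severity. The alert name `KubeAPILatencyHigh` and the `warning`/`critical` pairing come from the default rules listed in the table below; the group name, metric name, expressions, thresholds, and durations are assumptions made up for this example and are not the shipped defaults.

[source,yaml]
----
# Hypothetical Prometheus rule group: two alerts named KubeAPILatencyHigh,
# one warning and one critical, distinguished only by threshold and severity.
groups:
- name: example-kube-apiserver.rules
  rules:
  - alert: KubeAPILatencyHigh
    # Illustrative metric and threshold, not the shipped expression.
    expr: apiserver_request_latency_p99_seconds > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      message: The API server has a 99th percentile latency of {{ $value }} seconds.
  - alert: KubeAPILatencyHigh
    expr: apiserver_request_latency_p99_seconds > 4
    for: 10m
    labels:
      severity: critical
    annotations:
      message: The API server has a 99th percentile latency of {{ $value }} seconds.
----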
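The inhibition behavior itself is configured in Alertmanager. The following is a minimal sketch of an inhibition rule that mutes a warning-level alert while a critical-level alert with the same name is firing. The `inhibit_rules`, `source_match`, `target_match`, and `equal` keys are standard Alertmanager configuration fields; the choice of `alertname` and `namespace` as the matching labels is an assumption for illustration, not the configuration shipped with the cluster.

[source,yaml]
----
# Hypothetical Alertmanager configuration excerpt: while a critical alert is
# firing, suppress the warning alert that carries the same alertname and
# namespace labels.
inhibit_rules:
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal:
  - alertname
  - namespace
----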
-[options="header"]
-|===
-|Alert|Severity|Description
-|`ClusterMonitoringOperatorErrors`|`critical`|Cluster Monitoring Operator is experiencing _X_% errors.
-|`AlertmanagerDown`|`critical`|Alertmanager has disappeared from Prometheus target discovery.
-|`ClusterMonitoringOperatorDown`|`critical`|ClusterMonitoringOperator has disappeared from Prometheus target discovery.
-|`KubeAPIDown`|`critical`|KubeAPI has disappeared from Prometheus target discovery.
-|`KubeControllerManagerDown`|`critical`|KubeControllerManager has disappeared from Prometheus target discovery.
-|`KubeSchedulerDown`|`critical`|KubeScheduler has disappeared from Prometheus target discovery.
-|`KubeStateMetricsDown`|`critical`|KubeStateMetrics has disappeared from Prometheus target discovery.
-|`KubeletDown`|`critical`|Kubelet has disappeared from Prometheus target discovery.
-|`NodeExporterDown`|`critical`|NodeExporter has disappeared from Prometheus target discovery.
-|`PrometheusDown`|`critical`|Prometheus has disappeared from Prometheus target discovery.
-|`PrometheusOperatorDown`|`critical`|PrometheusOperator has disappeared from Prometheus target discovery.
-|`KubePodCrashLooping`|`critical`|_Namespace/Pod_ (_Container_) is restarting _times_ / second
-|`KubePodNotReady`|`critical`|_Namespace/Pod_ is not ready.
-|`KubeDeploymentGenerationMismatch`|`critical`|Deployment _Namespace/Deployment_ generation mismatch
-|`KubeDeploymentReplicasMismatch`|`critical`|Deployment _Namespace/Deployment_ replica mismatch
-|`KubeStatefulSetReplicasMismatch`|`critical`|StatefulSet _Namespace/StatefulSet_ replica mismatch
-|`KubeStatefulSetGenerationMismatch`|`critical`|StatefulSet _Namespace/StatefulSet_ generation mismatch
-|`KubeDaemonSetRolloutStuck`|`critical`|Only _X_% of desired pods scheduled and ready for daemon set _Namespace/DaemonSet_
-|`KubeDaemonSetNotScheduled`|`warning`|A number of pods of daemonset _Namespace/DaemonSet_ are not scheduled.
-|`KubeDaemonSetMisScheduled`|`warning`|A number of pods of daemonset _Namespace/DaemonSet_ are running where they are not supposed to run.
-|`KubeCronJobRunning`|`warning`|CronJob _Namespace/CronJob_ is taking more than 1h to complete.
-|`KubeJobCompletion`|`warning`|Job _Namespaces/Job_ is taking more than 1h to complete.
-|`KubeJobFailed`|`warning`|Job _Namespaces/Job_ failed to complete.
-|`KubeCPUOvercommit`|`warning`|Overcommitted CPU resource requests on Pods, cannot tolerate node failure.
-|`KubeMemOvercommit`|`warning`|Overcommitted Memory resource requests on Pods, cannot tolerate node failure.
-|`KubeCPUOvercommit`|`warning`|Overcommitted CPU resource request quota on Namespaces.
-|`KubeMemOvercommit`|`warning`|Overcommitted Memory resource request quota on Namespaces.
-|`KubeQuotaExceeded`|`warning`|_X_% usage of _Resource_ in namespace _Namespace_.
-|`KubePersistentVolumeUsageCritical`|`critical`|The persistent volume claimed by _PersistentVolumeClaim_ in namespace _Namespace_ has _X_% free.
-|`KubePersistentVolumeFullInFourDays`|`critical`|Based on recent sampling, the persistent volume claimed by _PersistentVolumeClaim_ in namespace _Namespace_ is expected to fill up within four days. Currently _X_ bytes are available.
-|`KubeNodeNotReady`|`warning`|_Node_ has been unready for more than an hour
-|`KubeVersionMismatch`|`warning`|There are _X_ different versions of Kubernetes components running.
-|`KubeClientErrors`|`warning`|Kubernetes API server client '_Job/Instance_' is experiencing _X_% errors.
-|`KubeClientErrors`|`warning`|Kubernetes API server client '_Job/Instance_' is experiencing _X_ errors / sec.
-|`KubeletTooManyPods`|`warning`|Kubelet _Instance_ is running _X_ pods, close to the limit of 110.
-|`KubeAPILatencyHigh`|`warning`|The API server has a 99th percentile latency of _X_ seconds for _Verb_ _Resource_.
-|`KubeAPILatencyHigh`|`critical`|The API server has a 99th percentile latency of _X_ seconds for _Verb_ _Resource_.
-|`KubeAPIErrorsHigh`|`critical`|API server is erroring for _X_% of requests.
-|`KubeAPIErrorsHigh`|`warning`|API server is erroring for _X_% of requests.
-|`KubeClientCertificateExpiration`|`warning`|Kubernetes API certificate is expiring in less than 7 days.
-|`KubeClientCertificateExpiration`|`critical`|Kubernetes API certificate is expiring in less than 1 day.
-|`AlertmanagerConfigInconsistent`|`critical`|Summary: Configuration out of sync. Description: The configuration of the instances of the Alertmanager cluster `_Service_` are out of sync.
-|`AlertmanagerFailedReload`|`warning`|Summary: Alertmanager's configuration reload failed. Description: Reloading Alertmanager's configuration has failed for _Namespace/Pod_.
-|`TargetDown`|`warning`|Summary: Targets are down. Description: _X_% of _Job_ targets are down.
-|`DeadMansSwitch`|`none`|Summary: Alerting DeadMansSwitch. Description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
-|`NodeDiskRunningFull`|`warning`|Device _Device_ of node-exporter _Namespace/Pod_ is running full within the next 24 hours.
-|`NodeDiskRunningFull`|`critical`|Device _Device_ of node-exporter _Namespace/Pod_ is running full within the next 2 hours.
-|`PrometheusConfigReloadFailed`|`warning`|Summary: Reloading Prometheus' configuration failed. Description: Reloading Prometheus' configuration has failed for _Namespace/Pod_
-|`PrometheusNotificationQueueRunningFull`|`warning`|Summary: Prometheus' alert notification queue is running full. Description: Prometheus' alert notification queue is running full for _Namespace/Pod_
-|`PrometheusErrorSendingAlerts`|`warning`|Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus _Namespace/Pod_ to Alertmanager _Alertmanager_
-|`PrometheusErrorSendingAlerts`|`critical`|Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus _Namespace/Pod_ to Alertmanager _Alertmanager_
-|`PrometheusNotConnectedToAlertmanagers`|`warning`|Summary: Prometheus is not connected to any Alertmanagers. Description: Prometheus _Namespace/Pod_ is not connected to any Alertmanagers
-|`PrometheusTSDBReloadsFailing`|`warning`|Summary: Prometheus has issues reloading data blocks from disk. Description: _Job_ at _Instance_ had _X_ reload failures over the last four hours.
-|`PrometheusTSDBCompactionsFailing`|`warning`|Summary: Prometheus has issues compacting sample blocks. Description: _Job_ at _Instance_ had _X_ compaction failures over the last four hours.
-|`PrometheusTSDBWALCorruptions`|`warning`|Summary: Prometheus write-ahead log is corrupted. Description: _Job_ at _Instance_ has a corrupted write-ahead log (WAL).
-|`PrometheusNotIngestingSamples`|`warning`|Summary: Prometheus isn't ingesting samples. Description: Prometheus _Namespace/Pod_ isn't ingesting samples.
-|`PrometheusTargetScrapesDuplicate`|`warning`|Summary: Prometheus has many samples rejected. Description: _Namespace/Pod_ has many samples rejected due to duplicate timestamps but different values
-|`EtcdInsufficientMembers`|`critical`|Etcd cluster "_Job_": insufficient members (_X_).
-|`EtcdNoLeader`|`critical`|Etcd cluster "_Job_": member _Instance_ has no leader.
-|`EtcdHighNumberOfLeaderChanges`|`warning`|Etcd cluster "_Job_": instance _Instance_ has seen _X_ leader changes within the last hour.
-|`EtcdHighNumberOfFailedGRPCRequests`|`warning`|Etcd cluster "_Job_": _X_% of requests for _GRPC_Method_ failed on etcd instance _Instance_.
-|`EtcdHighNumberOfFailedGRPCRequests`|`critical`|Etcd cluster "_Job_": _X_% of requests for _GRPC_Method_ failed on etcd instance _Instance_.
-|`EtcdGRPCRequestsSlow`|`critical`|Etcd cluster "_Job_": gRPC requests to _GRPC_Method_ are taking _X_s on etcd instance _Instance_.
-|`EtcdMemberCommunicationSlow`|`warning`|Etcd cluster "_Job_": member communication with _To_ is taking _X_s on etcd instance _Instance_.
-|`EtcdHighNumberOfFailedProposals`|`warning`|Etcd cluster "_Job_": _X_ proposal failures within the last hour on etcd instance _Instance_.
-|`EtcdHighFsyncDurations`|`warning`|Etcd cluster "_Job_": 99th percentile fsync durations are _X_s on etcd instance _Instance_.
-|`EtcdHighCommitDurations`|`warning`|Etcd cluster "_Job_": 99th percentile commit durations _X_s on etcd instance _Instance_.
-|`FdExhaustionClose`|`warning`|_Job_ instance _Instance_ will exhaust its file descriptors soon
-|`FdExhaustionClose`|`critical`|_Job_ instance _Instance_ will exhaust its file descriptors soon
-|===
+.Additional resources

+* See the link:https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/user-guides/default-alerts.md[default alerts table].