
Commit 590a04e

Merge pull request #14149 from rh-max/enterprise-4.0-monitoring-for-hpa
Monitoring documentation for enterprise v4
2 parents d80ef1b + cd20b05 commit 590a04e

File tree

45 files changed, +1188 -194 lines


_topic_map.yml

Lines changed: 7 additions & 7 deletions
@@ -499,10 +499,10 @@ Topics:
 - Name: Administrator CLI commands
   File: administrator-cli-commands
   Distros: openshift-enterprise,openshift-origin,openshift-dedicated
-#---
-#Name: Monitoring
-#Dir: monitoring
-#Distros: openshift-*
-#Topics:
-#- Name: Prometheus cluster monitoring
-#  File: prometheus-cluster-monitoring
+---
+Name: Monitoring
+Dir: monitoring
+Distros: openshift-*
+Topics:
+- Name: Monitoring
+  File: monitoring

images/alert-overview.png (62 KB)

images/alerting-rule-overview.png (62.2 KB)

images/alerts-screen.png (108 KB)

images/create-silence.png (41.4 KB)

images/silence-overview.png (64.4 KB)

images/silences-screen.png (74.7 KB)

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
// Module included in the following assemblies:
//
// * monitoring/monitoring.adoc

[id='monitoring-about-cluster-monitoring-{context}']
= About cluster monitoring

{product-title} includes a pre-configured, pre-installed, and self-updating monitoring stack that is based on the link:https://prometheus.io/[Prometheus] open source project and its wider ecosystem. It provides monitoring of cluster components and ships with a set of alerts that immediately notify the cluster administrator about any occurring problems, as well as a set of link:https://grafana.com/[Grafana] dashboards.

The monitoring stack includes these components:

* Cluster Monitoring Operator
* Prometheus Operator
* Prometheus
* Prometheus Adapter
* Alertmanager
* `kube-state-metrics`
* `node-exporter`
* Grafana
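To see these components running in a cluster, you can list the pods in the monitoring namespace. This is a minimal illustration; `openshift-monitoring` is the namespace used in the procedures below, and the exact pod names and replica counts vary by deployment:

$ oc -n openshift-monitoring get pods   # pod names vary by deployment

Typical entries include `cluster-monitoring-operator`, `prometheus-operator`, `prometheus-k8s`, `prometheus-adapter`, `alertmanager-main`, `kube-state-metrics`, `node-exporter`, and `grafana` pods.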
The {product-title} Cluster Monitoring Operator (CMO) is the central component of the stack. It controls the deployed monitoring components and resources and ensures that they are always up to date.

The Prometheus Operator (PO) creates, configures, and manages Prometheus and Alertmanager instances. It also automatically generates monitoring target configurations based on familiar Kubernetes label queries.

The Prometheus Adapter exposes the cluster resource metrics API for horizontal pod autoscaling. Resource metrics are CPU and memory utilization.
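Because the adapter serves CPU and memory utilization through the resource metrics API, a horizontal pod autoscaler can act on those values directly. A minimal sketch, assuming a deployment named `frontend` exists in the current project (the name is hypothetical):

$ oc autoscale deployment/frontend --min=2 --max=10 --cpu-percent=75   # "frontend" is a hypothetical deployment name

This creates a HorizontalPodAutoscaler that keeps between 2 and 10 replicas of `frontend`, scaling on average CPU utilization around 75%.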
`node-exporter` is an agent deployed on every node to collect metrics about it.

The `kube-state-metrics` exporter agent converts Kubernetes objects to metrics that Prometheus can use.

All the components of the monitoring stack are monitored by the stack and are automatically updated when {product-title} is updated.

In addition to the components of the stack itself, the monitoring stack monitors:

* `cluster-version-operator`
* `image-registry`
* `kube-apiserver`
* `kube-apiserver-operator`
* `kube-controller-manager`
* `kube-controller-manager-operator`
* `kube-scheduler`
* `kubelet`
* `monitor-sdn`
* `openshift-apiserver`
* `openshift-apiserver-operator`
* `openshift-controller-manager`
* `openshift-controller-manager-operator`
* `openshift-svcat-controller-manager-operator`
* `telemeter-client`

Other {product-title} framework components might expose metrics as well. For details, see their respective documentation.
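Because the Prometheus Operator generates target configurations from label queries, one way to inspect which targets the stack scrapes is to list its ServiceMonitor objects. A sketch, assuming the default `openshift-monitoring` namespace; the returned names vary by cluster version:

$ oc -n openshift-monitoring get servicemonitors   # names vary by cluster version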
[NOTE]
====
To ensure compatibility with future {product-title} updates, configuring only the specified monitoring stack options is supported.
====

.Additional resources

For more information about the {product-title} Cluster Monitoring Operator, see the link:https://github.com/openshift/cluster-monitoring-operator[Cluster Monitoring Operator] GitHub project.

modules/monitoring-accessing-prometheus-alertmanager-grafana.adoc

Lines changed: 2 additions & 4 deletions
@@ -1,19 +1,17 @@
 // Module included in the following assemblies:
 //
-// * monitoring/installing-monitoring-stack.adoc
+// * monitoring/configuring-the-monitoring-stack.adoc

 [id="accessing-prometheus-alertmanager-and-grafana-{context}"]
 = Accessing Prometheus, Alertmanager, and Grafana

-{product-title} Monitoring ships with a Prometheus instance for cluster monitoring and a central Alertmanager cluster. In addition to Prometheus and Alertmanager, {product-title} Monitoring also includes a https://grafana.com/[Grafana] instance as well as pre-built dashboards for cluster monitoring troubleshooting.
-
 You can get the addresses for accessing Prometheus, Alertmanager, and Grafana web UIs.

 .Procedure

 * Run:
 +
-[subs="quotes"]
+[subs=quotes]
 $ oc -n openshift-monitoring get routes
 NAME HOST/PORT ...
 alertmanager-main alertmanager-main-openshift-monitoring.apps._url_.openshift.com ...
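As a usage sketch, you can also extract a single host name from a route and open it in a browser over HTTPS, authenticating with your {product-title} credentials. The route name `grafana` is an assumption based on the default deployment; substitute whichever route you need from the listing above:

$ oc -n openshift-monitoring get route grafana -o jsonpath='{.spec.host}'   # route name assumed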
Lines changed: 10 additions & 78 deletions
@@ -1,87 +1,19 @@
 // Module included in the following assemblies:
 //
-// * monitoring/installing-monitoring-stack.adoc
+// * monitoring/configuring-the-monitoring-stack.adoc

 [id="alerting-rules-{context}"]
-== Alerting rules
+= Alerting rules

-{product-title} Cluster Monitoring ships with the following alerting rules configured by default. Currently you cannot add custom alerting rules.
+{product-title} Cluster Monitoring by default ships with a set of pre-defined alerting rules.

-Some alerting rules have identical names. This is intentional. They are alerting about the same event with different thresholds, with different severity, or both. With the inhibition rules, the lower severity is inhibited when the higher severity is firing.
+Note that:

-For more details on the alerting rules, see the link:https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml[configuration file].
+* The default alerting rules are used specifically for the {product-title} cluster and nothing else. For example, you get alerts for a persistent volume in the cluster, but you do not get them for a persistent volume in your custom namespace.
+* Currently you cannot add custom alerting rules.
+* Some alerting rules have identical names. This is intentional. They are sending alerts about the same event with different thresholds, with different severity, or both.
+* With the inhibition rules, the lower severity is inhibited when the higher severity is firing.

-[options="header"]
-|===
-|Alert|Severity|Description
-|`ClusterMonitoringOperatorErrors`|`critical`|Cluster Monitoring Operator is experiencing _X_% errors.
-|`AlertmanagerDown`|`critical`|Alertmanager has disappeared from Prometheus target discovery.
-|`ClusterMonitoringOperatorDown`|`critical`|ClusterMonitoringOperator has disappeared from Prometheus target discovery.
-|`KubeAPIDown`|`critical`|KubeAPI has disappeared from Prometheus target discovery.
-|`KubeControllerManagerDown`|`critical`|KubeControllerManager has disappeared from Prometheus target discovery.
-|`KubeSchedulerDown`|`critical`|KubeScheduler has disappeared from Prometheus target discovery.
-|`KubeStateMetricsDown`|`critical`|KubeStateMetrics has disappeared from Prometheus target discovery.
-|`KubeletDown`|`critical`|Kubelet has disappeared from Prometheus target discovery.
-|`NodeExporterDown`|`critical`|NodeExporter has disappeared from Prometheus target discovery.
-|`PrometheusDown`|`critical`|Prometheus has disappeared from Prometheus target discovery.
-|`PrometheusOperatorDown`|`critical`|PrometheusOperator has disappeared from Prometheus target discovery.
-|`KubePodCrashLooping`|`critical`|_Namespace/Pod_ (_Container_) is restarting _times_ / second
-|`KubePodNotReady`|`critical`|_Namespace/Pod_ is not ready.
-|`KubeDeploymentGenerationMismatch`|`critical`|Deployment _Namespace/Deployment_ generation mismatch
-|`KubeDeploymentReplicasMismatch`|`critical`|Deployment _Namespace/Deployment_ replica mismatch
-|`KubeStatefulSetReplicasMismatch`|`critical`|StatefulSet _Namespace/StatefulSet_ replica mismatch
-|`KubeStatefulSetGenerationMismatch`|`critical`|StatefulSet _Namespace/StatefulSet_ generation mismatch
-|`KubeDaemonSetRolloutStuck`|`critical`|Only _X_% of desired pods scheduled and ready for daemon set _Namespace/DaemonSet_
-|`KubeDaemonSetNotScheduled`|`warning`|A number of pods of daemonset _Namespace/DaemonSet_ are not scheduled.
-|`KubeDaemonSetMisScheduled`|`warning`|A number of pods of daemonset _Namespace/DaemonSet_ are running where they are not supposed to run.
-|`KubeCronJobRunning`|`warning`|CronJob _Namespace/CronJob_ is taking more than 1h to complete.
-|`KubeJobCompletion`|`warning`|Job _Namespaces/Job_ is taking more than 1h to complete.
-|`KubeJobFailed`|`warning`|Job _Namespaces/Job_ failed to complete.
-|`KubeCPUOvercommit`|`warning`|Overcommited CPU resource requests on Pods, cannot tolerate node failure.
-|`KubeMemOvercommit`|`warning`|Overcommited Memory resource requests on Pods, cannot tolerate node failure.
-|`KubeCPUOvercommit`|`warning`|Overcommited CPU resource request quota on Namespaces.
-|`KubeMemOvercommit`|`warning`|Overcommited Memory resource request quota on Namespaces.
-|`alerKubeQuotaExceeded`|`warning`|_X_% usage of _Resource_ in namespace _Namespace_.
-|`KubePersistentVolumeUsageCritical`|`critical`|The persistent volume claimed by _PersistentVolumeClaim_ in namespace _Namespace_ has _X_% free.
-|`KubePersistentVolumeFullInFourDays`|`critical`|Based on recent sampling, the persistent volume claimed by _PersistentVolumeClaim_ in namespace _Namespace_ is expected to fill up within four days. Currently _X_ bytes are available.
-|`KubeNodeNotReady`|`warning`|_Node_ has been unready for more than an hour
-|`KubeVersionMismatch`|`warning`|There are _X_ different versions of Kubernetes components running.
-|`KubeClientErrors`|`warning`|Kubernetes API server client '_Job/Instance_' is experiencing _X_% errors.'
-|`KubeClientErrors`|`warning`|Kubernetes API server client '_Job/Instance_' is experiencing _X_ errors / sec.'
-|`KubeletTooManyPods`|`warning`|Kubelet _Instance_ is running _X_ pods, close to the limit of 110.
-|`KubeAPILatencyHigh`|`warning`|The API server has a 99th percentile latency of _X_ seconds for _Verb_ _Resource_.
-|`KubeAPILatencyHigh`|`critical`|The API server has a 99th percentile latency of _X_ seconds for _Verb_ _Resource_.
-|`KubeAPIErrorsHigh`|`critical`|API server is erroring for _X_% of requests.
-|`KubeAPIErrorsHigh`|`warning`|API server is erroring for _X_% of requests.
-|`KubeClientCertificateExpiration`|`warning`|Kubernetes API certificate is expiring in less than 7 days.
-|`KubeClientCertificateExpiration`|`critical`|Kubernetes API certificate is expiring in less than 1 day.
-|`AlertmanagerConfigInconsistent`|`critical`|Summary: Configuration out of sync. Description: The configuration of the instances of the Alertmanager cluster `_Service_` are out of sync.
-|`AlertmanagerFailedReload`|`warning`|Summary: Alertmanager's configuration reload failed. Description: Reloading Alertmanager's configuration has failed for _Namespace/Pod_.
-|`TargetDown`|`warning`|Summary: Targets are down. Description: _X_% of _Job_ targets are down.
-|`DeadMansSwitch`|`none`|Summary: Alerting DeadMansSwitch. Description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
-|`NodeDiskRunningFull`|`warning`|Device _Device_ of node-exporter _Namespace/Pod_ is running full within the next 24 hours.
-|`NodeDiskRunningFull`|`critical`|Device _Device_ of node-exporter _Namespace/Pod_ is running full within the next 2 hours.
-|`PrometheusConfigReloadFailed`|`warning`|Summary: Reloading Prometheus' configuration failed. Description: Reloading Prometheus' configuration has failed for _Namespace/Pod_
-|`PrometheusNotificationQueueRunningFull`|`warning`|Summary: Prometheus' alert notification queue is running full. Description: Prometheus' alert notification queue is running full for _Namespace/Pod_
-|`PrometheusErrorSendingAlerts`|`warning`|Summary: Errors while sending alert from Prometheus. Description: Errors while sending alerts from Prometheus _Namespace/Pod_ to Alertmanager _Alertmanager_
-|`PrometheusErrorSendingAlerts`|`critical`|Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus _Namespace/Pod_ to Alertmanager _Alertmanager_
-|`PrometheusNotConnectedToAlertmanagers`|`warning`|Summary: Prometheus is not connected to any Alertmanagers. Description: Prometheus _Namespace/Pod_ is not connected to any Alertmanagers
-|`PrometheusTSDBReloadsFailing`|`warning`|Summary: Prometheus has issues reloading data blocks from disk. Description: _Job_ at _Instance_ had _X_ reload failures over the last four hours.
-|`PrometheusTSDBCompactionsFailing`|`warning`|Summary: Prometheus has issues compacting sample blocks. Description: _Job_ at _Instance_ had _X_ compaction failures over the last four hours.
-|`PrometheusTSDBWALCorruptions`|`warning`|Summary: Prometheus write-ahead log is corrupted. Description: _Job_ at _Instance_ has a corrupted write-ahead log (WAL).
-|`PrometheusNotIngestingSamples`|`warning`|Summary: Prometheus isn't ingesting samples. Description: Prometheus _Namespace/Pod_ isn't ingesting samples.
-|`PrometheusTargetScrapesDuplicate`|`warning`|Summary: Prometheus has many samples rejected. Description: _Namespace/Pod_ has many samples rejected due to duplicate timestamps but different values
-|`EtcdInsufficientMembers`|`critical`|Etcd cluster "_Job_": insufficient members (_X_).
-|`EtcdNoLeader`|`critical`|Etcd cluster "_Job_": member _Instance_ has no leader.
-|`EtcdHighNumberOfLeaderChanges`|`warning`|Etcd cluster "_Job_": instance _Instance_ has seen _X_ leader changes within the last hour.
-|`EtcdHighNumberOfFailedGRPCRequests`|`warning`|Etcd cluster "_Job_": _X_% of requests for _GRPC_Method_ failed on etcd instance _Instance_.
-|`EtcdHighNumberOfFailedGRPCRequests`|`critical`|Etcd cluster "_Job_": _X_% of requests for _GRPC_Method_ failed on etcd instance _Instance_.
-|`EtcdGRPCRequestsSlow`|`critical`|Etcd cluster "_Job_": gRPC requests to _GRPC_Method_ are taking _X_s on etcd instance _Instance_.
-|`EtcdMemberCommunicationSlow`|`warning`|Etcd cluster "_Job_": member communication with _To_ is taking _X_s on etcd instance _Instance_.
-|`EtcdHighNumberOfFailedProposals`|`warning`|Etcd cluster "_Job_": _X_ proposal failures within the last hour on etcd instance _Instance_.
-|`EtcdHighFsyncDurations`|`warning`|Etcd cluster "_Job_": 99th percentile fync durations are _X_s on etcd instance _Instance_.
-|`EtcdHighCommitDurations`|`warning`|Etcd cluster "_Job_": 99th percentile commit durations _X_s on etcd instance _Instance_.
-|`FdExhaustionClose`|`warning`|_Job_ instance _Instance_ will exhaust its file descriptors soon
-|`FdExhaustionClose`|`critical`|_Job_ instance _Instance_ will exhaust its file descriptors soon
-|===
+.Additional resources

+* See the link:https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/user-guides/default-alerts.md[default alerts table].
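To inspect the pre-defined rules that are actually loaded in a cluster, you can read the Prometheus Operator's `PrometheusRule` objects. A sketch, assuming the default `openshift-monitoring` namespace; the object name `prometheus-k8s-rules` is the typical default and may differ in your cluster:

$ oc -n openshift-monitoring get prometheusrules
$ oc -n openshift-monitoring get prometheusrules prometheus-k8s-rules -o yaml   # object name may differ

The second command prints each alerting rule's expression, duration, and severity label.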
