
Commit 590a04e

Merge pull request #14149 from rh-max/enterprise-4.0-monitoring-for-hpa
Monitoring documentation for enterprise v4
2 parents d80ef1b + cd20b05 commit 590a04e

File tree

45 files changed, +1188 -194 lines


_topic_map.yml

Lines changed: 7 additions & 7 deletions
@@ -499,10 +499,10 @@ Topics:
 - Name: Administrator CLI commands
   File: administrator-cli-commands
   Distros: openshift-enterprise,openshift-origin,openshift-dedicated
-#---
-#Name: Monitoring
-#Dir: monitoring
-#Distros: openshift-*
-#Topics:
-#- Name: Prometheus cluster monitoring
-#  File: prometheus-cluster-monitoring
+---
+Name: Monitoring
+Dir: monitoring
+Distros: openshift-*
+Topics:
+- Name: Monitoring
+  File: monitoring

images/alert-overview.png (62 KB)

images/alerting-rule-overview.png (62.2 KB)

images/alerts-screen.png (108 KB)

images/create-silence.png (41.4 KB)

images/silence-overview.png (64.4 KB)

images/silences-screen.png (74.7 KB)

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
// Module included in the following assemblies:
//
// * monitoring/monitoring.adoc

[id='monitoring-about-cluster-monitoring-{context}']
= About cluster monitoring

{product-title} includes a pre-configured, pre-installed, and self-updating monitoring stack that is based on the link:https://prometheus.io/[Prometheus] open source project and its wider ecosystem. It provides monitoring of cluster components and ships with a set of alerts that immediately notify the cluster administrator about any occurring problems, as well as a set of link:https://grafana.com/[Grafana] dashboards.

The monitoring stack includes these components:

* Cluster Monitoring Operator
* Prometheus Operator
* Prometheus
* Prometheus Adapter
* Alertmanager
* `kube-state-metrics`
* `node-exporter`
* Grafana
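To see these components running in a cluster, you can list the pods in the monitoring namespace. This is a minimal illustration; `openshift-monitoring` is the namespace used in the procedures below, and the exact pod names and replica counts vary by deployment:

$ oc -n openshift-monitoring get pods   # pod names vary by deployment

Typical entries include `cluster-monitoring-operator`, `prometheus-operator`, `prometheus-k8s`, `prometheus-adapter`, `alertmanager-main`, `kube-state-metrics`, `node-exporter`, and `grafana` pods.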
The {product-title} Cluster Monitoring Operator (CMO) is the central component of the stack. It controls the deployed monitoring components and resources and ensures that they are always up to date.

The Prometheus Operator (PO) creates, configures, and manages Prometheus and Alertmanager instances. It also automatically generates monitoring target configurations based on familiar Kubernetes label queries.

The Prometheus Adapter exposes the cluster resource metrics API for horizontal pod autoscaling. Resource metrics are CPU and memory utilization.
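Because the adapter serves CPU and memory utilization through the resource metrics API, a horizontal pod autoscaler can act on those values directly. A minimal sketch, assuming a deployment named `frontend` exists in the current project (the name is hypothetical):

$ oc autoscale deployment/frontend --min=2 --max=10 --cpu-percent=75   # "frontend" is a hypothetical deployment name

This creates a HorizontalPodAutoscaler that keeps between 2 and 10 replicas of `frontend`, scaling on average CPU utilization around 75%.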
`node-exporter` is an agent deployed on every node to collect metrics about it.

The `kube-state-metrics` exporter agent converts Kubernetes objects to metrics that Prometheus can use.

All the components of the monitoring stack are monitored by the stack and are automatically updated when {product-title} is updated.

In addition to the components of the stack itself, the monitoring stack monitors:

* `cluster-version-operator`
* `image-registry`
* `kube-apiserver`
* `kube-apiserver-operator`
* `kube-controller-manager`
* `kube-controller-manager-operator`
* `kube-scheduler`
* `kubelet`
* `monitor-sdn`
* `openshift-apiserver`
* `openshift-apiserver-operator`
* `openshift-controller-manager`
* `openshift-controller-manager-operator`
* `openshift-svcat-controller-manager-operator`
* `telemeter-client`

Other {product-title} framework components might expose metrics as well. For details, see their respective documentation.
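Because the Prometheus Operator generates target configurations from label queries, one way to inspect which targets the stack scrapes is to list its ServiceMonitor objects. A sketch, assuming the default `openshift-monitoring` namespace; the returned names vary by cluster version:

$ oc -n openshift-monitoring get servicemonitors   # names vary by cluster version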
[NOTE]
====
To ensure compatibility with future {product-title} updates, configuring only the specified monitoring stack options is supported.
====

.Additional resources

For more information about the {product-title} Cluster Monitoring Operator, see the link:https://github.com/openshift/cluster-monitoring-operator[Cluster Monitoring Operator] GitHub project.

modules/monitoring-accessing-prometheus-alertmanager-grafana.adoc

Lines changed: 2 additions & 4 deletions
@@ -1,19 +1,17 @@
 // Module included in the following assemblies:
 //
-// * monitoring/installing-monitoring-stack.adoc
+// * monitoring/configuring-the-monitoring-stack.adoc

 [id="accessing-prometheus-alertmanager-and-grafana-{context}"]
 = Accessing Prometheus, Alertmanager, and Grafana

-{product-title} Monitoring ships with a Prometheus instance for cluster monitoring and a central Alertmanager cluster. In addition to Prometheus and Alertmanager, {product-title} Monitoring also includes a https://grafana.com/[Grafana] instance as well as pre-built dashboards for cluster monitoring troubleshooting.
-
 You can get the addresses for accessing Prometheus, Alertmanager, and Grafana web UIs.

 .Procedure

 * Run:
 +
-[subs="quotes"]
+[subs=quotes]
 $ oc -n openshift-monitoring get routes
 NAME HOST/PORT ...
 alertmanager-main alertmanager-main-openshift-monitoring.apps._url_.openshift.com ...
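As a usage sketch, you can also extract a single host name from a route and open it in a browser over HTTPS, authenticating with your {product-title} credentials. The route name `grafana` is an assumption based on the default deployment; substitute whichever route you need from the listing above:

$ oc -n openshift-monitoring get route grafana -o jsonpath='{.spec.host}'   # route name assumed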
Lines changed: 10 additions & 78 deletions
@@ -1,87 +1,19 @@
 // Module included in the following assemblies:
 //
-// * monitoring/installing-monitoring-stack.adoc
+// * monitoring/configuring-the-monitoring-stack.adoc

 [id="alerting-rules-{context}"]
-== Alerting rules
+= Alerting rules

-{product-title} Cluster Monitoring ships with the following alerting rules configured by default. Currently you cannot add custom alerting rules.
+{product-title} Cluster Monitoring by default ships with a set of pre-defined alerting rules.

-Some alerting rules have identical names. This is intentional. They are alerting about the same event with different thresholds, with different severity, or both. With the inhibition rules, the lower severity is inhibited when the higher severity is firing.
+Note that:

-For more details on the alerting rules, see the link:https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml[configuration file].
+* The default alerting rules are used specifically for the {product-title} cluster and nothing else. For example, you get alerts for a persistent volume in the cluster, but you do not get them for a persistent volume in your custom namespace.
+* Currently you cannot add custom alerting rules.
+* Some alerting rules have identical names. This is intentional. They are sending alerts about the same event with different thresholds, with different severity, or both.
+* With the inhibition rules, the lower severity is inhibited when the higher severity is firing.

-[options="header"]
-|===
-|Alert|Severity|Description
-|`ClusterMonitoringOperatorErrors`|`critical`|Cluster Monitoring Operator is experiencing _X_% errors.
-|`AlertmanagerDown`|`critical`|Alertmanager has disappeared from Prometheus target discovery.
-|`ClusterMonitoringOperatorDown`|`critical`|ClusterMonitoringOperator has disappeared from Prometheus target discovery.
-|`KubeAPIDown`|`critical`|KubeAPI has disappeared from Prometheus target discovery.
-|`KubeControllerManagerDown`|`critical`|KubeControllerManager has disappeared from Prometheus target discovery.
-|`KubeSchedulerDown`|`critical`|KubeScheduler has disappeared from Prometheus target discovery.
-|`KubeStateMetricsDown`|`critical`|KubeStateMetrics has disappeared from Prometheus target discovery.
-|`KubeletDown`|`critical`|Kubelet has disappeared from Prometheus target discovery.
-|`NodeExporterDown`|`critical`|NodeExporter has disappeared from Prometheus target discovery.
-|`PrometheusDown`|`critical`|Prometheus has disappeared from Prometheus target discovery.
-|`PrometheusOperatorDown`|`critical`|PrometheusOperator has disappeared from Prometheus target discovery.
-|`KubePodCrashLooping`|`critical`|_Namespace/Pod_ (_Container_) is restarting _times_ / second
-|`KubePodNotReady`|`critical`|_Namespace/Pod_ is not ready.
-|`KubeDeploymentGenerationMismatch`|`critical`|Deployment _Namespace/Deployment_ generation mismatch
-|`KubeDeploymentReplicasMismatch`|`critical`|Deployment _Namespace/Deployment_ replica mismatch
-|`KubeStatefulSetReplicasMismatch`|`critical`|StatefulSet _Namespace/StatefulSet_ replica mismatch
-|`KubeStatefulSetGenerationMismatch`|`critical`|StatefulSet _Namespace/StatefulSet_ generation mismatch
-|`KubeDaemonSetRolloutStuck`|`critical`|Only _X_% of desired pods scheduled and ready for daemon set _Namespace/DaemonSet_
-|`KubeDaemonSetNotScheduled`|`warning`|A number of pods of daemonset _Namespace/DaemonSet_ are not scheduled.
-|`KubeDaemonSetMisScheduled`|`warning`|A number of pods of daemonset _Namespace/DaemonSet_ are running where they are not supposed to run.
-|`KubeCronJobRunning`|`warning`|CronJob _Namespace/CronJob_ is taking more than 1h to complete.
-|`KubeJobCompletion`|`warning`|Job _Namespaces/Job_ is taking more than 1h to complete.
-|`KubeJobFailed`|`warning`|Job _Namespaces/Job_ failed to complete.
-|`KubeCPUOvercommit`|`warning`|Overcommited CPU resource requests on Pods, cannot tolerate node failure.
-|`KubeMemOvercommit`|`warning`|Overcommited Memory resource requests on Pods, cannot tolerate node failure.
-|`KubeCPUOvercommit`|`warning`|Overcommited CPU resource request quota on Namespaces.
-|`KubeMemOvercommit`|`warning`|Overcommited Memory resource request quota on Namespaces.
-|`alerKubeQuotaExceeded`|`warning`|_X_% usage of _Resource_ in namespace _Namespace_.
-|`KubePersistentVolumeUsageCritical`|`critical`|The persistent volume claimed by _PersistentVolumeClaim_ in namespace _Namespace_ has _X_% free.
-|`KubePersistentVolumeFullInFourDays`|`critical`|Based on recent sampling, the persistent volume claimed by _PersistentVolumeClaim_ in namespace _Namespace_ is expected to fill up within four days. Currently _X_ bytes are available.
-|`KubeNodeNotReady`|`warning`|_Node_ has been unready for more than an hour
-|`KubeVersionMismatch`|`warning`|There are _X_ different versions of Kubernetes components running.
-|`KubeClientErrors`|`warning`|Kubernetes API server client '_Job/Instance_' is experiencing _X_% errors.'
-|`KubeClientErrors`|`warning`|Kubernetes API server client '_Job/Instance_' is experiencing _X_ errors / sec.'
-|`KubeletTooManyPods`|`warning`|Kubelet _Instance_ is running _X_ pods, close to the limit of 110.
-|`KubeAPILatencyHigh`|`warning`|The API server has a 99th percentile latency of _X_ seconds for _Verb_ _Resource_.
-|`KubeAPILatencyHigh`|`critical`|The API server has a 99th percentile latency of _X_ seconds for _Verb_ _Resource_.
-|`KubeAPIErrorsHigh`|`critical`|API server is erroring for _X_% of requests.
-|`KubeAPIErrorsHigh`|`warning`|API server is erroring for _X_% of requests.
-|`KubeClientCertificateExpiration`|`warning`|Kubernetes API certificate is expiring in less than 7 days.
-|`KubeClientCertificateExpiration`|`critical`|Kubernetes API certificate is expiring in less than 1 day.
-|`AlertmanagerConfigInconsistent`|`critical`|Summary: Configuration out of sync. Description: The configuration of the instances of the Alertmanager cluster `_Service_` are out of sync.
-|`AlertmanagerFailedReload`|`warning`|Summary: Alertmanager's configuration reload failed. Description: Reloading Alertmanager's configuration has failed for _Namespace/Pod_.
-|`TargetDown`|`warning`|Summary: Targets are down. Description: _X_% of _Job_ targets are down.
-|`DeadMansSwitch`|`none`|Summary: Alerting DeadMansSwitch. Description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
-|`NodeDiskRunningFull`|`warning`|Device _Device_ of node-exporter _Namespace/Pod_ is running full within the next 24 hours.
-|`NodeDiskRunningFull`|`critical`|Device _Device_ of node-exporter _Namespace/Pod_ is running full within the next 2 hours.
-|`PrometheusConfigReloadFailed`|`warning`|Summary: Reloading Prometheus' configuration failed. Description: Reloading Prometheus' configuration has failed for _Namespace/Pod_
-|`PrometheusNotificationQueueRunningFull`|`warning`|Summary: Prometheus' alert notification queue is running full. Description: Prometheus' alert notification queue is running full for _Namespace/Pod_
-|`PrometheusErrorSendingAlerts`|`warning`|Summary: Errors while sending alert from Prometheus. Description: Errors while sending alerts from Prometheus _Namespace/Pod_ to Alertmanager _Alertmanager_
-|`PrometheusErrorSendingAlerts`|`critical`|Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus _Namespace/Pod_ to Alertmanager _Alertmanager_
-|`PrometheusNotConnectedToAlertmanagers`|`warning`|Summary: Prometheus is not connected to any Alertmanagers. Description: Prometheus _Namespace/Pod_ is not connected to any Alertmanagers
-|`PrometheusTSDBReloadsFailing`|`warning`|Summary: Prometheus has issues reloading data blocks from disk. Description: _Job_ at _Instance_ had _X_ reload failures over the last four hours.
-|`PrometheusTSDBCompactionsFailing`|`warning`|Summary: Prometheus has issues compacting sample blocks. Description: _Job_ at _Instance_ had _X_ compaction failures over the last four hours.
-|`PrometheusTSDBWALCorruptions`|`warning`|Summary: Prometheus write-ahead log is corrupted. Description: _Job_ at _Instance_ has a corrupted write-ahead log (WAL).
-|`PrometheusNotIngestingSamples`|`warning`|Summary: Prometheus isn't ingesting samples. Description: Prometheus _Namespace/Pod_ isn't ingesting samples.
-|`PrometheusTargetScrapesDuplicate`|`warning`|Summary: Prometheus has many samples rejected. Description: _Namespace/Pod_ has many samples rejected due to duplicate timestamps but different values
-|`EtcdInsufficientMembers`|`critical`|Etcd cluster "_Job_": insufficient members (_X_).
-|`EtcdNoLeader`|`critical`|Etcd cluster "_Job_": member _Instance_ has no leader.
-|`EtcdHighNumberOfLeaderChanges`|`warning`|Etcd cluster "_Job_": instance _Instance_ has seen _X_ leader changes within the last hour.
-|`EtcdHighNumberOfFailedGRPCRequests`|`warning`|Etcd cluster "_Job_": _X_% of requests for _GRPC_Method_ failed on etcd instance _Instance_.
-|`EtcdHighNumberOfFailedGRPCRequests`|`critical`|Etcd cluster "_Job_": _X_% of requests for _GRPC_Method_ failed on etcd instance _Instance_.
-|`EtcdGRPCRequestsSlow`|`critical`|Etcd cluster "_Job_": gRPC requests to _GRPC_Method_ are taking _X_s on etcd instance _Instance_.
-|`EtcdMemberCommunicationSlow`|`warning`|Etcd cluster "_Job_": member communication with _To_ is taking _X_s on etcd instance _Instance_.
-|`EtcdHighNumberOfFailedProposals`|`warning`|Etcd cluster "_Job_": _X_ proposal failures within the last hour on etcd instance _Instance_.
-|`EtcdHighFsyncDurations`|`warning`|Etcd cluster "_Job_": 99th percentile fync durations are _X_s on etcd instance _Instance_.
-|`EtcdHighCommitDurations`|`warning`|Etcd cluster "_Job_": 99th percentile commit durations _X_s on etcd instance _Instance_.
-|`FdExhaustionClose`|`warning`|_Job_ instance _Instance_ will exhaust its file descriptors soon
-|`FdExhaustionClose`|`critical`|_Job_ instance _Instance_ will exhaust its file descriptors soon
-|===
+.Additional resources

+* See the link:https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/user-guides/default-alerts.md[default alerts table].
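To inspect the pre-defined rules that are actually loaded in a cluster, you can read the Prometheus Operator's `PrometheusRule` objects. A sketch, assuming the default `openshift-monitoring` namespace; the object name `prometheus-k8s-rules` is the typical default and may differ in your cluster:

$ oc -n openshift-monitoring get prometheusrules
$ oc -n openshift-monitoring get prometheusrules prometheus-k8s-rules -o yaml   # object name may differ

The second command prints each alerting rule's expression, duration, and severity label.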
