Commit 0e3e428

Merge pull request #86262 from slovern/TELCODOCS-2002
TELCODOCS-2002 - Telco Day 2 Operations - Observability - ContentX improvements
2 parents: 5153ad3 + 49ef6b9

11 files changed: +621 −0

_topic_maps/_topic_map.yml (5 additions, 0 deletions)

@@ -3462,6 +3462,11 @@ Topics:
     File: telco-troubleshooting-mco
   - Name: Bare-metal node maintenance
     File: telco-troubleshooting-bmn-maintenance
+  - Name: Observability
+    Dir: observability
+    Topics:
+    - Name: Observability in OpenShift Container Platform
+      File: telco-observability
 ---
 Name: Specialized hardware and driver enablement
 Dir: hardware_enablement
Four new files (1 line each):

../../../_attributes/
../../../images/
../../../modules/
../../../snippets/
New file (52 additions):

:_mod-docs-content-type: ASSEMBLY
[id="telco-observability"]
= Observability in {product-title}
include::_attributes/common-attributes.adoc[]
:context: telco-observability
:imagesdir: images

toc::[]

{product-title} generates a large amount of data, such as performance metrics and logs from both the platform and the workloads running on it.
As an administrator, you can use various tools to collect and analyze all the data available.
The following sections outline best practices for system engineers, architects, and administrators configuring the observability stack.

Unless explicitly stated, the material in this document refers to both Edge and Core deployments.

include::modules/telco-observability-monitoring-stack.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../observability/monitoring/monitoring-overview.adoc#understanding-the-monitoring-stack_monitoring-overview[Understanding the monitoring stack]

* xref:../../../observability/monitoring/configuring-the-monitoring-stack.adoc#configuring-the-monitoring-stack[Configuring the monitoring stack]

include::modules/telco-observability-key-performance-metrics.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../observability/monitoring/managing-metrics.adoc#managing-metrics[Managing metrics]
* xref:../../../storage/persistent_storage/persistent_storage_local/persistent-storage-local.adoc#local-storage-install_persistent-storage-local[Persistent storage using local volumes]
* xref:../../../scalability_and_performance/telco_ref_design_specs/ran/telco-ran-ref-du-crs.adoc#cluster-tuning-crs_ran-ref-design-crs[Cluster tuning reference CRs]

include::modules/telco-observability-monitoring-the-edge.adoc[leveloffset=+1]

include::modules/telco-observability-alerting.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../observability/monitoring/managing-alerts.adoc#managing-alerts[Managing alerts]

include::modules/telco-observability-workload-monitoring.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../rest_api/monitoring_apis/servicemonitor-monitoring-coreos-com-v1.adoc#servicemonitor-monitoring-coreos-com-v1[ServiceMonitor[monitoring.coreos.com/v1]]

* xref:../../../observability/monitoring/enabling-monitoring-for-user-defined-projects.adoc#enabling-monitoring-for-user-defined-projects[Enabling monitoring for user-defined projects]

* xref:../../../observability/monitoring/managing-alerts.adoc#managing-alerting-rules-for-user-defined-projects_managing-alerts[Managing alerting rules for user-defined projects]
New file (53 additions):

// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc

:_mod-docs-content-type: PROCEDURE
[id="telco-observability-alerting_{context}"]
= Alerting

{product-title} includes a large number of alert rules, which can change from release to release.

[id="viewing-default-alerts"]
== Viewing default alerts

Use the following procedure to review all of the alert rules in a cluster.

.Procedure

* To review all the alert rules in a cluster, run the following command:
+
[source,terminal]
----
$ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml
----
+
Rules can include a description and provide a link to additional information and mitigation steps.
For example, this is the rule for `etcdHighFsyncDurations`:
+
[source,yaml]
----
- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations
      are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  expr: |
    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
    > 1
  for: 10m
  labels:
    severity: critical
----

[id="alert-notifications"]
== Alert notifications

You can view alerts in the {product-title} console; however, an administrator should configure an external receiver to forward the alerts to.
{product-title} supports the following receiver types:

* PagerDuty: a third-party incident response platform
* Webhook: an arbitrary API endpoint that receives an alert through a POST request and can take any necessary action
* Email: sends an email to a designated address
* Slack: sends a notification to either a Slack channel or an individual user
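As an illustration of the receiver types listed above, the following is a minimal sketch of an Alertmanager configuration that routes critical alerts to a webhook receiver. The endpoint URL, receiver names, and timing values are illustrative placeholders, not values from this PR:

```yaml
# Sketch of an Alertmanager configuration with a webhook receiver.
# All names, URLs, and intervals below are hypothetical examples.
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default
  routes:
  - matchers:
    - severity = critical
    receiver: incident-webhook
receivers:
- name: default
- name: incident-webhook
  webhook_configs:
  # Alertmanager POSTs a JSON payload of grouped alerts to this endpoint.
  - url: 'https://incident-handler.example.com/api/v1/alerts'
    send_resolved: true
```

In practice, cluster administrators supply this configuration through the `alertmanager-main` secret in the `openshift-monitoring` namespace; see "Managing alerts" in the additional resources.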
New file (143 additions):

// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc

:_mod-docs-content-type: CONCEPT
[id="telco-observability-key-performance-metrics_{context}"]
= Key performance metrics

Depending on your system, there can be hundreds of available measurements.

Here are some key metrics that you should pay attention to:

* `etcd` response times
* API response times
* Pod restarts and scheduling
* Resource usage
* OVN health
* Overall cluster operator health

A good rule to follow is that if you decide that a metric is important, there should be an alert for it.

[NOTE]
====
You can check the available metrics by running the following command:
[source,terminal]
----
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk http://localhost:9090/api/v1/metadata | jq '.data'
----
====
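Following the rule above that every important metric deserves an alert, the following is a hedged sketch of a user-defined `PrometheusRule` that alerts on frequent pod restarts, one of the key metrics listed. The rule name, namespace, and thresholds are hypothetical examples, not values from this PR:

```yaml
# Sketch of a user-defined alerting rule (PrometheusRule API from
# monitoring.coreos.com/v1). Name, namespace, and threshold are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts        # hypothetical name
  namespace: my-workload-ns       # hypothetical user project
spec:
  groups:
  - name: workload.rules
    rules:
    - alert: PodRestartingTooOften
      # Fires when a container restarts more than 3 times in an hour.
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Pod {{ $labels.pod }} is restarting frequently.
```

Rules like this only take effect for user projects when monitoring for user-defined projects is enabled; see the additional resources in the parent assembly.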
[id="example-queries-promql"]
== Example queries in PromQL

The following tables show some queries that you can explore in the metrics query browser using the {product-title} console.

[NOTE]
====
The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser.
You can get the OpenShift console FQDN by running the following command:
[source,terminal]
----
$ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].host}'
----
====

.Node memory & CPU usage
[options="header"]
|===
|Metric|Query

|CPU % requests by node
|`sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100`

|Overall cluster CPU % utilization
|`sum by (managed_cluster) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (managed_cluster) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100`

|Memory % requests by node
|`sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="memory"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="memory"}[60m])) *100`

|Overall cluster memory % utilization
|`(1-(sum by (managed_cluster)(avg_over_time((node_memory_MemAvailable_bytes[60m])) ))/sum by (managed_cluster)(avg_over_time(kube_node_status_allocatable{resource="memory"}[60m])))*100`

|===
.API latency by verb
[options="header"]
|===
|Metric|Query

|`GET`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="GET"}[60m])))`

|`PATCH`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="PATCH"}[60m])))`

|`POST`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="POST"}[60m])))`

|`LIST`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="LIST"}[60m])))`

|`PUT`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="PUT"}[60m])))`

|`DELETE`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="DELETE"}[60m])))`

|Combined
|`histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"(openshift-apiserver\|kube-apiserver)", verb!="WATCH"}[60m])))`

|===
.`etcd`
[options="header"]
|===
|Metric|Query

|`fsync` 99th percentile latency (per instance)
|`histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))`

|`fsync` 99th percentile latency (per cluster)
|`sum by (managed_cluster) ( histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[60m])))`

|Leader elections
|`sum(rate(etcd_server_leader_changes_seen_total[1440m]))`

|Network latency
|`histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))`

|===

.Operator health
[options="header"]
|===
|Metric|Query

|Degraded operators
|`sum by (managed_cluster, name) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))`

|Total degraded operators per cluster
|`sum by (managed_cluster) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))`

|===
[id="recommendations-for-storage-of-metrics"]
== Recommendations for storage of metrics

Out of the box, Prometheus does not back up saved metrics with persistent storage.
If you restart the Prometheus pods, all metrics data is lost.
You should configure the monitoring stack to use the back-end storage that is available on the platform.
To meet the high IO demands of Prometheus, you should use local storage.

For Telco core clusters, you can use the Local Storage Operator for persistent storage for Prometheus.

{odf-first}, which deploys a Ceph cluster for block, file, and object storage, is also a suitable candidate for a Telco core cluster.

To keep system resource requirements low on a RAN {sno} or far edge cluster, you should not provision back-end storage for the monitoring stack.
Such clusters forward all metrics to the hub cluster, where you can provision a third-party monitoring platform.
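To make the storage recommendation above concrete, the following is a sketch of the `cluster-monitoring-config` ConfigMap that attaches a persistent volume claim to Prometheus. The storage class name, retention period, and size are placeholder assumptions; substitute a class backed by the Local Storage Operator or {odf-first} as appropriate:

```yaml
# Sketch: persistent storage for Prometheus via the cluster monitoring
# ConfigMap. storageClassName, retention, and size are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage   # hypothetical class name
          resources:
            requests:
              storage: 100Gi
```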
New file (21 additions):

// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc

:_mod-docs-content-type: CONCEPT
[id="telco-observability-monitoring-stack_{context}"]
= Understanding the monitoring stack

The monitoring stack uses the following components:

* Prometheus collects and analyzes metrics from {product-title} components and from workloads, if configured to do so.
* Alertmanager is a component of Prometheus that handles routing, grouping, and silencing of alerts.
* Thanos handles long-term storage of metrics.

.{product-title} monitoring architecture
image::monitoring-architecture.png[{product-title} monitoring architecture]

[NOTE]
====
For a {sno} cluster, you should disable Alertmanager and Thanos because the cluster sends all metrics to the hub cluster for analysis and retention.
====
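One way to reduce the monitoring footprint on a {sno} cluster, sketched below under the assumption that the `cluster-monitoring-config` ConfigMap interface supports these keys in your release, is to disable Alertmanager and the Telemeter client and shorten local retention. This is an illustrative fragment, not the reference configuration from this PR:

```yaml
# Sketch: reduced monitoring footprint for single-node/far-edge clusters.
# Assumes alertmanagerMain.enabled and telemeterClient.enabled are
# supported in your release; verify against the product documentation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enabled: false
    telemeterClient:
      enabled: false
    prometheusK8s:
      retention: 24h   # short local retention; metrics forward to the hub
```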
