Commit 0e3e428

Merge pull request #86262 from slovern/TELCODOCS-2002
TELCODOCS-2002 - Telco Day 2 Operations - Observability - ContentX improvements
2 parents: 5153ad3 + 49ef6b9

11 files changed: +621 −0

_topic_maps/_topic_map.yml (5 additions, 0 deletions)

@@ -3462,6 +3462,11 @@ Topics:
     File: telco-troubleshooting-mco
   - Name: Bare-metal node maintenance
     File: telco-troubleshooting-bmn-maintenance
+  - Name: Observability
+    Dir: observability
+    Topics:
+    - Name: Observability in OpenShift Container Platform
+      File: telco-observability
 ---
 Name: Specialized hardware and driver enablement
 Dir: hardware_enablement
Four new files (1 line each):

../../../_attributes/
../../../images/
../../../modules/
../../../snippets/
New file (52 additions):

:_mod-docs-content-type: ASSEMBLY
[id="telco-observability"]
= Observability in {product-title}
include::_attributes/common-attributes.adoc[]
:context: telco-observability
:imagesdir: images

toc::[]

{product-title} generates a large amount of data, such as performance metrics and logs from both the platform and the workloads running on it.
As an administrator, you can use various tools to collect and analyze all the data available.
The following sections outline best practices for system engineers, architects, and administrators configuring the observability stack.

Unless explicitly stated, the material in this document refers to both Edge and Core deployments.

include::modules/telco-observability-monitoring-stack.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../observability/monitoring/monitoring-overview.adoc#understanding-the-monitoring-stack_monitoring-overview[Understanding the monitoring stack]

* xref:../../../observability/monitoring/configuring-the-monitoring-stack.adoc#configuring-the-monitoring-stack[Configuring the monitoring stack]

include::modules/telco-observability-key-performance-metrics.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../observability/monitoring/managing-metrics.adoc#managing-metrics[Managing metrics]
* xref:../../../storage/persistent_storage/persistent_storage_local/persistent-storage-local.adoc#local-storage-install_persistent-storage-local[Persistent storage using local volumes]
* xref:../../../scalability_and_performance/telco_ref_design_specs/ran/telco-ran-ref-du-crs.adoc#cluster-tuning-crs_ran-ref-design-crs[Cluster tuning reference CRs]

include::modules/telco-observability-monitoring-the-edge.adoc[leveloffset=+1]

include::modules/telco-observability-alerting.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../observability/monitoring/managing-alerts.adoc#managing-alerts[Managing alerts]

include::modules/telco-observability-workload-monitoring.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../rest_api/monitoring_apis/servicemonitor-monitoring-coreos-com-v1.adoc#servicemonitor-monitoring-coreos-com-v1[ServiceMonitor[monitoring.coreos.com/v1]]

* xref:../../../observability/monitoring/enabling-monitoring-for-user-defined-projects.adoc#enabling-monitoring-for-user-defined-projects[Enabling monitoring for user-defined projects]

* xref:../../../observability/monitoring/managing-alerts.adoc#managing-alerting-rules-for-user-defined-projects_managing-alerts[Managing alerting rules for user-defined projects]
New file (53 additions):

// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc

:_mod-docs-content-type: PROCEDURE
[id="telco-observability-alerting_{context}"]
= Alerting

{product-title} includes a large number of alert rules, which can change from release to release.

[id="viewing-default-alerts"]
== Viewing default alerts

Use the following procedure to review all of the alert rules in a cluster.

.Procedure

* To review all the alert rules in a cluster, run the following command:
+
[source,terminal]
----
$ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml
----
+
Rules can include a description and provide a link to additional information and mitigation steps.
For example, this is the rule for `etcdHighFsyncDurations`:
+
[source,yaml]
----
- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations
      are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  expr: |
    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
    > 1
  for: 10m
  labels:
    severity: critical
----

[id="alert-notifications"]
== Alert notifications

You can view alerts in the {product-title} console; however, an administrator should configure an external receiver to forward the alerts to.
{product-title} supports the following receiver types:

* PagerDuty: a third-party incident response platform
* Webhook: an arbitrary API endpoint that receives an alert through a POST request and can take any necessary action
* Email: sends an email to a designated address
* Slack: sends a notification to either a Slack channel or an individual user
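As an illustration of the receiver types listed above, the following is a minimal sketch of an Alertmanager configuration that routes critical alerts to a webhook receiver. The endpoint URL, receiver names, and timing values are illustrative placeholders, not values from this PR:

```yaml
# Sketch of an Alertmanager configuration with a webhook receiver.
# All names, URLs, and intervals below are hypothetical examples.
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default
  routes:
  - matchers:
    - severity = critical
    receiver: incident-webhook
receivers:
- name: default
- name: incident-webhook
  webhook_configs:
  # Alertmanager POSTs a JSON payload of grouped alerts to this endpoint.
  - url: 'https://incident-handler.example.com/api/v1/alerts'
    send_resolved: true
```

In practice, cluster administrators supply this configuration through the `alertmanager-main` secret in the `openshift-monitoring` namespace; see "Managing alerts" in the additional resources.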
New file (143 additions):

// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc

:_mod-docs-content-type: CONCEPT
[id="telco-observability-key-performance-metrics_{context}"]
= Key performance metrics

Depending on your system, there can be hundreds of available measurements.

Here are some key metrics that you should pay attention to:

* `etcd` response times
* API response times
* Pod restarts and scheduling
* Resource usage
* OVN health
* Overall cluster operator health

A good rule to follow is that if you decide that a metric is important, there should be an alert for it.

[NOTE]
====
You can check the available metrics by running the following command:
[source,terminal]
----
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk http://localhost:9090/api/v1/metadata | jq '.data'
----
====
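Following the rule above that every important metric deserves an alert, the following is a hedged sketch of a user-defined `PrometheusRule` that alerts on frequent pod restarts, one of the key metrics listed. The rule name, namespace, and thresholds are hypothetical examples, not values from this PR:

```yaml
# Sketch of a user-defined alerting rule (PrometheusRule API from
# monitoring.coreos.com/v1). Name, namespace, and threshold are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts        # hypothetical name
  namespace: my-workload-ns       # hypothetical user project
spec:
  groups:
  - name: workload.rules
    rules:
    - alert: PodRestartingTooOften
      # Fires when a container restarts more than 3 times in an hour.
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Pod {{ $labels.pod }} is restarting frequently.
```

Rules like this only take effect for user projects when monitoring for user-defined projects is enabled; see the additional resources in the parent assembly.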
[id="example-queries-promql"]
== Example queries in PromQL

The following tables show some queries that you can explore in the metrics query browser using the {product-title} console.

[NOTE]
====
The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser.
You can get the OpenShift console FQDN by running the following command:
[source,terminal]
----
$ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].host}'
----
====

.Node memory & CPU usage
[options="header"]
|===
|Metric|Query

|CPU % requests by node
|`sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100`

|Overall cluster CPU % utilization
|`sum by (managed_cluster) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (managed_cluster) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100`

|Memory % requests by node
|`sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="memory"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="memory"}[60m])) *100`

|Overall cluster memory % utilization
|`(1-(sum by (managed_cluster)(avg_over_time((node_memory_MemAvailable_bytes[60m])) ))/sum by (managed_cluster)(avg_over_time(kube_node_status_allocatable{resource="memory"}[60m])))*100`

|===
.API latency by verb
[options="header"]
|===
|Metric|Query

|`GET`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="GET"}[60m])))`

|`PATCH`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="PATCH"}[60m])))`

|`POST`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="POST"}[60m])))`

|`LIST`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="LIST"}[60m])))`

|`PUT`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="PUT"}[60m])))`

|`DELETE`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="DELETE"}[60m])))`

|Combined
|`histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"(openshift-apiserver\|kube-apiserver)", verb!="WATCH"}[60m])))`

|===
.`etcd`
[options="header"]
|===
|Metric|Query

|`fsync` 99th percentile latency (per instance)
|`histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))`

|`fsync` 99th percentile latency (per cluster)
|`sum by (managed_cluster) ( histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[60m])))`

|Leader elections
|`sum(rate(etcd_server_leader_changes_seen_total[1440m]))`

|Network latency
|`histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))`

|===

.Operator health
[options="header"]
|===
|Metric|Query

|Degraded operators
|`sum by (managed_cluster, name) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))`

|Total degraded operators per cluster
|`sum by (managed_cluster) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))`

|===
[id="recommendations-for-storage-of-metrics"]
== Recommendations for storage of metrics

Out of the box, Prometheus does not back up saved metrics with persistent storage.
If you restart the Prometheus pods, all metrics data is lost.
You should configure the monitoring stack to use the back-end storage that is available on the platform.
To meet the high IO demands of Prometheus, you should use local storage.

For Telco core clusters, you can use the Local Storage Operator for persistent storage for Prometheus.

{odf-first}, which deploys a Ceph cluster for block, file, and object storage, is also a suitable candidate for a Telco core cluster.

To keep system resource requirements low on a RAN {sno} or far edge cluster, you should not provision back-end storage for the monitoring stack.
Such clusters forward all metrics to the hub cluster, where you can provision a third-party monitoring platform.
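To make the storage recommendation above concrete, the following is a sketch of the `cluster-monitoring-config` ConfigMap that attaches a persistent volume claim to Prometheus. The storage class name, retention period, and size are placeholder assumptions; substitute a class backed by the Local Storage Operator or {odf-first} as appropriate:

```yaml
# Sketch: persistent storage for Prometheus via the cluster monitoring
# ConfigMap. storageClassName, retention, and size are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage   # hypothetical class name
          resources:
            requests:
              storage: 100Gi
```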
New file (21 additions):

// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc

:_mod-docs-content-type: CONCEPT
[id="telco-observability-monitoring-stack_{context}"]
= Understanding the monitoring stack

The monitoring stack uses the following components:

* Prometheus collects and analyzes metrics from {product-title} components and from workloads, if configured to do so.
* Alertmanager is a component of Prometheus that handles routing, grouping, and silencing of alerts.
* Thanos handles long-term storage of metrics.

.{product-title} monitoring architecture
image::monitoring-architecture.png[{product-title} monitoring architecture]

[NOTE]
====
For a {sno} cluster, you should disable Alertmanager and Thanos because the cluster sends all metrics to the hub cluster for analysis and retention.
====
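One way to reduce the monitoring footprint on a {sno} cluster, sketched below under the assumption that the `cluster-monitoring-config` ConfigMap interface supports these keys in your release, is to disable Alertmanager and the Telemeter client and shorten local retention. This is an illustrative fragment, not the reference configuration from this PR:

```yaml
# Sketch: reduced monitoring footprint for single-node/far-edge clusters.
# Assumes alertmanagerMain.enabled and telemeterClient.enabled are
# supported in your release; verify against the product documentation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enabled: false
    telemeterClient:
      enabled: false
    prometheusK8s:
      retention: 24h   # short local retention; metrics forward to the hub
```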
