// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc

:_mod-docs-content-type: CONCEPT
[id="telco-observability-key-performance-metrics_{context}"]
= Key performance metrics

Depending on your system, there can be hundreds of available measurements.

Here are some key metrics that you should pay attention to:

* `etcd` response times
* API response times
* Pod restarts and scheduling
* Resource usage
* OVN health
* Overall cluster operator health

A good rule to follow is that if you decide that a metric is important, there should be an alert for it.
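
For example, the following `PrometheusRule` object is a minimal sketch of an alert for one of the metrics listed above: it fires when a cluster Operator reports the `Degraded` condition for an extended period. The object name, namespace, duration, and severity are illustrative assumptions, not values mandated by this document:

[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: operator-health-alerts          # illustrative name
  namespace: openshift-monitoring
spec:
  groups:
  - name: operator-health
    rules:
    - alert: ClusterOperatorDegraded    # illustrative alert name
      # cluster_operator_conditions is the same metric used in the Operator health queries below
      expr: cluster_operator_conditions{condition="Degraded", name!="version"} == 1
      for: 30m                          # assumed tolerance before firing
      labels:
        severity: warning
      annotations:
        summary: A cluster Operator has been reporting the Degraded condition for more than 30 minutes.
----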

[NOTE]
====
You can check the available metrics by running the following command:
[source,terminal]
----
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk http://localhost:9090/api/v1/metadata | jq '.data'
----
====
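
To inspect the metadata for a single metric, you can pass the `metric` query parameter to the same endpoint. The metric name in this example is only an illustration, taken from the `etcd` queries later in this section:

[source,terminal]
----
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk 'http://localhost:9090/api/v1/metadata?metric=etcd_server_leader_changes_seen_total' | jq '.data'
----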

[id="example-queries-promql"]
== Example queries in PromQL

The following tables show some queries that you can explore in the metrics query browser using the {product-title} console.

[NOTE]
====
The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser.
You can get the OpenShift Console FQDN by running the following command:
[source,terminal]
----
$ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].host}'
----
====
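
You can also run the same queries from the command line by calling the Prometheus HTTP API directly, in the same way as the metadata command shown earlier. The query in the following sketch is taken from the `etcd` table below and is only an example:

[source,terminal]
----
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk http://localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(etcd_server_leader_changes_seen_total[1440m]))' | jq '.data.result'
----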

.Node memory & CPU usage
[options="header"]
|===

|Metric|Query

|CPU % requests by node
|`sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100`

|Overall cluster CPU % utilization
|`sum by (managed_cluster) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (managed_cluster) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100`

|Memory % requests by node
|`sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="memory"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="memory"}[60m])) *100`

|Overall cluster memory % utilization
|`(1-(sum by (managed_cluster)(avg_over_time((node_memory_MemAvailable_bytes[60m]))))/sum by (managed_cluster)(avg_over_time(kube_node_status_allocatable{resource="memory"}[60m])))*100`

|===

.API latency by verb
[options="header"]
|===

|Metric|Query

|`GET`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="GET"}[60m])))`

|`PATCH`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="PATCH"}[60m])))`

|`POST`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="POST"}[60m])))`

|`LIST`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="LIST"}[60m])))`

|`PUT`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="PUT"}[60m])))`

|`DELETE`
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver\|openshift-apiserver", verb="DELETE"}[60m])))`

|Combined
|`histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"(openshift-apiserver\|kube-apiserver)", verb!="WATCH"}[60m])))`

|===

.`etcd`
[options="header"]
|===

|Metric|Query

|`fsync` 99th percentile latency (per instance)
|`histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))`

|`fsync` 99th percentile latency (per cluster)
|`sum by (managed_cluster) (histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[60m])))`

|Leader elections
|`sum(rate(etcd_server_leader_changes_seen_total[1440m]))`

|Network latency
|`histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))`

|===

.Operator health
[options="header"]
|===

|Metric|Query

|Degraded operators
|`sum by (managed_cluster, name) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))`

|Total degraded operators per cluster
|`sum by (managed_cluster) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))`

|===
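
To cross-check the results of these queries against the current state of the cluster, you can list the cluster Operators directly:

[source,terminal]
----
$ oc get clusteroperators
----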

[id="recommendations-for-storage-of-metrics"]
== Recommendations for storage of metrics

By default, Prometheus does not use persistent storage for saved metrics.
If you restart the Prometheus pods, all metrics data is lost.
You should configure the monitoring stack to use the back-end storage that is available on the platform.
To meet the high I/O demands of Prometheus, you should use local storage.

For Telco core clusters, you can use the Local Storage Operator to provide persistent storage for Prometheus.
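
For example, persistent storage for Prometheus is configured through the `cluster-monitoring-config` config map in the `openshift-monitoring` namespace. The following is a minimal sketch that assumes a storage class named `local-storage` backed by the Local Storage Operator; the storage class name and the requested size are illustrative assumptions, not sizing guidance:

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage   # assumed storage class created with the Local Storage Operator
          resources:
            requests:
              storage: 100Gi                # illustrative size
----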

{odf-first}, which deploys a Ceph cluster for block, file, and object storage, is also a suitable candidate for a Telco core cluster.

To keep system resource requirements low on a RAN {sno} or far edge cluster, you should not provision back-end storage for the monitoring stack.
Such clusters forward all metrics to the hub cluster, where you can provision a third-party monitoring platform.