Commit 3af4acb

Add metrics and alerts tutorial to the docs (#6341)
Signed-off-by: assaf-admi <aadmi@redhat.com>
1 parent e859a56 commit 3af4acb

File tree

1 file changed

+211
-3
lines changed

website/content/en/docs/building-operators/golang/advanced-topics.md

Lines changed: 211 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -113,15 +113,205 @@ func init() {
113113
* After adding new import paths to your operator project, run `go mod vendor` if a `vendor/` directory is present in the root of your project directory to fulfill these dependencies.
114114
* Your 3rd party resource needs to be added before the controllers are added in `"Setup all Controllers"`.
115115
116-
### Metrics
116+
### Monitoring and Observability
117+
This section covers how to create custom metrics, [alerts] and [recording rules] for your operator. It focuses on the technical aspects, and demonstrates the implementation by updating the sample [memcached-operator].
117118
118-
To learn about how metrics work in the Operator SDK read the [metrics section][metrics_doc] of the Kubebuilder documentation.
119+
For more information regarding monitoring best practices, take a look at our docs on [observability-best-practices].
119120
121+
#### Prerequisites
122+
The following steps are required in order to inspect the operator's custom metrics, alerts and recording rules:
123+
- Install Prometheus and Prometheus Operator. We recommend using [kube-prometheus] in production if you don’t have your own monitoring system. If you are just experimenting, you can install only Prometheus and Prometheus Operator.
124+
- Make sure Prometheus has access to the operator's namespace, by setting the corresponding RBAC rules.
125+
126+
Example: [prometheus_role.yaml] and [prometheus_role_binding.yaml]
127+
128+
#### Publishing Custom Metrics
129+
To publish custom metrics for your operator, use the global registry from `controller-runtime/pkg/metrics`.
130+
One way to do this is to declare your collectors as global variables, register them in a `RegisterMetrics()` function, and call that function from the `init()` function in `main.go`.
131+
132+
Example custom metric: [MemcachedDeploymentSizeUndesiredCountTotal]
133+
134+
```go
135+
package monitoring
136+
137+
import (
138+
"github.com/prometheus/client_golang/prometheus"
139+
"sigs.k8s.io/controller-runtime/pkg/metrics"
140+
)
141+
142+
var (
143+
MemcachedDeploymentSizeUndesiredCountTotal = prometheus.NewCounter(
144+
prometheus.CounterOpts{
145+
Name: "memcached_deployment_size_undesired_count_total",
146+
Help: "Total number of times the deployment size was not as desired.",
147+
},
148+
)
149+
)
150+
151+
// RegisterMetrics will register metrics with the global prometheus registry
152+
func RegisterMetrics() {
153+
metrics.Registry.MustRegister(MemcachedDeploymentSizeUndesiredCountTotal)
154+
}
155+
```
156+
157+
- The above example creates a new `Counter` metric. For other metric types, see the [Prometheus Documentation].
158+
- For more information about best practices for operator metrics, see [observability-best-practices].
159+
160+
[init() function example]:
161+
162+
```go
163+
package main
164+
165+
166+
import (
167+
...
168+
"github.com/example/memcached-operator/monitoring"
169+
)
170+
171+
func init() {
172+
...
173+
monitoring.RegisterMetrics()
174+
...
175+
}
176+
```
177+
178+
The next step is to add the controller logic that updates the metric's value. In this case the metric is a `Counter`, so the valid update operation is to increment it.
179+
180+
[Metric update example]:
181+
182+
```go
183+
...
184+
size := memcached.Spec.Size
185+
if *found.Spec.Replicas != size {
186+
// Increment MemcachedDeploymentSizeUndesiredCountTotal metric by 1
187+
monitoring.MemcachedDeploymentSizeUndesiredCountTotal.Inc()
188+
}
189+
...
190+
```
191+
Different metric types support different operations. For more information, see the [Prometheus Golang client] documentation.
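
The comparison in the metric update example dereferences `found.Spec.Replicas`, which is a pointer in the Deployment API. The following is a minimal sketch of a nil-safe variant of that check, not code from the sample operator: `replicasDiffer` and `defaultReplicas` are hypothetical names, and the sketch assumes an unset replica count should be treated as the Kubernetes default of 1.

```go
package main

import "fmt"

// defaultReplicas is the value Kubernetes applies when a Deployment's
// spec.replicas field is left unset. (Hypothetical constant name.)
const defaultReplicas int32 = 1

// replicasDiffer reports whether the observed replica count differs from the
// desired size. A nil pointer is treated as the default of 1 instead of being
// dereferenced. (Hypothetical helper, not part of the sample operator.)
func replicasDiffer(observed *int32, desired int32) bool {
	current := defaultReplicas
	if observed != nil {
		current = *observed
	}
	return current != desired
}

func main() {
	three := int32(3)
	fmt.Println(replicasDiffer(&three, 3)) // sizes match
	fmt.Println(replicasDiffer(&three, 5)) // sizes differ
	fmt.Println(replicasDiffer(nil, 1))    // unset is treated as 1
}
```

In the controller, the counter's `Inc()` call would then run only when such a helper returns `true`.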
192+
193+
#### Publishing Alerts and Recording Rules
194+
To add alerts and recording rules that are specific to the operator, we'll create a dedicated PrometheusRule object using the [prometheus-operator API].
195+
196+
[PrometheusRule example]:
197+
198+
```go
199+
package monitoring
200+
201+
import (
202+
monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
203+
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
204+
"k8s.io/apimachinery/pkg/util/intstr"
205+
)
206+
207+
// NewPrometheusRule creates new PrometheusRule(CR) for the operator to have alerts and recording rules
208+
func NewPrometheusRule(namespace string) *monitoringv1.PrometheusRule {
209+
return &monitoringv1.PrometheusRule{
210+
TypeMeta: metav1.TypeMeta{
211+
APIVersion: monitoringv1.SchemeGroupVersion.String(),
212+
Kind: "PrometheusRule",
213+
},
214+
ObjectMeta: metav1.ObjectMeta{
215+
Name: "memcached-operator-rules",
216+
Namespace: namespace,
217+
},
218+
Spec: *NewPrometheusRuleSpec(),
219+
}
220+
}
221+
222+
// NewPrometheusRuleSpec creates PrometheusRuleSpec for alerts and recording rules
223+
func NewPrometheusRuleSpec() *monitoringv1.PrometheusRuleSpec {
224+
return &monitoringv1.PrometheusRuleSpec{
225+
Groups: []monitoringv1.RuleGroup{{
226+
Name: "memcached.rules",
227+
Rules: []monitoringv1.Rule{
228+
createOperatorUpTotalRecordingRule(),
229+
createOperatorDownAlertRule(),
230+
},
231+
}},
232+
}
233+
}
234+
235+
// createOperatorUpTotalRecordingRule creates memcached_operator_up_total recording rule
236+
func createOperatorUpTotalRecordingRule() monitoringv1.Rule {
237+
return monitoringv1.Rule{
238+
Record: "memcached_operator_up_total",
239+
Expr: intstr.FromString("sum(up{pod=~'memcached-operator-controller-manager-.*'} or vector(0))"),
240+
}
241+
}
242+
243+
// createOperatorDownAlertRule creates MemcachedOperatorDown alert rule
244+
func createOperatorDownAlertRule() monitoringv1.Rule {
245+
return monitoringv1.Rule{
246+
Alert: "MemcachedOperatorDown",
247+
Expr: intstr.FromString("memcached_operator_up_total == 0"),
248+
Annotations: map[string]string{
249+
"description": "No running memcached-operator pods were detected in the last 5 min.",
250+
},
251+
For: "5m",
252+
Labels: map[string]string{
253+
"severity": "critical",
254+
},
255+
}
256+
}
257+
```
258+
259+
Next, we need to make sure that the new PrometheusRule is created and kept in its desired state. One way to achieve this is by expanding the existing `Reconcile()` function logic.
260+
261+
[PrometheusRule reconciliation example]:
262+
263+
```go
264+
func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
265+
...
266+
...
267+
// Check if prometheus rule already exists, if not create a new one
268+
foundRule := &monitoringv1.PrometheusRule{}
269+
err := r.Get(ctx, types.NamespacedName{Name: ruleName, Namespace: namespace}, foundRule)
270+
if err != nil && apierrors.IsNotFound(err) {
271+
// Define a new prometheus rule
272+
prometheusRule := monitoring.NewPrometheusRule(namespace)
273+
if err := r.Create(ctx, prometheusRule); err != nil {
274+
log.Error(err, "Failed to create prometheus rule")
275+
return ctrl.Result{}, err
276+
}
277+
}
278+
279+
if err == nil {
280+
// Check if prometheus rule spec was changed, if so set as desired
281+
desiredRuleSpec := monitoring.NewPrometheusRuleSpec()
282+
if !reflect.DeepEqual(foundRule.Spec.DeepCopy(), desiredRuleSpec) {
283+
desiredRuleSpec.DeepCopyInto(&foundRule.Spec)
284+
if err := r.Update(ctx, foundRule); err != nil {
285+
log.Error(err, "Failed to update prometheus rule")
286+
return ctrl.Result{}, err
287+
}
288+
}
}
289+
...
290+
...
291+
}
292+
```
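
The reconciliation above boils down to a three-way decision: create the rule when it is missing, overwrite its spec when it has drifted from the desired one, and otherwise do nothing. The following is a minimal sketch of that decision as a pure function. `rule`, `ruleSpec`, `action` and `decide` are hypothetical, simplified stand-ins for illustration and are not the prometheus-operator types.

```go
package main

import (
	"fmt"
	"reflect"
)

// rule and ruleSpec are simplified, hypothetical stand-ins for the
// prometheus-operator Rule and PrometheusRuleSpec types.
type rule struct {
	Record string
	Alert  string
	Expr   string
}

type ruleSpec struct {
	Rules []rule
}

// action names the step the reconciler should take.
type action string

const (
	actionCreate action = "create"
	actionUpdate action = "update"
	actionNone   action = "none"
)

// decide returns actionCreate when no object was found (found is nil),
// actionUpdate when the found spec differs from the desired one, and
// actionNone when they are deeply equal.
func decide(found *ruleSpec, desired ruleSpec) action {
	if found == nil {
		return actionCreate
	}
	if !reflect.DeepEqual(*found, desired) {
		return actionUpdate
	}
	return actionNone
}

func main() {
	desired := ruleSpec{Rules: []rule{{Record: "memcached_operator_up_total", Expr: "sum(up)"}}}
	drifted := ruleSpec{Rules: []rule{{Record: "memcached_operator_up_total", Expr: "up"}}}
	same := ruleSpec{Rules: []rule{{Record: "memcached_operator_up_total", Expr: "sum(up)"}}}

	fmt.Println(decide(nil, desired))
	fmt.Println(decide(&drifted, desired))
	fmt.Println(decide(&same, desired))
}
```

Note that `reflect.DeepEqual` treats a nil slice and an empty slice as different, so a spec that round-trips through the API server can compare as changed even when it is semantically the same.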
293+
294+
- Please review the [observability-best-practices] for additional important information regarding alerts and recording rules.
295+
296+
297+
#### Alerts Unit Testing
298+
We highly recommend implementing unit tests for your Prometheus rules. For more information, see the Prometheus [unit testing documentation]. For examples of unit testing in a Golang operator, see the sample memcached-operator [alerts unit tests].
299+
300+
#### Inspecting the metrics, alerts and recording rules with Prometheus UI
301+
Finally, to inspect the exposed metrics and alerts, we need to forward the port on which Prometheus serves them (`9090` by default). This can be done with the following command:
302+
```bash
303+
$ kubectl -n monitoring port-forward svc/prometheus-k8s 9090
304+
```
305+
306+
307+
This command assumes that the Prometheus service is available in the `monitoring` namespace.
308+
309+
Now you can access the Prometheus UI at `http://localhost:9090`. For more details on exposing Prometheus metrics, refer to the [kube-prometheus docs].
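
Once the port-forward is running, the data shown in the UI can also be fetched from the Prometheus HTTP API. The following is a minimal sketch and not part of the sample operator: `buildQueryURL` and `firstSampleValue` are hypothetical helper names, and the JSON body is a hand-written sample in the shape returned by the `/api/v1/query` endpoint, so the program runs without a cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// queryResponse models the parts of an instant-query response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  []interface{}     `json:"value"` // [unix time, "sample value"]
		} `json:"result"`
	} `json:"data"`
}

// buildQueryURL returns the instant-query URL for a PromQL expression.
func buildQueryURL(base, expr string) string {
	return base + "/api/v1/query?" + url.Values{"query": {expr}}.Encode()
}

// firstSampleValue extracts the value of the first sample from a response body.
func firstSampleValue(body []byte) (string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.Status != "success" {
		return "", fmt.Errorf("query failed with status %q", resp.Status)
	}
	if len(resp.Data.Result) == 0 {
		return "", fmt.Errorf("query returned no samples")
	}
	v := resp.Data.Result[0].Value
	if len(v) != 2 {
		return "", fmt.Errorf("unexpected sample shape")
	}
	s, ok := v[1].(string)
	if !ok {
		return "", fmt.Errorf("sample value is not a string")
	}
	return s, nil
}

func main() {
	fmt.Println(buildQueryURL("http://localhost:9090", "memcached_operator_up_total"))

	// Hand-written sample body; an HTTP GET on the URL above would return
	// a document of this shape.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"memcached_operator_up_total"},"value":[1700000000.123,"1"]}]}}`)
	value, err := firstSampleValue(body)
	fmt.Println(value, err)
}
```

The same approach works for the custom counter by querying `memcached_deployment_size_undesired_count_total` instead of the recording rule.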
120310
121311
### Handle Cleanup on Deletion
122312
123313
Operators may create objects as part of their operational duty. Object accumulation can consume unnecessary resources, slow down the API and clutter the user interface. As such it is important for operators to keep good hygiene and to clean up resources when they are not needed. Here are a few common scenarios.
124-
314+
125315
#### Internal Resources
126316
127317
A typical example of correct resource cleanup is the [Jobs][jobs] implementation. When a Job is created, one or multiple Pods are created as child resources. When a Job is deleted, the associated Pods are deleted as well. This is a very common pattern, easily achieved by setting an owner reference from the parent (Job) to the child (Pod) object. Here is a code snippet for doing so, where "r" is the reconciler and "ctrl" is the controller-runtime library:
@@ -311,3 +501,21 @@ Authors may decide to distribute their bundles for various architectures: x86_64
311501
[apimachinery_condition]: https://github.com/kubernetes/apimachinery/blob/d4f471b82f0a17cda946aeba446770563f92114d/pkg/apis/meta/v1/types.go#L1368
312502
[helpers-conditions]: https://github.com/kubernetes/apimachinery/blob/master/pkg/api/meta/conditions.go
313503
[multi_arch]:/docs/advanced-topics/multi-arch
504+
[observability-best-practices]:https://sdk.operatorframework.io/docs/best-practices/observability-best-practices/
505+
[alerts]:https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
506+
[recording rules]:https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
507+
[prometheus_role.yaml]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/config/rbac/prometheus_role.yaml
508+
[prometheus_role_binding.yaml]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/config/rbac/prometheus_role_binding.yaml
509+
[MemcachedDeploymentSizeUndesiredCountTotal]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/monitoring/metrics.go
510+
[init() function example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/cmd/main.go
511+
[Metric update example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/internal/controller/memcached_controller.go
512+
[Prometheus Documentation]:https://prometheus.io/docs/concepts/metric_types/
513+
[Prometheus Golang client]:https://pkg.go.dev/github.com/prometheus/client_golang/prometheus
514+
[kube-prometheus]:https://github.com/prometheus-operator/kube-prometheus
515+
[memcached-operator]:https://github.com/operator-framework/operator-sdk/tree/master/testdata/go/v4-alpha/monitoring/memcached-operator
516+
[prometheus-operator API]:https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md
517+
[PrometheusRule example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/monitoring/alerts.go
518+
[PrometheusRule reconciliation example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/internal/controller/memcached_controller.go
519+
[unit testing documentation]:https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
520+
[alerts unit tests]:https://github.com/operator-framework/operator-sdk/tree/master/testdata/go/v4-alpha/monitoring/memcached-operator/monitoring/prom-rule-ci
521+
[kube-prometheus docs]:https://github.com/prometheus-operator/kube-prometheus/blob/main/docs/access-ui.md#prometheus
