Skip to content

rancher-sandbox/opentelemetry-stack-charts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

opentelemetry-stack-charts

Requirements

Install certmanager (required by Otel operator):

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.18.2/cert-manager.yaml

Install the OpenTelemetry Operator helm charts (would not work as a dependency since required CRs are deployed on the fly in an admission webhook, causing our local templates to fail)

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
help repo update 
helm install -n opentelemetry-operator-system --create-namespace opentelemetry-operator open-telemetry/opentelemetry-operator

Monitoring

Rancher monitoring setup

Reference : https://prometheus.io/docs/guides/opentelemetry/

Required version is whatever includes rancher/ob-team-charts#110

Required values:

prometheus:
    disableServiceDiscovery : true
    enableOTLPReceiver : true
    tsdb:
        outOfOrderTimeWindow: 300s
    # ...

Playing around with the following configuration may also prove useful:

prometheus:
    otlp:
        promoteResourceAttributes:
        - service.instance.id
        - service.name
        - service.namespace
        - cloud.availability_zone
        - cloud.region
        - container.name
        - deployment.environment.name
        - k8s.cluster.name
        - k8s.container.name
        - k8s.cronjob.name
        - k8s.daemonset.name
        - k8s.deployment.name
        - k8s.job.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.replicaset.name
        - k8s.statefulset.name
        translationStrategy: NoUTF8EscapingWithSuffixes
helm install otel-stack -n otel-stack --create-namespace ./ --set metricsCollector.enabled=true

Grafana dashboards

For debugging opentelemetry collector using the self-telemetry metrics, use: https://grafana.com/grafana/dashboards/15983-opentelemetry-collector/

Analyzing metric differences

Build the hacky tool for analyzing differences:

cd hack && make build
kubectl port-forward -n cattle-monitoring-system svc/rancher-monitoring-prometheus 9090:9090

Setup monitoring with default prometheus collection:

cd hack && ./bin/tools -d prometheus 

Setup monitoring with otel metrics collection (wait a couple of minutes to wait for collectors to start and ingest all relevant metrics):

cd hack && ./bin/tools -d otel

Then run the command

cd hack && ./bin/tools compare

As of writing this, the meaningful differences are that otel collector adds total suffixes by default to counters:

[
    "apiextensions_openapi_v2_regeneration_count_total",
    "apiextensions_openapi_v3_regeneration_count_total",
    "apiserver_admission_webhook_fail_open_count_total",
    "apiserver_admission_webhook_rejection_count_total",
    "apiserver_egress_dialer_dial_failure_count_total",
    "authenticated_user_requests_total",
    "authentication_attempts_total",
    "container_memory_failcnt_total",
    "endpoint_slice_controller_changes_total",
    "endpoint_slice_controller_syncs_total",
    "endpoint_slice_mirroring_controller_changes_total",
    "go_cpu_classes_cpu_seconds_total",
    "go_cpu_classes_gc_cpu_seconds_total",
    "go_cpu_classes_scavenge_cpu_seconds_total",
    "go_gc_cycles_gc_cycles_total",
    "go_sync_mutex_wait_seconds_total",
    "kube_poddisruptionbudget_created",
    "kube_poddisruptionbudget_status_current_healthy",
    "kube_poddisruptionbudget_status_desired_healthy",
    "kube_poddisruptionbudget_status_expected_pods",
    "kube_poddisruptionbudget_status_observed_generation",
    "kube_poddisruptionbudget_status_pod_disruptions_allowed",
    "kubelet_evented_pleg_connection_error_count_total",
    "kubelet_evented_pleg_connection_success_count_total",
    "kubelet_pleg_discard_events_total",
    "lasso_controller_handler_execution_total"
]

where as prometheus does not by default:

[
    "apiextensions_openapi_v2_regeneration_count",
    "apiextensions_openapi_v3_regeneration_count",
    "apiserver_admission_webhook_fail_open_count",
    "apiserver_admission_webhook_rejection_count",
    "apiserver_egress_dialer_dial_failure_count",
    "authenticated_user_requests",
    "authentication_attempts",
    "container_memory_failcnt",
    "endpoint_slice_controller_changes",
    "endpoint_slice_controller_syncs",
    "endpoint_slice_mirroring_controller_changes",
    "go_cpu_classes_gc_total_cpu_seconds_total",
    "go_cpu_classes_scavenge_total_cpu_seconds_total",
    "go_cpu_classes_total_cpu_seconds_total",
    "go_gc_cycles_total_gc_cycles_total",
    "go_sync_mutex_wait_total_seconds_total",
    "kubelet_evented_pleg_connection_error_count",
    "kubelet_evented_pleg_connection_success_count",
    "kubelet_pleg_discard_events",
    "lasso_controller_total_handler_execution",
]

Also note that the collector instances are not able to scrape with TLS configs at the moment, so for example the otel collector will not be able to collect:

A fork of the upstream operator will be required to support that functionality.

[
    "prometheus_operator_build_info",
    "prometheus_operator_feature_gate",
    "prometheus_operator_kubelet_managed_resource",
    "prometheus_operator_kubernetes_client_http_request_duration_seconds_count",
    "prometheus_operator_kubernetes_client_http_request_duration_seconds_sum",
    "prometheus_operator_kubernetes_client_http_requests_total",
    "prometheus_operator_kubernetes_client_rate_limiter_duration_seconds_count",
    "prometheus_operator_kubernetes_client_rate_limiter_duration_seconds_sum",
    "prometheus_operator_list_operations_failed_total",
    "prometheus_operator_list_operations_total",
    "prometheus_operator_managed_resources",
    "prometheus_operator_node_address_lookup_errors_total",
    "prometheus_operator_node_syncs_failed_total",
    "prometheus_operator_node_syncs_total",
    "prometheus_operator_ready",
    "prometheus_operator_reconcile_duration_seconds_bucket",
    "prometheus_operator_reconcile_duration_seconds_count",
    "prometheus_operator_reconcile_duration_seconds_sum",
    "prometheus_operator_reconcile_errors_total",
    "prometheus_operator_reconcile_operations_total",
    "prometheus_operator_reconcile_sts_delete_create_total",
    "prometheus_operator_spec_replicas",
    "prometheus_operator_spec_shards",
    "prometheus_operator_status_update_errors_total",
    "prometheus_operator_status_update_operations_total",
    "prometheus_operator_syncs",
    "prometheus_operator_triggered_total",
    "prometheus_operator_watch_operations_failed_total",
    "prometheus_operator_watch_operations_total",
    "prometheus_target_metadata_cache_bytes",
    "prometheus_target_metadata_cache_entries",
    "prometheus_target_scrape_pool_symboltable_items",
    "prometheus_target_scrape_pool_sync_total",
    "prometheus_target_scrape_pool_target_limit",
    "prometheus_target_scrape_pool_targets",
    "prometheus_target_sync_failed_total",
    "prometheus_target_sync_length_seconds",
    "prometheus_target_sync_length_seconds_count",
    "prometheus_target_sync_length_seconds_sum"
]

About

No description, website, or topics provided.

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published