Metric counters report inconsistent values for policy metrics #115

@phenixblue

What happened:

The policy failure counter metrics report inconsistent values.

If you poll a particular counter metric repeatedly, the reported value jumps back and forth between different numbers in a way that does not reflect the actual policy evaluations.

Snippet from running curl (every 2 seconds) filtered for a single metric:

magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0

It looks as if there are multiple independent counters running in the background, and the metrics route handler returns the value from one of them on some requests and from another on others.
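To illustrate the hypothesis, here is a minimal sketch in plain Python, not MagTape code. It assumes, without confirmation from this issue, that each request is answered by one of several holders of an independent in-memory counter. The `Worker` class, the `simulate` function, and the 35/1 split are hypothetical and chosen only to mirror the values observed above.

```python
import random


class Worker:
    """Hypothetical stand-in for one process that keeps its own counter."""

    def __init__(self):
        self.fail_count = 0.0

    def record_failure(self):
        # Each worker only sees the policy failures it handled itself.
        self.fail_count += 1.0

    def scrape(self):
        # A metrics request answered by this worker reports only its own total.
        return self.fail_count


def simulate(failures_per_worker, scrapes, seed=0):
    """Return the values seen when each scrape lands on a random worker."""
    rng = random.Random(seed)
    workers = [Worker() for _ in failures_per_worker]
    for worker, failures in zip(workers, failures_per_worker):
        for _ in range(failures):
            worker.record_failure()
    return [rng.choice(workers).scrape() for _ in range(scrapes)]


if __name__ == "__main__":
    # Assumed split: one worker handled 35 failures, another handled 1.
    print(simulate([35, 1], scrapes=20))
```

Under this assumption every scrape returns either 35.0 or 1.0 and never the combined total, which matches the pattern in the captured output. If this is the cause, the counters would need to be aggregated across the holders before being exposed.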

What you expected to happen:

Metric counter values should be consistent from one request to the next.

How to reproduce it (as minimally and precisely as possible):

  • Deploy MagTape
  • Run make test-functional and/or manually apply some resources to force policy failures to increment the counters
  • Port forward to a specific MagTape pod on port 5000
  • Run curl against the metrics endpoint in a loop and record the values for a specific metric; the value should change between requests:
$ for i in {1..100}; do curl -ks https://localhost:5000/metrics | grep "magtape_policy_total" | grep "test1" | grep "fail" | grep "privileged" >> /tmp/magtape-pod1-metrics.out; done
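To summarize the captured output, the distinct values can be tallied. The following is a minimal sketch: the `tally_values` helper is hypothetical, and it assumes each captured line ends with the sample value and carries no trailing timestamp, as in the snippet above.

```python
import os
import sys
from collections import Counter


def tally_values(lines):
    """Count how often each sample value appears in captured metric lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last whitespace-separated token.
        value = float(line.rsplit(None, 1)[-1])
        counts[value] += 1
    return counts


if __name__ == "__main__":
    # Default path is the file written by the curl loop above.
    path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/magtape-pod1-metrics.out"
    if os.path.isfile(path):
        with open(path) as handle:
            for value, seen in sorted(tally_values(handle).items()):
                print(f"{value}: seen {seen} times")
    else:
        print(f"no capture file found at {path}")
```

A consistent counter would yield a single value, or a short non-decreasing series if policy failures occurred during the capture. Two values that alternate, as seen here, point to more than one source.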

Anything else we need to know?:

MagTape was running with 3 replicas

Environment:

  • Kubernetes version (use kubectl version): v1.17
  • Cloud provider or hardware configuration:
  • Others:
    • MagTape v2.3.2
