test(ab): do not print dimensions that are the same across all metrics

roypat · roypat · commit fcb39a68b19c · 2025-04-25T10:49:02.000Z
When an A/B-Test fails, it prints all dimensions associated with the
metric that changed. However, if some dimension is the same across
literally all metrics emitted (for example, instance name and host
kernel version will never change in the middle of a test run), then
that's arguably just noise, and makes it hard to parse potentially
interesting dimensions. So avoid printing all dimensions that are
literally the same across all metrics.

Note that this does _not_ mean that if, for example, cpu_utilization
only changes for read throughput, the "read vs write" dimension won't
be printed anymore. We only drop dimensions if they are the same across
_all_ metrics, regardless of whether those metrics had a statistically
significant change. In this scenario, the "mode: write" metric still
exists, it simply didn't change, and so the "mode: read" line won't be
dropped from the output.
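
The rule above can be sketched in isolation as follows. This is a
minimal illustration, not the code in tools/ab_test.py: the helper name
drop_constant_dimensions, the "mode" dimension and the sample data are
assumptions made up for the example.

```python
from collections import defaultdict


def drop_constant_dimensions(metrics):
    """Drop dimensions whose value is identical across every metric.

    `metrics` maps a frozenset of (dimension, value) pairs to arbitrary
    data. Hypothetical helper, for illustration only.
    """
    # Collect every value seen for each dimension, across all metrics.
    distinct_values = defaultdict(set)
    for dimension_set in metrics:
        for dimension, value in dimension_set:
            distinct_values[dimension].add(value)

    # A dimension that only ever took one value carries no information.
    constant = {dim for dim, vals in distinct_values.items() if len(vals) == 1}

    return {
        frozenset((d, v) for d, v in dimension_set if d not in constant): data
        for dimension_set, data in metrics.items()
    }


# Made-up data: "instance" never varies, "mode" does.
example = {
    frozenset({("instance", "m6a.metal"), ("mode", "read")}): "changed",
    frozenset({("instance", "m6a.metal"), ("mode", "write")}): "unchanged",
}
filtered = drop_constant_dimensions(example)
```

Here "instance" is removed from both keys, while "mode" is kept even
though only the read metric changed, because the write metric also
exists and has a different value for that dimension.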

Before:

[Firecracker A/B-Test Runner] A/B-testing shows a change of -2.07μs, or
-4.70%, (from 44.04μs to 41.98μs) for metric clat_read with p=0.0002.
This means that observing a change of this magnitude or worse, assuming
that performance characteristics did not change across the tested
commits, has a probability of 0.02%. Tested Dimensions:
{
  "cpu_model": "AMD EPYC 7R13 48-Core Processor",
  "fio_block_size": "4096",
  "fio_mode": "randrw",
  "guest_kernel": "linux-6.1",
  "guest_memory": "1024.0MB",
  "host_kernel": "linux-6.8",
  "instance": "m6a.metal",
  "io_engine": "Sync",
  "performance_test": "test_block_latency",
  "rootfs": "ubuntu-24.04.squashfs",
  "vcpus": "2"
}

After:

[Firecracker A/B-Test Runner] A/B-testing shows a change of -2.07μs, or
-4.70%, (from 44.04μs to 41.98μs) for metric clat_read with p=0.0002.
This means that observing a change of this magnitude or worse, assuming
that performance characteristics did not change across the tested
commits, has a probability of 0.02%. Tested Dimensions:
{
  "guest_kernel": "linux-6.1",
  "io_engine": "Sync",
  "vcpus": "2"
}

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
diff --git a/tools/ab_test.py b/tools/ab_test.py
@@ -114,6 +114,8 @@ def load_data_series(report_path: Path, tag=None, *, reemit: bool = False):
     # Dictionary mapping EMF dimensions to A/B-testable metrics/properties
     processed_emf = {}
 
+    distinct_values_per_dimenson = defaultdict(set)
+
     report = json.loads(report_path.read_text("UTF-8"))
     for test in report["tests"]:
         for line in test["teardown"]["stdout"].splitlines():
@@ -133,6 +135,9 @@ def load_data_series(report_path: Path, tag=None, *, reemit: bool = False):
                 if not dimensions:
                     continue
 
+                for dimension, value in dimensions.items():
+                    distinct_values_per_dimenson[dimension].add(value)
+
                 dimension_set = frozenset(dimensions.items())
 
                 if dimension_set not in processed_emf:
@@ -149,7 +154,24 @@ def load_data_series(report_path: Path, tag=None, *, reemit: bool = False):
 
                         values.extend(result[metric][0])
 
-    return processed_emf
+    irrelevant_dimensions = set()
+
+    for dimension, distinct_values in distinct_values_per_dimenson.items():
+        if len(distinct_values) == 1:
+            irrelevant_dimensions.add(dimension)
+
+    post_processed_emf = {}
+
+    for dimension_set, metrics in processed_emf.items():
+        processed_key = frozenset(
+            (dim, value)
+            for (dim, value) in dimension_set
+            if dim not in irrelevant_dimensions
+        )
+
+        post_processed_emf[processed_key] = metrics
+
+    return post_processed_emf
 
 
 def collect_data(binary_dir: Path, tests: list[str]):