
[awsemfexporter] Fix grouping for container insights metrics for mixed metric types #320


Merged
lisguo merged 13 commits into aws-cwa-dev on Jun 13, 2025

Conversation


@lisguo lisguo commented Jun 11, 2025

Description

Currently, if container insights metrics of different metric types exist (counter vs. gauge vs. histogram), the EMF exporter splits them into separate EMF logs.

This is a problem for EBS NVMe metrics, where node_diskio_ebs_volume_queue_length is a gauge while the other metrics are counters. Splitting them across multiple EMF logs causes extra cost.
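
A minimal, hypothetical Go sketch of the grouping behavior (the groupKey type below is illustrative, not the exporter's actual metadata type): because the metric data type is part of the grouping key, the gauge is split into its own group even though its dimensions match the counters.

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// groupKey is a simplified, hypothetical stand-in for the exporter's grouping
// metadata. Since the metric data type is part of the key, a gauge and a
// counter with identical dimensions hash to different groups and therefore
// end up in separate EMF logs.
type groupKey struct {
	namespace   string
	timestampMs int64
	metricType  pmetric.MetricType
	labels      string
}

func main() {
	groups := map[groupKey][]string{}

	counter := groupKey{"ContainerInsights", 1749661186151, pmetric.MetricTypeSum, "ClusterName|InstanceId|NodeName|VolumeId"}
	gauge := groupKey{"ContainerInsights", 1749661186151, pmetric.MetricTypeGauge, "ClusterName|InstanceId|NodeName|VolumeId"}

	groups[counter] = append(groups[counter], "node_diskio_ebs_total_read_ops")
	groups[gauge] = append(groups[gauge], "node_diskio_ebs_volume_queue_length")

	// Two distinct groups -> two EMF logs for what is logically one set of dimensions.
	fmt.Println("EMF logs emitted:", len(groups))
}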

Testing

Tested this on an existing cluster and verified that the EBS NVMe metrics are emitted in a single EMF log:

Before:
Screenshot 2025-06-12 at 12 09 59 PM

EMF Log 1:

    {
        "AutoScalingGroupName": "eks-core-node-group-20250516135915846500000035-54cb6cb2-2f6d-437a-9c5d-35345b597a93",
        "CloudWatchMetrics": [
            {
                "Namespace": "ContainerInsights",
                "Dimensions": [
                    [
                        "ClusterName"
                    ],
                    [
                        "ClusterName",
                        "InstanceId",
                        "NodeName"
                    ],
                    [
                        "ClusterName",
                        "InstanceId",
                        "NodeName",
                        "VolumeId"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_write_ops",
                        "Unit": "Count",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_read_ops",
                        "Unit": "Count",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_write_bytes",
                        "Unit": "Bytes",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_read_bytes",
                        "Unit": "Bytes",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_read_time",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_write_time",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                        "Unit": "Second",
                        "StorageResolution": 60
                    }
                ]
            }
        ],
        "ClusterName": "my-cluster-name",
        "InstanceId": "i-0aebb26742a731ee8",
        "InstanceType": "m5.2xlarge",
        "NodeName": "ip-100-64-144-129.us-east-2.compute.internal",
        "PlatformType": "AWS::EKS",
        "Timestamp": "1749661186151",
        "Type": "NodeEBS",
        "Version": "0",
        "VolumeId": "vol-084d2896789180f9c",
        "http.scheme": "http",
        "instance_id": "i-0aebb26742a731ee8",
        "k8s.namespace.name": "kube-system",
        "kubernetes": {
            "host": "ip-100-64-144-129.us-east-2.compute.internal"
        },
        "net.host.name": "ebs-csi-node.kube-system.svc",
        "net.host.port": "3302",
        "server.address": "ebs-csi-node.kube-system.svc",
        "server.port": "3302",
        "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
        "service.name": "containerInsightsNVMeExporterScraper",
        "url.scheme": "http",
        "volume_id": "vol-084d2896789180f9c",
        "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
        "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
        "node_diskio_ebs_total_read_bytes": 1060864,
        "node_diskio_ebs_total_read_ops": 47,
        "node_diskio_ebs_total_read_time": 0.02844800000000003,
        "node_diskio_ebs_total_write_bytes": 1355776,
        "node_diskio_ebs_total_write_ops": 12,
        "node_diskio_ebs_total_write_time": 0.02070000000003347,
        "node_diskio_ebs_volume_performance_exceeded_iops": 0,
        "node_diskio_ebs_volume_performance_exceeded_tp": 0
    }

EMF Log 2:

{
    "AutoScalingGroupName": "eks-core-node-group-20250516135915846500000035-54cb6cb2-2f6d-437a-9c5d-35345b597a93",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeId"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "my-cluster-name",
    "InstanceId": "i-0aebb26742a731ee8",
    "InstanceType": "m5.2xlarge",
    "NodeName": "ip-100-64-144-129.us-east-2.compute.internal",
    "PlatformType": "AWS::EKS",
    "Timestamp": "1749661186151",
    "Type": "NodeEBS",
    "Version": "0",
    "VolumeId": "vol-084d2896789180f9c",
    "http.scheme": "http",
    "instance_id": "i-0aebb26742a731ee8",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-100-64-144-129.us-east-2.compute.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-084d2896789180f9c",
    "node_diskio_ebs_volume_queue_length": 0
}

After:
Screenshot 2025-06-12 at 12 10 21 PM

Sample Log:

{
    "AutoScalingGroupName": "eks-core-node-group-20250516135915846500000035-54cb6cb2-2f6d-437a-9c5d-35345b597a93",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeId"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_time",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_time",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "my-cluster-name",
    "InstanceId": "i-0aebb26742a731ee8",
    "InstanceType": "m5.2xlarge",
    "NodeName": "ip-100-64-144-129.us-east-2.compute.internal",
    "PlatformType": "AWS::EKS",
    "Timestamp": "1749678166150",
    "Type": "NodeEBS",
    "Version": "0",
    "VolumeId": "vol-084d2896789180f9c",
    "http.scheme": "http",
    "instance_id": "i-0aebb26742a731ee8",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-100-64-144-129.us-east-2.compute.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-084d2896789180f9c",
    "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
    "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
    "node_diskio_ebs_total_read_bytes": 1060864,
    "node_diskio_ebs_total_read_ops": 47,
    "node_diskio_ebs_total_read_time": 0.028466999999999132,
    "node_diskio_ebs_total_write_bytes": 1355776,
    "node_diskio_ebs_total_write_ops": 12,
    "node_diskio_ebs_total_write_time": 0.014161000000058266,
    "node_diskio_ebs_volume_performance_exceeded_iops": 0,
    "node_diskio_ebs_volume_performance_exceeded_tp": 0,
    "node_diskio_ebs_volume_queue_length": 0
}

@lisguo lisguo requested review from movence and jefchien June 11, 2025 22:12
@lisguo lisguo requested review from dricross and removed request for movence June 12, 2025 14:58
@lisguo lisguo marked this pull request as ready for review June 12, 2025 15:30
@lisguo lisguo requested a review from mxiamxia as a code owner June 12, 2025 15:30
@dricross

> emf exporter will split them into different emf log groups.

Do you mean that the exporter splits the metrics into separate log events within the same CloudWatch Log Group? Or do they actually go to separate CloudWatch log groups?

Comment on lines 95 to 98
if metadata.receiver == containerInsightsReceiver {
    // For container insights, put all metrics in the same group regardless of type (ie gauge/counter)
    metadata.groupedMetricMetadata.metricDataType = pmetric.MetricTypeEmpty
}


I am concerned about mixing other metric types into the same log event, e.g. histograms/summary with gauge/counters.

This will only group them into the same log event if they also have the same dimensions, right? If so, then I think it'd be ok.

@lisguo (Author)

We don't emit non-counter metrics for container insights today. But if one of the container insights Prometheus scrapers (neuron/dcgm/nvme/etc.) starts ingesting non-counter metrics, we want them all to be in the same log. Otherwise customers will be charged extra.


Histogram/summary metrics have extra dimensions like "quantile" which we wouldn't want applied to the counter/gauge metrics since that doesn't make any sense. So I think they will have to be in a separate log event.


if metadata.receiver == containerInsightsReceiver {
    // For container insights, put all metrics in the same group regardless of type (ie gauge/counter)
    metadata.groupedMetricMetadata.metricDataType = pmetric.MetricTypeEmpty
}
@lisguo (Author)

Need to test if this affects any other part of container insights. I see that the only other reference to the data type is in the translateGroupedMetricToCWMetric func:

if isPrometheusMetric {
    fields[fieldPrometheusMetricType] = fieldPrometheusTypes[groupedMetric.metadata.metricDataType]
}

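
For reference, a minimal sketch of how that lookup behaves once metricDataType is forced to MetricTypeEmpty. The map contents here are assumed for illustration and are not the exporter's actual fieldPrometheusTypes table.

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// Illustrative stand-in for the exporter's fieldPrometheusTypes map; the real
// contents may differ. With the container insights override, metricDataType is
// pmetric.MetricTypeEmpty, so the lookup falls back to the map's zero value ("")
// instead of "gauge"/"counter".
var fieldPrometheusTypes = map[pmetric.MetricType]string{
	pmetric.MetricTypeGauge: "gauge",
	pmetric.MetricTypeSum:   "counter",
}

func main() {
	fmt.Printf("prom_metric_type for MetricTypeEmpty: %q\n", fieldPrometheusTypes[pmetric.MetricTypeEmpty])
	fmt.Printf("prom_metric_type for MetricTypeGauge: %q\n", fieldPrometheusTypes[pmetric.MetricTypeGauge])
}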

idx := 0

// Sort metadata list to prevent race condition
var metadataList []cWMetricMetadata
@lisguo (Author)

This was a flaky test. The GitHub runner runs the unit tests with -race, so I had to fix it by sorting the metadata list.
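
A minimal sketch of the sort-before-iterate pattern described above, using a hypothetical metadata struct rather than the exporter's real cWMetricMetadata: Go map iteration order is randomized, so collecting the keys and sorting them gives the test a deterministic order to assert against.

package main

import (
	"fmt"
	"sort"
)

// Hypothetical, simplified metadata key for illustration only.
type metadata struct {
	timestampMs int64
	logGroup    string
}

func main() {
	groupedMetrics := map[metadata]string{
		{timestampMs: 200, logGroup: "groupB"}: "second",
		{timestampMs: 100, logGroup: "groupA"}: "first",
	}

	// Collect the map keys, then sort so iteration order is deterministic.
	var metadataList []metadata
	for md := range groupedMetrics {
		metadataList = append(metadataList, md)
	}
	sort.Slice(metadataList, func(i, j int) bool {
		return metadataList[i].timestampMs < metadataList[j].timestampMs
	})

	for _, md := range metadataList {
		fmt.Println(md.logGroup, groupedMetrics[md])
	}
}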

sky333999
sky333999 previously approved these changes Jun 13, 2025
jefchien
jefchien previously approved these changes Jun 13, 2025
Co-authored-by: Jeffrey Chien <chienjef@amazon.com>
@lisguo lisguo dismissed stale reviews from jefchien and sky333999 via af63694 June 13, 2025 20:33
@lisguo lisguo merged commit 5b3f0a1 into aws-cwa-dev Jun 13, 2025
131 of 132 checks passed