
[awsemfexporter] Fix grouping for container insights metrics for mixed metric types #320


Merged
lisguo merged 13 commits into aws-cwa-dev on Jun 13, 2025

Conversation


@lisguo lisguo commented Jun 11, 2025

Description

Currently, if container insights metrics of different metric types exist (counter vs. gauge vs. histogram), the EMF exporter splits them into separate EMF logs.

This is a problem for EBS NVMe metrics, where node_diskio_ebs_volume_queue_length is a gauge while the other metrics are counters. Splitting them across multiple EMF logs causes extra cost.
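
A minimal, hypothetical Go sketch of the grouping behavior (the groupKey type below is illustrative, not the exporter's actual metadata type): because the metric data type is part of the grouping key, the gauge is split into its own group even though its dimensions match the counters.

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// groupKey is a simplified, hypothetical stand-in for the exporter's grouping
// metadata. Since the metric data type is part of the key, a gauge and a
// counter with identical dimensions hash to different groups and therefore
// end up in separate EMF logs.
type groupKey struct {
	namespace   string
	timestampMs int64
	metricType  pmetric.MetricType
	labels      string
}

func main() {
	groups := map[groupKey][]string{}

	counter := groupKey{"ContainerInsights", 1749661186151, pmetric.MetricTypeSum, "ClusterName|InstanceId|NodeName|VolumeId"}
	gauge := groupKey{"ContainerInsights", 1749661186151, pmetric.MetricTypeGauge, "ClusterName|InstanceId|NodeName|VolumeId"}

	groups[counter] = append(groups[counter], "node_diskio_ebs_total_read_ops")
	groups[gauge] = append(groups[gauge], "node_diskio_ebs_volume_queue_length")

	// Two distinct groups -> two EMF logs for what is logically one set of dimensions.
	fmt.Println("EMF logs emitted:", len(groups))
}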

Testing

Tested this on an existing cluster and verified that the EBS NVMe metrics are emitted in a single EMF log:

Before:
Screenshot 2025-06-12 at 12 09 59 PM

EMF Log 1:

    {
        "AutoScalingGroupName": "eks-core-node-group-20250516135915846500000035-54cb6cb2-2f6d-437a-9c5d-35345b597a93",
        "CloudWatchMetrics": [
            {
                "Namespace": "ContainerInsights",
                "Dimensions": [
                    [
                        "ClusterName"
                    ],
                    [
                        "ClusterName",
                        "InstanceId",
                        "NodeName"
                    ],
                    [
                        "ClusterName",
                        "InstanceId",
                        "NodeName",
                        "VolumeId"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_write_ops",
                        "Unit": "Count",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_read_ops",
                        "Unit": "Count",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_write_bytes",
                        "Unit": "Bytes",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_read_bytes",
                        "Unit": "Bytes",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_read_time",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_total_write_time",
                        "Unit": "Second",
                        "StorageResolution": 60
                    },
                    {
                        "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                        "Unit": "Second",
                        "StorageResolution": 60
                    }
                ]
            }
        ],
        "ClusterName": "my-cluster-name",
        "InstanceId": "i-0aebb26742a731ee8",
        "InstanceType": "m5.2xlarge",
        "NodeName": "ip-100-64-144-129.us-east-2.compute.internal",
        "PlatformType": "AWS::EKS",
        "Timestamp": "1749661186151",
        "Type": "NodeEBS",
        "Version": "0",
        "VolumeId": "vol-084d2896789180f9c",
        "http.scheme": "http",
        "instance_id": "i-0aebb26742a731ee8",
        "k8s.namespace.name": "kube-system",
        "kubernetes": {
            "host": "ip-100-64-144-129.us-east-2.compute.internal"
        },
        "net.host.name": "ebs-csi-node.kube-system.svc",
        "net.host.port": "3302",
        "server.address": "ebs-csi-node.kube-system.svc",
        "server.port": "3302",
        "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
        "service.name": "containerInsightsNVMeExporterScraper",
        "url.scheme": "http",
        "volume_id": "vol-084d2896789180f9c",
        "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
        "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
        "node_diskio_ebs_total_read_bytes": 1060864,
        "node_diskio_ebs_total_read_ops": 47,
        "node_diskio_ebs_total_read_time": 0.02844800000000003,
        "node_diskio_ebs_total_write_bytes": 1355776,
        "node_diskio_ebs_total_write_ops": 12,
        "node_diskio_ebs_total_write_time": 0.02070000000003347,
        "node_diskio_ebs_volume_performance_exceeded_iops": 0,
        "node_diskio_ebs_volume_performance_exceeded_tp": 0
    }

EMF Log 2:

{
    "AutoScalingGroupName": "eks-core-node-group-20250516135915846500000035-54cb6cb2-2f6d-437a-9c5d-35345b597a93",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeId"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "my-cluster-name",
    "InstanceId": "i-0aebb26742a731ee8",
    "InstanceType": "m5.2xlarge",
    "NodeName": "ip-100-64-144-129.us-east-2.compute.internal",
    "PlatformType": "AWS::EKS",
    "Timestamp": "1749661186151",
    "Type": "NodeEBS",
    "Version": "0",
    "VolumeId": "vol-084d2896789180f9c",
    "http.scheme": "http",
    "instance_id": "i-0aebb26742a731ee8",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-100-64-144-129.us-east-2.compute.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-084d2896789180f9c",
    "node_diskio_ebs_volume_queue_length": 0
}

After:
Screenshot 2025-06-12 at 12 10 21 PM

Sample Log:

{
    "AutoScalingGroupName": "eks-core-node-group-20250516135915846500000035-54cb6cb2-2f6d-437a-9c5d-35345b597a93",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeId"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_time",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_time",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "my-cluster-name",
    "InstanceId": "i-0aebb26742a731ee8",
    "InstanceType": "m5.2xlarge",
    "NodeName": "ip-100-64-144-129.us-east-2.compute.internal",
    "PlatformType": "AWS::EKS",
    "Timestamp": "1749678166150",
    "Type": "NodeEBS",
    "Version": "0",
    "VolumeId": "vol-084d2896789180f9c",
    "http.scheme": "http",
    "instance_id": "i-0aebb26742a731ee8",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-100-64-144-129.us-east-2.compute.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-084d2896789180f9c",
    "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
    "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
    "node_diskio_ebs_total_read_bytes": 1060864,
    "node_diskio_ebs_total_read_ops": 47,
    "node_diskio_ebs_total_read_time": 0.028466999999999132,
    "node_diskio_ebs_total_write_bytes": 1355776,
    "node_diskio_ebs_total_write_ops": 12,
    "node_diskio_ebs_total_write_time": 0.014161000000058266,
    "node_diskio_ebs_volume_performance_exceeded_iops": 0,
    "node_diskio_ebs_volume_performance_exceeded_tp": 0,
    "node_diskio_ebs_volume_queue_length": 0
}

@lisguo lisguo requested review from movence and jefchien June 11, 2025 22:12
@lisguo lisguo requested review from dricross and removed request for movence June 12, 2025 14:58
@lisguo lisguo marked this pull request as ready for review June 12, 2025 15:30
@lisguo lisguo requested a review from mxiamxia as a code owner June 12, 2025 15:30
@dricross

> emf exporter will split them into different emf log groups.

Do you mean that the exporter splits the metrics into separate log events within the same CloudWatch Log Group? Or do they actually go to separate CloudWatch log groups?

Comment on lines 95 to 98
if metadata.receiver == containerInsightsReceiver {
    // For container insights, put all metrics in the same group regardless of type (ie gauge/counter)
    metadata.groupedMetricMetadata.metricDataType = pmetric.MetricTypeEmpty
}


I am concerned about mixing other metric types into the same log event, e.g. histograms/summary with gauge/counters.

This will only group them into the same log event if they also have the same dimensions, right? If so, then I think it'd be ok.

@lisguo (Author)

We don't emit non-counter metrics for container insights today. But if one of the container insights Prometheus scrapers (neuron/dcgm/nvme/etc.) starts ingesting non-counter metrics, we want them all to be in the same log. Otherwise customers will be charged extra.


Histogram/summary metrics have extra dimensions like "quantile" which we wouldn't want applied to the counter/gauge metrics since that doesn't make any sense. So I think they will have to be in a separate log event.


if metadata.receiver == containerInsightsReceiver {
    // For container insights, put all metrics in the same group regardless of type (ie gauge/counter)
    metadata.groupedMetricMetadata.metricDataType = pmetric.MetricTypeEmpty
}
@lisguo (Author)

Need to test if this affects any other part of container insights. I see that the only other reference to the data type is in the translateGroupedMetricToCWMetric func:

if isPrometheusMetric {
    fields[fieldPrometheusMetricType] = fieldPrometheusTypes[groupedMetric.metadata.metricDataType]
}

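
For reference, a minimal sketch of how that lookup behaves once metricDataType is forced to MetricTypeEmpty. The map contents here are assumed for illustration and are not the exporter's actual fieldPrometheusTypes table.

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// Illustrative stand-in for the exporter's fieldPrometheusTypes map; the real
// contents may differ. With the container insights override, metricDataType is
// pmetric.MetricTypeEmpty, so the lookup falls back to the map's zero value ("")
// instead of "gauge"/"counter".
var fieldPrometheusTypes = map[pmetric.MetricType]string{
	pmetric.MetricTypeGauge: "gauge",
	pmetric.MetricTypeSum:   "counter",
}

func main() {
	fmt.Printf("prom_metric_type for MetricTypeEmpty: %q\n", fieldPrometheusTypes[pmetric.MetricTypeEmpty])
	fmt.Printf("prom_metric_type for MetricTypeGauge: %q\n", fieldPrometheusTypes[pmetric.MetricTypeGauge])
}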

idx := 0

// Sort metadata list to prevent race condition
var metadataList []cWMetricMetadata
@lisguo (Author)

This was a flaky test. The GitHub runner runs the unit tests with -race, so I had to fix it by sorting the metadata list.
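
A minimal sketch of the sort-before-iterate pattern described above, using a hypothetical metadata struct rather than the exporter's real cWMetricMetadata: Go map iteration order is randomized, so collecting the keys and sorting them gives the test a deterministic order to assert against.

package main

import (
	"fmt"
	"sort"
)

// Hypothetical, simplified metadata key for illustration only.
type metadata struct {
	timestampMs int64
	logGroup    string
}

func main() {
	groupedMetrics := map[metadata]string{
		{timestampMs: 200, logGroup: "groupB"}: "second",
		{timestampMs: 100, logGroup: "groupA"}: "first",
	}

	// Collect the map keys, then sort so iteration order is deterministic.
	var metadataList []metadata
	for md := range groupedMetrics {
		metadataList = append(metadataList, md)
	}
	sort.Slice(metadataList, func(i, j int) bool {
		return metadataList[i].timestampMs < metadataList[j].timestampMs
	})

	for _, md := range metadataList {
		fmt.Println(md.logGroup, groupedMetrics[md])
	}
}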

sky333999
sky333999 previously approved these changes Jun 13, 2025
jefchien
jefchien previously approved these changes Jun 13, 2025
Co-authored-by: Jeffrey Chien <chienjef@amazon.com>
@lisguo lisguo dismissed stale reviews from jefchien and sky333999 via af63694 June 13, 2025 20:33
@lisguo lisguo merged commit 5b3f0a1 into aws-cwa-dev Jun 13, 2025
131 of 132 checks passed