Add scraper for NVME Container Insights prometheus metrics in EKS (#296) #309


Merged
zhihonl merged 2 commits into aws-cwa-dev on Apr 24, 2025

Conversation

@zhihonl commented Apr 22, 2025

Important

This PR assumes the CSI driver will change its Kubernetes service traffic policy to Local. Otherwise, requests routed through the service IP land on an arbitrary node, so the scraped metrics will not reliably come from the local node.

Description

Customers who have the EBS CSI driver add-on installed should see new EBS volume metrics when they enable Container Insights. This is not the current behavior, since the Container Insights receiver has no functionality to support this use case.

This PR adds a Prometheus scraper that scrapes port 3302, the default port the CSI driver exposes for disk metrics: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md#ebs-node-metrics
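
For illustration, here is a minimal sketch (in Go, following the existing DCGM scraper pattern) of what a static scrape config against that port could look like. The package name, helper name, and intervals below are assumptions for the sketch, not the exact code in this PR; only the target address and job name are taken from the EMF output shown under Testing.

package nvme // hypothetical package name, for the sketch only

import (
	"time"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/config"
	"github.com/prometheus/prometheus/discovery"
)

// ebsCSIScrapeConfig sketches a static scrape of the EBS CSI node service on its
// default disk-metrics port (3302). The real scraper in this PR also attaches
// relabel configs (see getMetricRelabelConfig in the review threads below).
func ebsCSIScrapeConfig() *config.ScrapeConfig {
	return &config.ScrapeConfig{
		JobName:        "containerInsightsNVMeExporterScraper", // becomes service.name in the EMF output
		ScrapeInterval: model.Duration(60 * time.Second),
		ScrapeTimeout:  model.Duration(10 * time.Second),
		MetricsPath:    "/metrics",
		Scheme:         "http",
		ServiceDiscoveryConfigs: discovery.Configs{
			discovery.StaticConfig{
				{Targets: []model.LabelSet{
					{model.AddressLabel: "ebs-csi-node.kube-system.svc:3302"},
				}},
			},
		},
	}
}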

Link to tracking issue

Testing

Unit test.

Manual Testing

EMF Output

{
    "AutoScalingGroupName": "eks-ng-27c93684-7ec8b6de-01de-1dd8-638d-256a326dd637",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeID"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "compass-cluster-iad",
    "InstanceId": "i-123456789",
    "InstanceType": "m5.large",
    "NodeName": "ip-123-456-7-890.ec2.internal",
    "Timestamp": "1743094298457",
    "Version": "0",
    "VolumeID": "vol-123456789",
    "http.scheme": "http",
    "instance_id": "i-123456789",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-123-456-7-890.ec2.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-123456789",
    "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
    "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
    "node_diskio_ebs_total_read_bytes": 4437120000,
    "node_diskio_ebs_total_read_ops": 160206,
    "node_diskio_ebs_total_read_time": 100.642948,
    "node_diskio_ebs_total_write_bytes": 51040256,
    "node_diskio_ebs_total_write_ops": 213,
    "node_diskio_ebs_total_write_time": 0.611623,
    "node_diskio_ebs_volume_performance_exceeded_iops": 0,
    "node_diskio_ebs_volume_performance_exceeded_tp": 0,
    "node_diskio_ebs_volume_queue_length": 0
}
[Screenshot 2025-03-27 at 12:54:55 PM]

* Add scraper for NVME Container Insights prometheus metrics in EKS

* Remove unnecessary code

* Fix linter error

* Generalize relabel configs and fix naming conventions

* Change unit map names
@zhihonl zhihonl requested a review from mxiamxia as a code owner April 22, 2025 16:22
@duhminick

Adding this comment as reference: aws/amazon-cloudwatch-agent#1625 (comment).

I think we should probably use this method as opposed to using a cumulativetodelta processor (as long as it works as expected -- from my testing it did seem to be the case)

Regex: relabel.MustNewRegexp(".*_bucket|.*_sum|.*_count.*"),
Action: relabel.Drop,
},
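
For context, a sketch of how that drop rule sits inside a metric relabel config. Only the Regex/Action pair above is quoted from the diff; the SourceLabels line and the helper wrapping it are illustrative assumptions.

package nvme // hypothetical, same sketch package as above

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// dropHistogramSeries drops the _bucket/_sum/_count series produced by histogram
// metrics at scrape time, instead of reshaping them later with a
// cumulativetodelta processor.
func dropHistogramSeries() *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{"__name__"},
		Regex:        relabel.MustNewRegexp(".*_bucket|.*_sum|.*_count.*"),
		Action:       relabel.Drop,
	}
}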
// Hacky way to inject static values (clusterName/instanceId/nodeName/volumeID)

Remind me -- do we do this anywhere else, i.e. where we use relabel configs to inject clusterName, instanceId, etc.?

zhihonl (Author):

Yeah, we do similar stuff in the DCGM scraper:

// hacky way to inject static values (clusterName/instanceId/instanceType)
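
For readers unfamiliar with the pattern, a rough sketch of what "injecting a static value" via relabeling means: a Replace rule whose regex matches everything and whose Replacement is a value computed when the config is built. Taking the cluster name as a plain string here is a simplification for the sketch; the actual scrapers pull it from hostInfoProvider.

package nvme // hypothetical, sketch only

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// injectClusterName is a hypothetical helper showing the injection pattern:
// the regex always matches, so every series gets a ClusterName label whose
// value was resolved at config-build time.
func injectClusterName(clusterName string) *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{"__address__"}, // any label that is always present
		Regex:        relabel.MustNewRegexp(".*"),
		TargetLabel:  "ClusterName",
		Replacement:  clusterName,
		Action:       relabel.Replace,
	}
}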

Action: relabel.Keep,
},

// Below metrics are histogram type which are not supported for container insights yet

Let's make sure we have a TODO item in the backlog for this. Once we support Prometheus histograms, we should add these back.

zhihonl (Author):

I think @duhminick investigated this, and it was proposed not to pursue these metrics because the emitted metric isn't actually useful.

}
}

func getMetricRelabelConfig(hostInfoProvider hostInfoProvider) []*relabel.Config {

Why do we need relabel configs here? Can't we just take all the metrics scraped from the endpoint?

Sorry if this was answered before -- I can't remember the previous PR

zhihonl (Author):

It's for relabeling the dimensions to the values we actually want to see in CloudWatch. You could scrape them as-is and use another processor to relabel them, but this just follows the DCGM pattern:

func getMetricRelabelConfig(hostInfoProvider hostInfoProvider) []*relabel.Config {
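
To make that concrete, a hedged sketch of the kind of rename rule such a list might contain: copying the scraped volume_id label into the VolumeID dimension seen in the EMF output above. The exact label names and rule shape in the PR may differ.

package nvme // hypothetical, sketch only

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// renameVolumeIDLabel copies the scraped volume_id label into the VolumeID
// dimension that Container Insights expects on CloudWatch.
func renameVolumeIDLabel() *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{"volume_id"},
		Regex:        relabel.MustNewRegexp("(.*)"),
		TargetLabel:  "VolumeID",
		Replacement:  "${1}",
		Action:       relabel.Replace,
	}
}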

@duhminick

I know this works well, so I'm okay with approving it. Though I still wonder what the behavior of this is (outside of the histogram stuff that we had talked about):

if serviceName, ok := rm.Resource().Attributes().Get("service.name"); ok {
	if strings.HasPrefix(serviceName.Str(), "containerInsightsKubeAPIServerScraper") ||
		strings.HasPrefix(serviceName.Str(), "containerInsightsDCGMExporterScraper") ||
		strings.HasPrefix(serviceName.Str(), "containerInsightsNeuronMonitorScraper") ||
		strings.HasPrefix(serviceName.Str(), "containerInsightsKueueMetricsScraper") {
		// the prometheus metrics that come from the container insight receiver need to be clearly tagged as coming from container insights
		metricReceiver = containerInsightsReceiver
	}
}

@zhihonl merged commit 54ac70e into aws-cwa-dev on Apr 24, 2025
127 of 147 checks passed