Add scraper for NVME Container Insights prometheus metrics in EKS #296

zhihonl · 2025-03-27T16:50:30Z

Important

This PR assumes the EBS CSI driver will change the traffic policy of its Kubernetes service to local. Otherwise, a request to the service IP can be routed to a pod on any node, so the scraped metrics would come from a random node.

Description

Customers who have the EBS CSI driver add-on installed should see new EBS volume metrics when they enable Container Insights. This is not the current behavior, because the Container Insights receiver has no functionality to support this use case.

This PR adds a Prometheus scraper that scrapes port 3302, the default port the CSI driver exposes for disk metrics: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md#ebs-node-metrics
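For readers unfamiliar with what a scrape of that port returns, here is a minimal sketch of parsing a Prometheus text-format payload. It is not the receiver's code (the diff suggests the receiver builds on Prometheus scrape and relabel configuration); the `/metrics` path, the sample payload, the gauge type, and the `parseSamples` helper are all assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Hypothetical payload. A real scraper would fetch this over HTTP from the
// node's port 3302 (the /metrics path is an assumption, not from the PR).
const sample = `# TYPE aws_ebs_csi_volume_queue_length gauge
aws_ebs_csi_volume_queue_length{instance_id="i-123456789",volume_id="vol-123456789"} 0
`

// parseSamples reads Prometheus text exposition lines and returns the values
// keyed by the full series string (metric name plus labels). Comment lines
// and blank lines are skipped. Optional trailing timestamps are not handled.
func parseSamples(body string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The sample value is the last space-separated field.
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		v, err := strconv.ParseFloat(line[idx+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[strings.TrimSpace(line[:idx])] = v
	}
	return samples, sc.Err()
}

func main() {
	samples, err := parseSamples(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for series, value := range samples {
		fmt.Printf("%s = %v\n", series, value)
	}
}
```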

Link to tracking issue

[EKS] [aws-ebs-csi-driver][CloudWatch Observability]: Ingest Storage Metrics aws/containers-roadmap#2377

Testing

Unit test.

Manual Testing

EMF Output

{
    "AutoScalingGroupName": "eks-ng-27c93684-7ec8b6de-01de-1dd8-638d-256a326dd637",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeID"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "compass-cluster-iad",
    "InstanceId": "i-123456789",
    "InstanceType": "m5.large",
    "NodeName": "ip-123-456-7-890.ec2.internal",
    "Timestamp": "1743094298457",
    "Version": "0",
    "VolumeID": "vol-123456789",
    "http.scheme": "http",
    "instance_id": "i-123456789",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-123-456-7-890.ec2.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-123456789",
    "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
    "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
    "node_diskio_ebs_total_read_bytes": 4437120000,
    "node_diskio_ebs_total_read_ops": 160206,
    "node_diskio_ebs_total_read_time": 100.642948,
    "node_diskio_ebs_total_write_bytes": 51040256,
    "node_diskio_ebs_total_write_ops": 213,
    "node_diskio_ebs_total_write_time": 0.611623,
    "node_diskio_ebs_volume_performance_exceeded_iops": 0,
    "node_diskio_ebs_volume_performance_exceeded_tp": 0,
    "node_diskio_ebs_volume_queue_length": 0
}

duhminick · 2025-03-27T17:57:29Z

receiver/awscontainerinsightreceiver/internal/nvme/nvmescraper_config.go

+			Action:       relabel.Replace,
+		},
+
+		// The metrics below are histograms, which are not supported for container insights yet


Is there a way to configure the relabels such that we drop any metrics that might exist in the future? My concern is that if there is a new metric, we probably will miss the chance to rename it before it starts getting emitted.

Would be a concern to start emitting some metric, then start emitting it with a new name. Might cause some customer confusion.

Fixed in new commit. Going to remove metric names with histogram pattern.

Oh, I think I didn't explain it properly. So we have all of these metrics prefixed with aws_ebs_csi_. If in the future EBS adds a new metric with the same prefix, then this scraper is gonna pick it up. The issue is that we won't have a rename/transformation for it.

I don't think this is an issue anymore. All transformations will be done on the agent side using metricstransformprocessor. The EMF exporter has a configuration to emit only allowlisted metrics, so it will never emit metrics we don't know about anyway. So the responsibility of the Container Insights receiver is really just to scrape any valid metrics.
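To make the "remove metric names with histogram pattern" step in this thread concrete, here is a minimal sketch of such a filter. The exact pattern used in the PR is not shown, so the regex below is an assumption, as is the `aws_ebs_csi_read_io_latency_seconds` histogram name used in the example; only the `aws_ebs_csi_` prefix comes from the discussion. It relies on the fact that a Prometheus histogram is exposed as `_bucket`, `_sum`, and `_count` series.

```go
package main

import (
	"fmt"
	"regexp"
)

// histogramSeries matches the series names that a Prometheus histogram
// expands to. The pattern is an assumption, not the one from the PR.
var histogramSeries = regexp.MustCompile(`^aws_ebs_csi_.*_(bucket|sum|count)$`)

// dropHistogramSeries returns the names that do not look like histogram series.
func dropHistogramSeries(names []string) []string {
	kept := make([]string, 0, len(names))
	for _, n := range names {
		if !histogramSeries.MatchString(n) {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	names := []string{
		"aws_ebs_csi_volume_queue_length",
		// Hypothetical histogram series names, for illustration only.
		"aws_ebs_csi_read_io_latency_seconds_bucket",
		"aws_ebs_csi_read_io_latency_seconds_sum",
		"aws_ebs_csi_read_io_latency_seconds_count",
	}
	fmt.Println(dropHistogramSeries(names))
}
```

One caveat of a suffix-based pattern like this: a non-histogram metric whose name happens to end in `_count` or `_sum` would be dropped too, which is part of why an allowlist further downstream is the safer guard.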

internal/aws/containerinsight/k8sconst.go

internal/aws/containerinsight/const.go

lisguo · 2025-03-27T18:16:00Z

receiver/awscontainerinsightreceiver/internal/nvme/metric_unit.go

+	ebsExceededTPTime      = "aws_ebs_csi_exceeded_tp_seconds_total"
+	ebsExceededEC2IOPSTime = "aws_ebs_csi_ec2_exceeded_iops_seconds_total"
+	ebsExceededEC2TPTime   = "aws_ebs_csi_ec2_exceeded_tp_seconds_total"
+	ebsVolumeQueueLength   = "aws_ebs_csi_volume_queue_length"


There was some discussion if we can convert the histogram metrics:

node_diskio_ebs_read_io_latency node_diskio_ebs_write_io_latency

to statistics (avg/min/max). I think we already do this for api server metrics?

Seems like we already do this for some histograms - https://github.com/amazon-contributing/opentelemetry-collector-contrib/blob/aws-cwa-dev/exporter/awsemfexporter/datapoint.go#L188

In the NVME metrics one pager I see latency metrics are not included so not adding them here.

Sure, but I know there was some discussion on that doc with the container insights team about whether we could include the latency metrics, since that would be valuable to customers and we don't want to have to add them later after we have launched.

We can address this in a follow up -- but imo we should investigate if we can include the latency metrics as part of the launch

Agreed. This PR can be the baseline, and we can implement the histogram metrics separately if needed.
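For the follow-up, a rough sketch of what "histogram to statistics" could mean is below. This is not how awsemfexporter does it (see the link above for that); it is a simplified illustration in which the average is exact (sum divided by count) while min and max are only estimated from bucket boundaries. The `bucket`, `stats`, and `summarize` names and the sample values are hypothetical, and observations are assumed to be non-negative, as latencies are.

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one Prometheus histogram bucket: the number of observations
// less than or equal to upperBound. Counts are cumulative.
type bucket struct {
	upperBound float64
	cumulative uint64
}

type stats struct {
	min, max, avg float64
	count         uint64
}

// summarize estimates min/max/avg from buckets sorted by ascending bound.
// min is the lower edge of the first non-empty bucket, max is the upper edge
// of the first bucket that already contains every observation. If that bucket
// is +Inf, the largest finite bound is used as a lower estimate of max.
func summarize(buckets []bucket, sum float64) (stats, bool) {
	if len(buckets) == 0 {
		return stats{}, false
	}
	total := buckets[len(buckets)-1].cumulative
	if total == 0 {
		return stats{}, false
	}
	st := stats{count: total, avg: sum / float64(total)}
	lower := 0.0 // assumes non-negative observations
	foundMin := false
	for _, b := range buckets {
		if !foundMin && b.cumulative > 0 {
			st.min = lower
			foundMin = true
		}
		if b.cumulative == total {
			st.max = b.upperBound
			if math.IsInf(b.upperBound, 1) {
				st.max = lower
			}
			break
		}
		lower = b.upperBound
	}
	return st, true
}

func main() {
	// Hypothetical latency histogram, in seconds.
	buckets := []bucket{{0.5, 0}, {1, 3}, {2, 5}, {math.Inf(1), 5}}
	if st, ok := summarize(buckets, 6); ok {
		fmt.Printf("min=%v max=%v avg=%v count=%d\n", st.min, st.max, st.avg, st.count)
	}
}
```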

lisguo · 2025-03-27T18:31:21Z

receiver/awscontainerinsightreceiver/internal/nvme/metric_unit.go

+	ebsVolumeQueueLength   = "aws_ebs_csi_volume_queue_length"
+
+	// Converted Names
+	nodeReadOpsTotal        = "node_diskio_ebs_total_read_ops"


do we want cumulative metrics?

I do see in the docs we have -total metrics for neuron and nvidia... https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-enhanced-EKS.html

Might be worth checking with container insights team

In the NVME one-pager it's listed as one of the metrics so adding it here.
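As background for the cumulative question: the `_total` values in the EMF output above are running counters, so a consumer that wants per-interval numbers has to subtract consecutive samples. The sketch below shows one way to do that. It is illustrative only and does not describe what Container Insights or the agent actually does; the `deltaTracker` type and the series key format are hypothetical.

```go
package main

import "fmt"

// deltaTracker remembers the last cumulative value seen for each series.
type deltaTracker struct {
	last map[string]float64
}

func newDeltaTracker() *deltaTracker {
	return &deltaTracker{last: make(map[string]float64)}
}

// observe records a cumulative value and returns the increase since the
// previous sample. It reports false for the first sample of a series and
// when the value went down, which usually means the counter was reset.
func (t *deltaTracker) observe(series string, value float64) (float64, bool) {
	prev, seen := t.last[series]
	t.last[series] = value
	if !seen || value < prev {
		return 0, false
	}
	return value - prev, true
}

func main() {
	tracker := newDeltaTracker()
	// Hypothetical key combining metric name and volume ID.
	key := "node_diskio_ebs_total_read_ops/vol-123456789"
	for _, v := range []float64{160206, 160300, 12} {
		if d, ok := tracker.observe(key, v); ok {
			fmt.Printf("delta=%v\n", d)
		} else {
			fmt.Println("no delta (first sample or counter reset)")
		}
	}
}
```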

lisguo · 2025-03-27T18:31:45Z

receiver/awscontainerinsightreceiver/internal/nvme/metric_unit.go

+)
+
+var MetricToUnit = map[string]string{
+	nodeReadOpsTotal:        "Count",


make the units consts. Surprised these aren't already consts

Will address in next commit

Fixed in new commit
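A sketch of what unit constants could look like, for reference. The metric names and units are taken from the EMF output in the PR description; the constant names, the lowercase `metricToUnit` map, and the `unitFor` helper are hypothetical and may differ from what the commit actually introduced.

```go
package main

import "fmt"

// Unit names as they appear in the EMF output. Constant names are hypothetical.
const (
	unitCount   = "Count"
	unitBytes   = "Bytes"
	unitSeconds = "Seconds"
)

// metricToUnit maps each converted metric name to its CloudWatch unit.
var metricToUnit = map[string]string{
	"node_diskio_ebs_total_read_ops":                         unitCount,
	"node_diskio_ebs_total_write_ops":                        unitCount,
	"node_diskio_ebs_total_read_bytes":                       unitBytes,
	"node_diskio_ebs_total_write_bytes":                      unitBytes,
	"node_diskio_ebs_total_read_time":                        unitSeconds,
	"node_diskio_ebs_total_write_time":                       unitSeconds,
	"node_diskio_ebs_volume_performance_exceeded_iops":       unitSeconds,
	"node_diskio_ebs_volume_performance_exceeded_tp":         unitSeconds,
	"node_diskio_ebs_ec2_instance_performance_exceeded_iops": unitSeconds,
	"node_diskio_ebs_ec2_instance_performance_exceeded_tp":   unitSeconds,
	"node_diskio_ebs_volume_queue_length":                    unitCount,
}

// unitFor looks up the unit for a metric and reports whether it is known.
func unitFor(metric string) (string, bool) {
	u, ok := metricToUnit[metric]
	return u, ok
}

func main() {
	if u, ok := unitFor("node_diskio_ebs_total_read_bytes"); ok {
		fmt.Println("unit:", u)
	}
}
```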

lisguo · 2025-03-27T18:32:46Z

receiver/awscontainerinsightreceiver/internal/nvme/nvmescraper_config.go

+			TargetLabel:  "__name__",
+			Regex:        relabel.MustNewRegexp(ebsReadOpsTotal),
+			Replacement:  nodeReadOpsTotal,
+			Action:       relabel.Replace,


Is this what we do for the other scrapers? Do we set up relabel configs for each metric?

https://github.com/aws/amazon-cloudwatch-agent/blob/f51ad4ac8a18a64f5c55878b4062c4b3c5837da1/translator/tocwconfig/sampleConfig/emf_and_kubernetes_with_gpu_config.yaml#L697

Looks like the metric is converted in the agent repo as part of the translation. I can move the logic there instead.
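For context, the per-metric rename being discussed boils down to a name-to-name mapping. The sketch below imitates a relabel `Replace` on `__name__` using only the standard library `regexp` package rather than the Prometheus relabel package, so it is an analogy and not the PR's code. The source names come from the diff and the target names from the EMF output, but the pairing between them is an assumption, and `renameRule`, `newRule`, and `rename` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"regexp"
)

// renameRule rewrites a metric name when the whole name matches regex.
type renameRule struct {
	regex       *regexp.Regexp
	replacement string
}

// newRule anchors the pattern on both ends, which is how Prometheus relabel
// regexes behave (they must match the entire value).
func newRule(source, target string) renameRule {
	return renameRule{regexp.MustCompile("^(?:" + source + ")$"), target}
}

// rules pairs scraped names with converted names. The pairing is assumed.
var rules = []renameRule{
	newRule("aws_ebs_csi_exceeded_tp_seconds_total", "node_diskio_ebs_volume_performance_exceeded_tp"),
	newRule("aws_ebs_csi_ec2_exceeded_iops_seconds_total", "node_diskio_ebs_ec2_instance_performance_exceeded_iops"),
	newRule("aws_ebs_csi_ec2_exceeded_tp_seconds_total", "node_diskio_ebs_ec2_instance_performance_exceeded_tp"),
	newRule("aws_ebs_csi_volume_queue_length", "node_diskio_ebs_volume_queue_length"),
}

// rename applies the first matching rule and leaves unknown names unchanged.
func rename(name string, rs []renameRule) string {
	for _, r := range rs {
		if r.regex.MatchString(name) {
			return r.replacement
		}
	}
	return name
}

func main() {
	fmt.Println(rename("aws_ebs_csi_volume_queue_length", rules))
	fmt.Println(rename("aws_ebs_csi_some_future_metric", rules))
}
```

The second call shows the concern raised earlier in the review: a metric without a rule passes through under its original name, so something downstream has to decide whether to emit it.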

* Add scraper for NVME Container Insights prometheus metrics in EKS * Remove unnecessary code * Fix linter error * Generalize relabel configs and fix naming conventions * Change unit map names

…) (#309) * Add scraper for NVME Container Insights prometheus metrics in EKS (#296) * Add scraper for NVME Container Insights prometheus metrics in EKS * Remove unnecessary code * Fix linter error * Generalize relabel configs and fix naming conventions * Change unit map names * Fix unit test failures

zhihonl added 2 commits March 26, 2025 11:13

Add scraper for NVME Container Insights prometheus metrics in EKS

520158e

Remove unnecessary code

aa9d512

zhihonl requested a review from mxiamxia as a code owner March 27, 2025 16:50

Fix linter error

67e7885

duhminick reviewed Mar 27, 2025

lisguo reviewed Mar 27, 2025

Generalize relabel configs and fix naming conventions

e6b1e4c

zhihonl force-pushed the nvme-metric branch from 0e5c705 to e6b1e4c Compare March 27, 2025 19:04

Change unit map names

a60fa1c

duhminick approved these changes Mar 28, 2025

Merge branch 'aws-cwa-dev' into nvme-metric

1f7ea25

lisguo approved these changes Mar 28, 2025

Merge branch 'aws-cwa-dev' into nvme-metric

e5056f1

zhihonl merged commit 36a8d2f into aws-cwa-dev Mar 28, 2025
128 of 147 checks passed

zhihonl mentioned this pull request Mar 28, 2025

Add translation logic for NVME metrics aws/amazon-cloudwatch-agent#1625

Merged

musa-asad mentioned this pull request Apr 7, 2025

Sync contrib components to v0.0.0-20250414174532-cb2f77072864 aws/amazon-cloudwatch-agent#1640

Merged

musa-asad added a commit that referenced this pull request Apr 14, 2025

Revert "Add scraper for NVME Container Insights prometheus metrics in…

d89b775

… EKS (#296)" This reverts commit 36a8d2f.

musa-asad added a commit that referenced this pull request Apr 14, 2025

Revert "Add scraper for NVME Container Insights prometheus metrics in…

cb2f770

… EKS (#296)" (#308) This reverts commit 36a8d2f.
