Skip to content

Add scraper for NVME Container Insights prometheus metrics in EKS #296

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 28, 2025

Conversation

zhihonl
Copy link

@zhihonl zhihonl commented Mar 27, 2025

Important

This PR assumes CSI driver will modify their Kubernetes service policy to be local. Otherwise the metrics routed from the service IP will be random.

Description

Customers who have ebs csi driver addon installed should see new ebs volume metrics when enabling container insights. This is not the current behavior since container insights receiver does not have any functionality to support this use case.

This PR adds prometheus scraper to scrape at port 3302 which is the default port exposed by CSI driver for disk metrics: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md#ebs-node-metrics

Link to tracking issue

Testing

Unit test.

Manual Testing

EMF Output

{
    "AutoScalingGroupName": "eks-ng-27c93684-7ec8b6de-01de-1dd8-638d-256a326dd637",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeID"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "compass-cluster-iad",
    "InstanceId": "i-123456789",
    "InstanceType": "m5.large",
    "NodeName": "ip-123-456-7-890.ec2.internal",
    "Timestamp": "1743094298457",
    "Version": "0",
    "VolumeID": "vol-123456789",
    "http.scheme": "http",
    "instance_id": "i-123456789",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-123-456-7-890.ec2.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-123456789",
    "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
    "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
    "node_diskio_ebs_total_read_bytes": 4437120000,
    "node_diskio_ebs_total_read_ops": 160206,
    "node_diskio_ebs_total_read_time": 100.642948,
    "node_diskio_ebs_total_write_bytes": 51040256,
    "node_diskio_ebs_total_write_ops": 213,
    "node_diskio_ebs_total_write_time": 0.611623,
    "node_diskio_ebs_volume_performance_exceeded_iops": 0,
    "node_diskio_ebs_volume_performance_exceeded_tp": 0,
    "node_diskio_ebs_volume_queue_length": 0
}
Screenshot 2025-03-27 at 12 54 55 PM

@zhihonl zhihonl requested a review from mxiamxia as a code owner March 27, 2025 16:50
Action: relabel.Replace,
},

// Below metrics are historgram which are not supported for container insights yet

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to configure the relabels such that we drop any metrics that might exist in the future? My concern is that if there is a new metric, we probably will miss the chance to rename it before it starts getting emitted.

Would be a concern to start emitting some metric, then start emitting it with a new name. Might cause some customer cnofusion

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in new commit. Going to remove metric names with histogram pattern.

Copy link

@duhminick duhminick Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think I didn't explain it properly. So we have all of these metrics prefixed with aws_ebs_csi_. If in the future EBS adds a new metric with the same prefix, then this scraper is gonna pick it up. The issue is that we won't have a rename/transformation

Copy link
Author

@zhihonl zhihonl Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is an issue anymore. All transformations will be done on agent side using metricstransformprocessor. EMF exporter has a configuration to emit only allowlisted metrics so it will never emit the metrics we don't know about anyway. So the responsibility for container insight receiver is really just to scrape any valid metrics.

ebsExceededTPTime = "aws_ebs_csi_exceeded_tp_seconds_total"
ebsExceededEC2IOPSTime = "aws_ebs_csi_ec2_exceeded_iops_seconds_total"
ebsExceededEC2TPTime = "aws_ebs_csi_ec2_exceeded_tp_seconds_total"
ebsVolumeQueueLength = "aws_ebs_csi_volume_queue_length"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some discussion if we can convert the histogram metrics:

node_diskio_ebs_read_io_latency
node_diskio_ebs_write_io_latency

to statistics (avg/min/max). I think we already do this for api server metrics?

Seems like we already do this for some histograms - https://github.com/amazon-contributing/opentelemetry-collector-contrib/blob/aws-cwa-dev/exporter/awsemfexporter/datapoint.go#L188

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the NVME metrics one pager I see latency metrics are not included so not adding them here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I know there was some discussion on that doc with the container insights team if we could include the latency metrics since that would be valuable to customers and we don't want to have to add it later after we have launched

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can address this in a follow up -- but imo we should investigate if we can include the latency metrics as part of the launch

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This PR can be the baseline and histogram metrics we can implement it separately if needed.

ebsVolumeQueueLength = "aws_ebs_csi_volume_queue_length"

// Converted Names
nodeReadOpsTotal = "node_diskio_ebs_total_read_ops"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want cumulative metrics?

I do see in the docs we have -total metrics for neuron and nvidia... https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-enhanced-EKS.html

Might be worth checking with container insights team

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the NVME one-pager it's listed as one of the metrics so adding it here.

)

var MetricToUnit = map[string]string{
nodeReadOpsTotal: "Count",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the units consts. Surprised these aren't already consts

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address in next commit

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in new commit

TargetLabel: "__name__",
Regex: relabel.MustNewRegexp(ebsReadOpsTotal),
Replacement: nodeReadOpsTotal,
Action: relabel.Replace,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this what we do for the other scrapers? We setup relabel configs for each metric?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/aws/amazon-cloudwatch-agent/blob/f51ad4ac8a18a64f5c55878b4062c4b3c5837da1/translator/tocwconfig/sampleConfig/emf_and_kubernetes_with_gpu_config.yaml#L697

Looks like the metric is converted on agent repo as part of the translation. I can move the logic there instead.

@zhihonl zhihonl merged commit 36a8d2f into aws-cwa-dev Mar 28, 2025
128 of 147 checks passed
musa-asad added a commit that referenced this pull request Apr 14, 2025
musa-asad added a commit that referenced this pull request Apr 14, 2025
zhihonl added a commit that referenced this pull request Apr 22, 2025
* Add scraper for NVME Container Insights prometheus metrics in EKS

* Remove unnecessary code

* Fix linter error

* Generalize relabel configs and fix naming conventions

* Change unit map names
zhihonl added a commit that referenced this pull request Apr 24, 2025
…) (#309)

* Add scraper for NVME Container Insights prometheus metrics in EKS (#296)

* Add scraper for NVME Container Insights prometheus metrics in EKS

* Remove unnecessary code

* Fix linter error

* Generalize relabel configs and fix naming conventions

* Change unit map names

* Fix unit test failures
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants