Add scraper for NVME Container Insights prometheus metrics in EKS (#296) #309


Merged
zhihonl merged 2 commits into aws-cwa-dev on Apr 24, 2025

Conversation

@zhihonl commented Apr 22, 2025

Important

This PR assumes the CSI driver will change its Kubernetes service traffic policy to Local. Otherwise, requests routed through the service IP land on an arbitrary node, so the scraped metrics will not reliably come from the local node.

Description

Customers who have the EBS CSI driver add-on installed should see new EBS volume metrics when they enable Container Insights. This is not the current behavior, since the Container Insights receiver has no functionality to support this use case.

This PR adds a Prometheus scraper that scrapes port 3302, the default port the CSI driver exposes for disk metrics: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md#ebs-node-metrics
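
For illustration, here is a minimal sketch (in Go, following the existing DCGM scraper pattern) of what a static scrape config against that port could look like. The package name, helper name, and intervals below are assumptions for the sketch, not the exact code in this PR; only the target address and job name are taken from the EMF output shown under Testing.

package nvme // hypothetical package name, for the sketch only

import (
	"time"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/config"
	"github.com/prometheus/prometheus/discovery"
)

// ebsCSIScrapeConfig sketches a static scrape of the EBS CSI node service on its
// default disk-metrics port (3302). The real scraper in this PR also attaches
// relabel configs (see getMetricRelabelConfig in the review threads below).
func ebsCSIScrapeConfig() *config.ScrapeConfig {
	return &config.ScrapeConfig{
		JobName:        "containerInsightsNVMeExporterScraper", // becomes service.name in the EMF output
		ScrapeInterval: model.Duration(60 * time.Second),
		ScrapeTimeout:  model.Duration(10 * time.Second),
		MetricsPath:    "/metrics",
		Scheme:         "http",
		ServiceDiscoveryConfigs: discovery.Configs{
			discovery.StaticConfig{
				{Targets: []model.LabelSet{
					{model.AddressLabel: "ebs-csi-node.kube-system.svc:3302"},
				}},
			},
		},
	}
}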

Link to tracking issue

Testing

Unit test.

Manual Testing

EMF Output

{
    "AutoScalingGroupName": "eks-ng-27c93684-7ec8b6de-01de-1dd8-638d-256a326dd637",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeID"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_read_time",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_volume_performance_exceeded_iops",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_ec2_instance_performance_exceeded_tp",
                    "Unit": "Seconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_ebs_total_write_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "compass-cluster-iad",
    "InstanceId": "i-123456789",
    "InstanceType": "m5.large",
    "NodeName": "ip-123-456-7-890.ec2.internal",
    "Timestamp": "1743094298457",
    "Version": "0",
    "VolumeID": "vol-123456789",
    "http.scheme": "http",
    "instance_id": "i-123456789",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-123-456-7-890.ec2.internal"
    },
    "net.host.name": "ebs-csi-node.kube-system.svc",
    "net.host.port": "3302",
    "server.address": "ebs-csi-node.kube-system.svc",
    "server.port": "3302",
    "service.instance.id": "ebs-csi-node.kube-system.svc:3302",
    "service.name": "containerInsightsNVMeExporterScraper",
    "url.scheme": "http",
    "volume_id": "vol-123456789",
    "node_diskio_ebs_ec2_instance_performance_exceeded_iops": 0,
    "node_diskio_ebs_ec2_instance_performance_exceeded_tp": 0,
    "node_diskio_ebs_total_read_bytes": 4437120000,
    "node_diskio_ebs_total_read_ops": 160206,
    "node_diskio_ebs_total_read_time": 100.642948,
    "node_diskio_ebs_total_write_bytes": 51040256,
    "node_diskio_ebs_total_write_ops": 213,
    "node_diskio_ebs_total_write_time": 0.611623,
    "node_diskio_ebs_volume_performance_exceeded_iops": 0,
    "node_diskio_ebs_volume_performance_exceeded_tp": 0,
    "node_diskio_ebs_volume_queue_length": 0
}
[Screenshot 2025-03-27 at 12:54:55 PM]

* Add scraper for NVME Container Insights prometheus metrics in EKS

* Remove unnecessary code

* Fix linter error

* Generalize relabel configs and fix naming conventions

* Change unit map names
@zhihonl zhihonl requested a review from mxiamxia as a code owner April 22, 2025 16:22
@duhminick

Adding this comment as reference: aws/amazon-cloudwatch-agent#1625 (comment).

I think we should probably use this method as opposed to using a cumulativetodelta processor (as long as it works as expected -- from my testing it did seem to be the case)

Regex: relabel.MustNewRegexp(".*_bucket|.*_sum|.*_count.*"),
Action: relabel.Drop,
},
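
For context, a sketch of how that drop rule sits inside a metric relabel config. Only the Regex/Action pair above is quoted from the diff; the SourceLabels line and the helper wrapping it are illustrative assumptions.

package nvme // hypothetical, same sketch package as above

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// dropHistogramSeries drops the _bucket/_sum/_count series produced by histogram
// metrics at scrape time, instead of reshaping them later with a
// cumulativetodelta processor.
func dropHistogramSeries() *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{"__name__"},
		Regex:        relabel.MustNewRegexp(".*_bucket|.*_sum|.*_count.*"),
		Action:       relabel.Drop,
	}
}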
// Hacky way to inject static values (clusterName/instanceId/nodeName/volumeID)

Remind me -- do we do this anywhere else, i.e. where we use relabel configs to inject clusterName, instanceId, etc.?

zhihonl (Author):

Yeah, we do similar stuff in the DCGM scraper:

// hacky way to inject static values (clusterName/instanceId/instanceType)
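
For readers unfamiliar with the pattern, a rough sketch of what "injecting a static value" via relabeling means: a Replace rule whose regex matches everything and whose Replacement is a value computed when the config is built. Taking the cluster name as a plain string here is a simplification for the sketch; the actual scrapers pull it from hostInfoProvider.

package nvme // hypothetical, sketch only

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// injectClusterName is a hypothetical helper showing the injection pattern:
// the regex always matches, so every series gets a ClusterName label whose
// value was resolved at config-build time.
func injectClusterName(clusterName string) *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{"__address__"}, // any label that is always present
		Regex:        relabel.MustNewRegexp(".*"),
		TargetLabel:  "ClusterName",
		Replacement:  clusterName,
		Action:       relabel.Replace,
	}
}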

Action: relabel.Keep,
},

// Below metrics are histogram type which are not supported for container insights yet

Let's make sure we have a TODO item in the backlog for this. Once we support Prometheus histograms, we should add these back.

zhihonl (Author):

I think @duhminick investigated this, and it was proposed not to pursue these metrics because the emitted metric isn't actually useful.

}
}

func getMetricRelabelConfig(hostInfoProvider hostInfoProvider) []*relabel.Config {

Why do we need relabel configs here? Can't we just take all the metrics scraped from the endpoint?

Sorry if this was answered before -- I can't remember the previous PR

zhihonl (Author):

It's for relabeling the dimensions to the values we actually want to see in CloudWatch. You could scrape them as-is and use another processor to relabel them, but this just follows the DCGM pattern:

func getMetricRelabelConfig(hostInfoProvider hostInfoProvider) []*relabel.Config {
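
To make that concrete, a hedged sketch of the kind of rename rule such a list might contain: copying the scraped volume_id label into the VolumeID dimension seen in the EMF output above. The exact label names and rule shape in the PR may differ.

package nvme // hypothetical, sketch only

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// renameVolumeIDLabel copies the scraped volume_id label into the VolumeID
// dimension that Container Insights expects on CloudWatch.
func renameVolumeIDLabel() *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{"volume_id"},
		Regex:        relabel.MustNewRegexp("(.*)"),
		TargetLabel:  "VolumeID",
		Replacement:  "${1}",
		Action:       relabel.Replace,
	}
}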

@duhminick

I know this works well, so I'm okay with approving it. Though I still wonder what the behavior of this is (outside of the histogram stuff that we had talked about):

if serviceName, ok := rm.Resource().Attributes().Get("service.name"); ok {
	if strings.HasPrefix(serviceName.Str(), "containerInsightsKubeAPIServerScraper") ||
		strings.HasPrefix(serviceName.Str(), "containerInsightsDCGMExporterScraper") ||
		strings.HasPrefix(serviceName.Str(), "containerInsightsNeuronMonitorScraper") ||
		strings.HasPrefix(serviceName.Str(), "containerInsightsKueueMetricsScraper") {
		// the prometheus metrics that come from the container insight receiver need to be clearly tagged as coming from container insights
		metricReceiver = containerInsightsReceiver
	}
}

@zhihonl merged commit 54ac70e into aws-cwa-dev on Apr 24, 2025
127 of 147 checks passed