Add scraper for NVME Container Insights prometheus metrics in EKS #296
Conversation
		Action: relabel.Replace,
	},

	// The metrics below are histograms, which are not supported for Container Insights yet
Is there a way to configure the relabels so that we drop any metrics that might exist in the future? My concern is that if a new metric appears, we will probably miss the chance to rename it before it starts getting emitted.
It would be a problem to start emitting a metric and then later emit it under a new name; that might cause some customer confusion.
Fixed in the new commit. Metric names matching the histogram pattern are now removed.
Oh, I don't think I explained it properly. All of these metrics are prefixed with aws_ebs_csi_. If EBS adds a new metric with the same prefix in the future, this scraper is going to pick it up, and we won't have a rename/transformation for it.
I don't think this is an issue anymore. All transformations will be done on the agent side using metricstransformprocessor. The EMF exporter has a configuration to emit only allowlisted metrics, so it will never emit metrics we don't know about anyway. The responsibility of the Container Insights receiver is really just to scrape any valid metrics.
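For reference, an allowlist — whether expressed as a relabel Keep action in the scraper or as the EMF exporter's allowlisted-metrics config — boils down to an anchored alternation over known metric names, since the Prometheus relabel package anchors its regexes. A minimal stdlib sketch of that matching semantics (the two metric names come from this diff; the helper is hypothetical, not code from the PR):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keepRegex builds an anchored alternation over known metric names,
// mirroring how prometheus/model/relabel anchors its patterns.
func keepRegex(known []string) *regexp.Regexp {
	return regexp.MustCompile("^(?:" + strings.Join(known, "|") + ")$")
}

func main() {
	re := keepRegex([]string{
		"aws_ebs_csi_exceeded_tp_seconds_total",
		"aws_ebs_csi_volume_queue_length",
	})
	fmt.Println(re.MatchString("aws_ebs_csi_volume_queue_length")) // known name: kept
	fmt.Println(re.MatchString("aws_ebs_csi_future_metric"))       // unknown name: dropped
}
```

With a Keep action, anything outside the alternation is dropped at scrape time; with the exporter-side allowlist, the same set membership check happens at emit time instead.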
	ebsExceededTPTime      = "aws_ebs_csi_exceeded_tp_seconds_total"
	ebsExceededEC2IOPSTime = "aws_ebs_csi_ec2_exceeded_iops_seconds_total"
	ebsExceededEC2TPTime   = "aws_ebs_csi_ec2_exceeded_tp_seconds_total"
	ebsVolumeQueueLength   = "aws_ebs_csi_volume_queue_length"
There was some discussion if we can convert the histogram metrics:
node_diskio_ebs_read_io_latency
node_diskio_ebs_write_io_latency
to statistics (avg/min/max). I think we already do this for api server metrics?
Seems like we already do this for some histograms - https://github.com/amazon-contributing/opentelemetry-collector-contrib/blob/aws-cwa-dev/exporter/awsemfexporter/datapoint.go#L188
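The sum/count-to-statistics idea referenced above is simple in principle. A hedged sketch of deriving an average from a histogram datapoint (the type and field names here are assumptions for illustration, not the exporter's actual ones):

```go
package main

import "fmt"

// histogramDP is a stand-in for a histogram datapoint carrying the
// aggregate fields a statistics conversion needs; min/max would come
// from explicit fields or bucket bounds when available.
type histogramDP struct {
	Sum   float64
	Count uint64
	Min   float64
	Max   float64
}

// avg derives the mean from sum and count, guarding the empty case.
func avg(dp histogramDP) float64 {
	if dp.Count == 0 {
		return 0
	}
	return dp.Sum / float64(dp.Count)
}

func main() {
	dp := histogramDP{Sum: 12.0, Count: 4, Min: 1.0, Max: 6.0}
	fmt.Println(avg(dp)) // 3
}
```

This is the same shape of conversion the linked awsemfexporter datapoint code performs for some histograms, reduced to the arithmetic.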
In the NVMe metrics one-pager, I see the latency metrics are not included, so I'm not adding them here.
Sure, but I know there was some discussion on that doc with the Container Insights team about including the latency metrics, since they would be valuable to customers and we don't want to have to add them later, after we have launched.
We can address this in a follow-up -- but imo we should investigate whether we can include the latency metrics as part of the launch.
Agreed. This PR can be the baseline, and we can implement the histogram metrics separately if needed.
	ebsVolumeQueueLength = "aws_ebs_csi_volume_queue_length"

	// Converted Names
	nodeReadOpsTotal = "node_diskio_ebs_total_read_ops"
do we want cumulative metrics?
I do see in the docs we have -total metrics for neuron and nvidia... https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-enhanced-EKS.html
Might be worth checking with container insights team
In the NVME one-pager it's listed as one of the metrics so adding it here.
)

var MetricToUnit = map[string]string{
	nodeReadOpsTotal: "Count",
make the units consts. Surprised these aren't already consts
Will address in next commit
Fixed in new commit
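A sketch of what the suggested change looks like (the const names are my own; only the string values appear in the diff):

```go
package main

import "fmt"

// Hoist the unit strings into consts so the map values can't drift
// across call sites.
const unitCount = "Count"

// nodeReadOpsTotal matches the converted name from the diff.
const nodeReadOpsTotal = "node_diskio_ebs_total_read_ops"

var MetricToUnit = map[string]string{
	nodeReadOpsTotal: unitCount,
}

func main() {
	fmt.Println(MetricToUnit[nodeReadOpsTotal]) // Count
}
```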
	TargetLabel: "__name__",
	Regex:       relabel.MustNewRegexp(ebsReadOpsTotal),
	Replacement: nodeReadOpsTotal,
	Action:      relabel.Replace,
Is this what we do for the other scrapers? We set up relabel configs for each metric?
Looks like the metric is converted in the agent repo as part of the translation. I can move the logic there instead.
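Whichever side the conversion ends up on, a single rename map avoids hand-writing one relabel config per metric. A stdlib sketch of the lookup semantics (the target name is from the diff; the source name is an assumed value for ebsReadOpsTotal, which the diff references but does not show):

```go
package main

import "fmt"

// renames drives all __name__ rewrites from one table. The key below is
// an assumed value for ebsReadOpsTotal; the value is from the diff.
var renames = map[string]string{
	"aws_ebs_csi_read_ops_total": "node_diskio_ebs_total_read_ops",
}

// rename rewrites a scraped metric name, passing unknown names through.
func rename(name string) string {
	if to, ok := renames[name]; ok {
		return to
	}
	return name
}

func main() {
	fmt.Println(rename("aws_ebs_csi_read_ops_total"))
}
```

In the relabel-config form, each map entry corresponds to one Replace rule with the key as Regex and the value as Replacement, which is what "Generalize relabel configs" in the commit list suggests.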
* Add scraper for NVME Container Insights prometheus metrics in EKS
* Remove unnecessary code
* Fix linter error
* Generalize relabel configs and fix naming conventions
* Change unit map names

…) (#309)
* Add scraper for NVME Container Insights prometheus metrics in EKS (#296)
* Add scraper for NVME Container Insights prometheus metrics in EKS
* Remove unnecessary code
* Fix linter error
* Generalize relabel configs and fix naming conventions
* Change unit map names
* Fix unit test failures
Important
This PR assumes the CSI driver will change its Kubernetes service traffic policy to Local. Otherwise, requests to the service IP will be routed to a random pod, and the scraped metrics will come from an arbitrary node.
Description
Customers who have the EBS CSI driver add-on installed should see new EBS volume metrics when enabling Container Insights. This is not the current behavior, since the Container Insights receiver does not have any functionality to support this use case.
This PR adds a Prometheus scraper for port 3302, the default port exposed by the CSI driver for disk metrics: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md#ebs-node-metrics
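For context, that endpoint serves plain Prometheus text exposition. A toy sketch of picking out the aws_ebs_csi_-prefixed samples from a scrape body (the payload below is illustrative, not captured output):

```go
package main

import (
	"fmt"
	"strings"
)

// sample is an illustrative exposition payload, not real scraper output.
const sample = `# HELP aws_ebs_csi_volume_queue_length help text
aws_ebs_csi_volume_queue_length 4
go_goroutines 12`

// filterEBS keeps only the lines carrying the EBS CSI metric prefix,
// skipping comment lines and unrelated metric families.
func filterEBS(payload string) []string {
	var out []string
	for _, line := range strings.Split(payload, "\n") {
		if strings.HasPrefix(line, "aws_ebs_csi_") {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	for _, l := range filterEBS(sample) {
		fmt.Println(l)
	}
}
```

In the actual receiver the Prometheus scrape machinery does this parsing; the sketch only shows what the relabel rules operate over.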
Link to tracking issue
Testing
Unit test.
Manual Testing
EMF Output