Skip to content

Fixing EMF output for detailed summary metrics #313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 13, 2025

Conversation

dricross
Copy link

@dricross dricross commented May 9, 2025

Description

Enabling the “DetailedMetrics” configuration for the awsemfexporter causes this OTeL Prometheus receiver → awsemfexporter pipeline to output n + 2 log events:

  • One log event for each quantile
  • One log event for count metric
  • One log event for sum metric

The original implementation for the DetailedMetrics functionality outputs count/sum metrics as one log event instead of two:

  • One log event for each quantile
  • One log event for count/sum metric

A recent change to the awsemfexporter caused the groupKey for count and summary metrics to be unique so they are no longer grouped together and therefore emitted as separate log events.

This PR restore the original functionality.

Testing

Followed something similar to the OTLP -> AMP integration test (https://github.com/aws/amazon-cloudwatch-agent-test/blob/main/test/amp/resources/otlp_pusher.sh) but configured the agent to push to EMF instead of AMP.

Use following OTLP metrics file with summary metric only:

{
  "resourceMetrics": [
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "my.service"
            }
          }
        ]
      },
      "scopeMetrics": [
        {
          "scope": {
            "name": "my.library",
            "version": "1.0.0",
            "attributes": [
              {
                "key": "my.scope.attribute",
                "value": {
                  "stringValue": "some scope attribute"
                }
              }
            ]
          },
          "metrics": [
            {
              "name": "my.summary",
              "unit": "ms",
              "description": "I am a Summary metric",
              "summary": {
                "dataPoints": [
                  {
                    "startTimeUnixNano": START_TIME,
                    "timeUnixNano": START_TIME,
                    "count": 100,
                    "sum": 5000.0,
                    "quantileValues": [
                      {
                        "quantile": 0.50,
                        "value": 43.5
                      },
                      {
                        "quantile": 0.95,
                        "value": 95.2
                      },
                      {
                        "quantile": 0.99,
                        "value": 97.8
                      }
                    ],
                    "attributes": [
                      {
                        "key": "my.summary.attr",
                        "value": {
                          "stringValue": "some value"
                        }
                      }
                    ]
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}

Start with this agent config:

{
  "agent": {
    "metrics_collection_interval": 15,
    "run_as_user": "root",
    "debug": true
  },
  "logs": {
    "metrics_collected": {
      "otlp": {
        "http_endpoint": "127.0.0.1:1234"
      }
    }
  }
}

Restart agent to run translator, then manually update agent otel config since json config doesn't support setting detailed metrics:

exporters:
    awsemf:
..
        detailed_metrics: true
...

Start the agent without running translator by executing the binary directly:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent -config /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml -envconfig /opt/aws/amazon-cloudwatch-agent/etc/env-config.json -otelconfig /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.yaml -pidfile /opt/aws/amazon-cloudwatch-agent/var/amazon-cloudwatch-agent.pid &

Run the otlp_pusher.sh

while true; export START_TIME=$(date +%s%N); do
  cat ./resources/otlp_metrics.json | sed -e "s/START_TIME/$START_TIME/" > otlp_metrics.json;
  curl -H 'Content-Type: application/json' -d @otlp_metrics.json -i http://127.0.0.1:1234/v1/metrics --verbose;
sleep 5; don

Check CloudWatch logs

Before changes (5 log events)
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.summary.attr",
                    "service.name",
                    "quantile"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812484637",
    "Version": "0",
    "my.summary.attr": "some value",
    "quantile": "0.5",
    "service.name": "my.service",
    "my.summary": 43.5
}
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.summary.attr",
                    "service.name",
                    "quantile"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812484637",
    "Version": "0",
    "my.summary.attr": "some value",
    "quantile": "0.95",
    "service.name": "my.service",
    "my.summary": 95.2
}
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.summary.attr",
                    "service.name",
                    "quantile"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812484637",
    "Version": "0",
    "my.summary.attr": "some value",
    "quantile": "0.99",
    "service.name": "my.service",
    "my.summary": 97.8
}
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.summary.attr",
                    "service.name"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary_count",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812484637",
    "Version": "0",
    "my.summary.attr": "some value",
    "service.name": "my.service",
    "my.summary_count": 100
}
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "service.name",
                    "my.summary.attr"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary_sum",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812484637",
    "Version": "0",
    "my.summary.attr": "some value",
    "service.name": "my.service",
    "my.summary_sum": 5000
}
After changes (4 logs events)
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.summary.attr",
                    "service.name"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary_sum",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                },
                {
                    "Name": "my.summary_count",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812680212",
    "Version": "0",
    "my.summary.attr": "some value",
    "service.name": "my.service",
    "my.summary_count": 100,
    "my.summary_sum": 5000
}
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.summary.attr",
                    "service.name",
                    "quantile"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812680212",
    "Version": "0",
    "my.summary.attr": "some value",
    "quantile": "0.5",
    "service.name": "my.service",
    "my.summary": 43.5
}
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "quantile",
                    "my.summary.attr",
                    "service.name"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812680212",
    "Version": "0",
    "my.summary.attr": "some value",
    "quantile": "0.95",
    "service.name": "my.service",
    "my.summary": 95.2
}
{
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.summary.attr",
                    "service.name",
                    "quantile"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.summary",
                    "Unit": "Milliseconds",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "Timestamp": "1746812680212",
    "Version": "0",
    "my.summary.attr": "some value",
    "quantile": "0.99",
    "service.name": "my.service",
    "my.summary": 97.8
}

Documentation

@dricross dricross requested a review from mxiamxia as a code owner May 9, 2025 18:02
@@ -51,13 +51,13 @@ func addToGroupedMetric(
}
continue
}
dps, retained := dps.CalculateDeltaDatapoints(i, metadata.instrumentationScopeName, config.DetailedMetrics, calculators)
ddps, retained := dps.CalculateDeltaDatapoints(i, metadata.instrumentationScopeName, config.DetailedMetrics, calculators)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name dps here shadows the variable created previously on line 39. Changed to ddps for "delta datapoints"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid changing variables names from upstream since it can cause merge conflicts -- https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/awsemfexporter/grouped_metric.go#L39

I can see why this is confusing though

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah I forgot it'll cause conflicts. ugh. this was so confusing when I was trying to figure out this bug. i'll revert the name changes

Comment on lines 89 to 93
if !(metadata.groupedMetricMetadata.metricDataType == pmetric.MetricTypeSummary && config.DetailedMetrics) {
// Summary metrics can be split into separate datapoints when using DetailedMetrics, but we still want to group
// them together into one EMF log event, so don't set batchIndex when it's a summary metric
metadata.groupedMetricMetadata.batchIndex = j
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug fix

if !retained {
continue
}

for i, dp := range dps {
labels := dp.labels
for j, ddp := range ddps {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name i here shadows the variable created previously on line 44.

dp -> ddp to mirror dps -> ddps change.

@@ -51,13 +51,13 @@ func addToGroupedMetric(
}
continue
}
dps, retained := dps.CalculateDeltaDatapoints(i, metadata.instrumentationScopeName, config.DetailedMetrics, calculators)
ddps, retained := dps.CalculateDeltaDatapoints(i, metadata.instrumentationScopeName, config.DetailedMetrics, calculators)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid changing variables names from upstream since it can cause merge conflicts -- https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/awsemfexporter/grouped_metric.go#L39

I can see why this is confusing though

}

// Extra params to use when grouping metrics
metadata.groupedMetricMetadata.batchIndex = i
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this? Will we break appsignals since they are the ones who added this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't removed, just moved inside the if statement to skip this part for detailed summary metrics (for which the assumptions that were made when this when introduced does not hold).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree this code is confusing so what's happening is to skip index changing for summary and when detailedMetrics is enabled. Is there any chance that these metrics are sparse in the array meaning keeping the index would cause any unexepcted grouping behavior?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance that these metrics are sparse in the array meaning keeping the index would cause any unexepcted grouping behavior

I don't think so. Here is how the array of datapoints is created: https://github.com/amazon-contributing/opentelemetry-collector-contrib/blob/bump-v0.124.1/exporter/awsemfexporter/datapoint.go#L483-L494. There' always two array entries for sum and count metric (that we want grouped) and then one datapoint for each quantile. We want the quantile entries to be separated, but they are already are because of the labels are different.

@dricross dricross force-pushed the dricross/detailedsummarymetrics branch from ced70e2 to 9580996 Compare May 12, 2025 12:35
@dricross dricross merged commit fa9736b into aws-cwa-dev May 13, 2025
127 of 147 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants