Skip to content

Conversation

@tommyers-elastic
Copy link
Contributor

@tommyers-elastic tommyers-elastic commented Sep 23, 2025

Description

Adds an additional way to set dataset based on scope attributes set by econding extensions. The main use case we are supporting with this right now is awslogsencodingextension which sets encoding.format like aws.cloudtrail, aws.vpcflow, aws.elbaccess. See #42899 for more detail.

Link to tracking issue

Fixes

Testing

Unit test added.

Documentation

README updated.

// 3. receiver-based routing
// 4. use default hardcoded data_stream.*
// 3. extension-based routing
// 4. receiver-based routing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the expected usage that extension-based routing is prioritized before receiver-based?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh, this was my hunch - i figured you could have a receiver e.g. filelog, with additional per-log-type encoding type annotations that are more specific.

@MichaelKatsoulis
Copy link
Contributor

@tommyers-elastic the relevant encoding extension PR is approved and waiting to get merged. What are the next steps on this one?

@tommyers-elastic
Copy link
Contributor Author

@MichaelKatsoulis i updated and refactored a little. ready for review now.

@tommyers-elastic tommyers-elastic changed the title [elasticsearchexporter] Add dataset rule for awslogsencodingextension using scope attributes [elasticsearchexporter] Support 'encoding.format' scope attribute for dataset routing Sep 26, 2025
Copy link
Contributor

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, lgtm.

minor Q: do we want this encoding.format in the final document? I assume it is fine to keep them, which is what is happening now. Otherwise, to exclude them, it'll be some changes in exporter/elasticsearchexporter/internal/serializer/otelserializer/common.go:53

@tommyers-elastic
Copy link
Contributor Author

do we want this encoding.format in the final document?

not sure. i think it probably makes sense to keep it

Copy link
Contributor

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

1. `data_stream.dataset` or `data_stream.namespace` in attributes (precedence: log record / data point / span attribute > scope attribute > resource attribute)
2. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1
3. Otherwise, `data_stream.dataset` falls back to `generic` and `data_stream.namespace` falls back to `default`.
2. Otherwise, if a scope attribute with the name `encoding.format` exists and contains a string value, `data_stream.dataset` will be set to this value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the addition of this quite arbitrary? What if users want to use another scope/resource attribute for routing? Also what if there are components that already populate this attribute? Hardcoding this here will break the routing for them, right? In this, why not to make this configurable so as users can define the routing rule based on whatever they would like to i.e. routing_attribute: resource.attributes["someattr"] (similarly to ottl)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't they already effectively do this just by using the explicit data_stream.* attributes?

Also what if there are components that already populate this attribute?

this is a fair point, but we're not currently aware of any.

why not to make this configurable so as users can define the routing rule based on whatever they would like

the intention here is to provide a default behaviour so encoding extension + es exporter works OOTB.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't they already effectively do this just by using the explicit data_stream.* attributes?

No without explicitly setting these. Users might want to avoid setting them for several reasons, and instead route based on existing attributes.

the intention here is to provide a default behaviour so encoding extension + es exporter works OOTB.

My concern here is about if a generic exporter like this one, should be coupled that much to specific components like the one you mention. To me this looks weird and isn't really future-proof/extensible. That's why I suggested making this configurable.

However, since this has been approved by a code-owner you can dismiss my review if others agree with proceeding (consider marking this as a breaking change).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree that document routing in the Elasticsearch exporter is already complex and is getting even more complex with this change. This makes it harder for users to understand what's going on by just looking at the config or by looking at the data that landed in Elasticsearch. Questions like "Where's my data?" or "Why was my data routed to this index?" are increasingly harder to answer.

I understand the need to for the component to "work out of the box". The problem I see with this is that everyone seems to have their own, different idea of what "out of the box" behavior looks like.

This change proposes to use the encoding.format scope attribute to route documents to Elasticsearch. This attribute is currently only set by one (1) of the ten (10) encoding extensions in Contrib, making it not quite as general as one might assume.

It would be much much simpler if the Elasticsearch exporter only routed based on the data_stream.* attributes. Those attributes may be set in a processor based on any other attribute or scope name. I would prefer this approach.

Having said all this, I'll defer to the component's code owners on this, at least for this change.

Copy link
Member

@lahsivjar lahsivjar Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated myself on the issue and from what I understand so far, there is already an issue to drive upstream semconv to include encoding.format as scope attributes. If/when we have this, if I understand correctly, we would still end up handling this in the ES exporter. In this sense, it is somewhat of an experimental addition - @tommyers-elastic, maybe we can mark it experimental and subject to change until the semconv goes forward?

Copy link
Member

@lahsivjar lahsivjar Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding treating this as a breaking change, I don't have an opinion, but we should surely have this mentioned in the changelog in some format so that users who might have this attribute in their pipeline are not surprised by the side effects.

Copy link
Contributor

@carsonip carsonip Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zooming out a bit. Let's consider data_stream.dataset handling is Elasticsearch specific, and ES exporter would like to implement routing based on a standard. And encoding.format would be a proposed standard that es exporter is willing to implement as an experimental feature. I see this as a healthy addition that makes es exporter more operable with other components without requiring the other components to implement an elasticsearch specific attribute, or requiring the user to add a processor just to glue 2 components together.

On the topic of configurable routing attribute, making it a configuration won't make routing less complex. Not all routing are attribute based, e.g. telemetry scope name routing that requires parsing. Not to mention the order between the routing rules themselves. Probably YAGNI and we can refactor the code to make it configurable in the future if we really see the need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carsonip do you consider the whole routing behavior as subject to change? If so, that should be documented imo.

Regarding the current addition, if that's experimental as @lahsivjar mentioned, how about adding this behind a feature gate? That's a common practice for the Collector project when it comes to experimental features.

A slightly less ideal alternative to the feature gate would be to just clearly document that the added routing rule is experimental. Though that's a bit weird when it affects the whole default routing behavior 🤔 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that any changes to the routing logic also need to be applied to the Elasticsearch OTLP endpoint.
For this proposed change, I’m a bit worried that the attribute is too specific about encoding extensions. It feels like a slippery slope for adding more similar routing attributes. Can we instead either set an appropriate scope name in the encoding extensions or introduce a more generic concept, something like a dataset or schema name that describes the shape of the data that we can use for routing?

Copy link
Contributor

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putting this PR on hold as we realign on the e2e design

Copy link
Contributor

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reapproving this PR as this PR is revived as an outcome of realignment

Copy link
Member

@andrzej-stencel andrzej-stencel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code owners decided to use the encoding.format for now.

@andrzej-stencel andrzej-stencel merged commit c15e845 into open-telemetry:main Oct 13, 2025
240 of 248 checks passed
@github-actions github-actions bot added this to the next release milestone Oct 13, 2025
@otelbot
Copy link
Contributor

otelbot bot commented Oct 13, 2025

Thank you for your contribution @tommyers-elastic! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

ChrsMark pushed a commit to ChrsMark/opentelemetry-collector-contrib that referenced this pull request Oct 20, 2025
… dataset routing (open-telemetry#42844)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

Adds an additional way to set dataset based on scope attributes set by
econding extensions. The main use case we are supporting with this right
now is awslogsencodingextension which sets `encoding.format` like
`aws.cloudtrail`, `aws.vpcflow`, `aws.elbaccess`. See
open-telemetry#42899
for more detail.

<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Fixes

<!--Describe what testing was performed and which tests were added.-->
#### Testing

Unit test added. 

<!--Describe the documentation added.-->
#### Documentation

README updated.

<!--Please delete paragraphs that you did not use before submitting.-->

---------

Co-authored-by: Carson Ip <carsonip@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.