-
Notifications
You must be signed in to change notification settings - Fork 3.1k
[elasticsearchexporter] Support 'encoding.format' scope attribute for dataset routing #42844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[elasticsearchexporter] Support 'encoding.format' scope attribute for dataset routing #42844
Conversation
| // 3. receiver-based routing | ||
| // 4. use default hardcoded data_stream.* | ||
| // 3. extension-based routing | ||
| // 4. receiver-based routing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the expected usage that extension-based routing is prioritized before receiver-based?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeh, this was my hunch - i figured you could have a receiver e.g. filelog, with additional per-log-type encoding type annotations that are more specific.
|
@tommyers-elastic the relevant encoding extension PR is approved and waiting to get merged. What are the next steps on this one? |
|
@MichaelKatsoulis i updated and refactored a little. ready for review now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, lgtm.
minor Q: do we want this encoding.format in the final document? I assume it is fine to keep them, which is what is happening now. Otherwise, to exclude them, it'll be some changes in exporter/elasticsearchexporter/internal/serializer/otelserializer/common.go:53
not sure. i think it probably makes sense to keep it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks!
| 1. `data_stream.dataset` or `data_stream.namespace` in attributes (precedence: log record / data point / span attribute > scope attribute > resource attribute) | ||
| 2. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1 | ||
| 3. Otherwise, `data_stream.dataset` falls back to `generic` and `data_stream.namespace` falls back to `default`. | ||
| 2. Otherwise, if a scope attribute with the name `encoding.format` exists and contains a string value, `data_stream.dataset` will be set to this value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the addition of this quite arbitrary? What if users want to use another scope/resource attribute for routing? Also what if there are components that already populate this attribute? Hardcoding this here will break the routing for them, right? In this, why not to make this configurable so as users can define the routing rule based on whatever they would like to i.e. routing_attribute: resource.attributes["someattr"] (similarly to ottl)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't they already effectively do this just by using the explicit data_stream.* attributes?
Also what if there are components that already populate this attribute?
this is a fair point, but we're not currently aware of any.
why not to make this configurable so as users can define the routing rule based on whatever they would like
the intention here is to provide a default behaviour so encoding extension + es exporter works OOTB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't they already effectively do this just by using the explicit data_stream.* attributes?
No without explicitly setting these. Users might want to avoid setting them for several reasons, and instead route based on existing attributes.
the intention here is to provide a default behaviour so encoding extension + es exporter works OOTB.
My concern here is about if a generic exporter like this one, should be coupled that much to specific components like the one you mention. To me this looks weird and isn't really future-proof/extensible. That's why I suggested making this configurable.
However, since this has been approved by a code-owner you can dismiss my review if others agree with proceeding (consider marking this as a breaking change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree that document routing in the Elasticsearch exporter is already complex and is getting even more complex with this change. This makes it harder for users to understand what's going on by just looking at the config or by looking at the data that landed in Elasticsearch. Questions like "Where's my data?" or "Why was my data routed to this index?" are increasingly harder to answer.
I understand the need to for the component to "work out of the box". The problem I see with this is that everyone seems to have their own, different idea of what "out of the box" behavior looks like.
This change proposes to use the encoding.format scope attribute to route documents to Elasticsearch. This attribute is currently only set by one (1) of the ten (10) encoding extensions in Contrib, making it not quite as general as one might assume.
It would be much much simpler if the Elasticsearch exporter only routed based on the data_stream.* attributes. Those attributes may be set in a processor based on any other attribute or scope name. I would prefer this approach.
Having said all this, I'll defer to the component's code owners on this, at least for this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated myself on the issue and from what I understand so far, there is already an issue to drive upstream semconv to include encoding.format as scope attributes. If/when we have this, if I understand correctly, we would still end up handling this in the ES exporter. In this sense, it is somewhat of an experimental addition - @tommyers-elastic, maybe we can mark it experimental and subject to change until the semconv goes forward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding treating this as a breaking change, I don't have an opinion, but we should surely have this mentioned in the changelog in some format so that users who might have this attribute in their pipeline are not surprised by the side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zooming out a bit. Let's consider data_stream.dataset handling is Elasticsearch specific, and ES exporter would like to implement routing based on a standard. And encoding.format would be a proposed standard that es exporter is willing to implement as an experimental feature. I see this as a healthy addition that makes es exporter more operable with other components without requiring the other components to implement an elasticsearch specific attribute, or requiring the user to add a processor just to glue 2 components together.
On the topic of configurable routing attribute, making it a configuration won't make routing less complex. Not all routing are attribute based, e.g. telemetry scope name routing that requires parsing. Not to mention the order between the routing rules themselves. Probably YAGNI and we can refactor the code to make it configurable in the future if we really see the need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carsonip do you consider the whole routing behavior as subject to change? If so, that should be documented imo.
Regarding the current addition, if that's experimental as @lahsivjar mentioned, how about adding this behind a feature gate? That's a common practice for the Collector project when it comes to experimental features.
A slightly less ideal alternative to the feature gate would be to just clearly document that the added routing rule is experimental. Though that's a bit weird when it affects the whole default routing behavior 🤔 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that any changes to the routing logic also need to be applied to the Elasticsearch OTLP endpoint.
For this proposed change, I’m a bit worried that the attribute is too specific about encoding extensions. It feels like a slippery slope for adding more similar routing attributes. Can we instead either set an appropriate scope name in the encoding extensions or introduce a more generic concept, something like a dataset or schema name that describes the shape of the data that we can use for routing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
putting this PR on hold as we realign on the e2e design
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reapproving this PR as this PR is revived as an outcome of realignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code owners decided to use the encoding.format for now.
c15e845
into
open-telemetry:main
|
Thank you for your contribution @tommyers-elastic! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help. |
… dataset routing (open-telemetry#42844) <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description Adds an additional way to set dataset based on scope attributes set by econding extensions. The main use case we are supporting with this right now is awslogsencodingextension which sets `encoding.format` like `aws.cloudtrail`, `aws.vpcflow`, `aws.elbaccess`. See open-telemetry#42899 for more detail. <!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. --> #### Link to tracking issue Fixes <!--Describe what testing was performed and which tests were added.--> #### Testing Unit test added. <!--Describe the documentation added.--> #### Documentation README updated. <!--Please delete paragraphs that you did not use before submitting.--> --------- Co-authored-by: Carson Ip <carsonip@users.noreply.github.com>
Description
Adds an additional way to set dataset based on scope attributes set by econding extensions. The main use case we are supporting with this right now is awslogsencodingextension which sets
encoding.formatlikeaws.cloudtrail,aws.vpcflow,aws.elbaccess. See #42899 for more detail.Link to tracking issue
Fixes
Testing
Unit test added.
Documentation
README updated.