Custom Record Grouping #297

mkeskells opened this issue Sep 25, 2024 · 3 comments
@mkeskells (Contributor)

We have a requirement to do some grouping based on:

  1. a timestamp in the payload (not the Kafka record timestamp)
  2. other custom fields in the payload

This is for Avro records going to GCS (in our case), but it could clearly be generalised.

What I was thinking is to allow a custom record grouping to be defined, probably via a configuration parameter, bypassing the default RecordGroupingFactory behaviour. So mostly a change to AivenCommonConfig to add the parameter, and to RecordGroupingFactory to observe it.

I am happy to submit a PR if this looks to be the "right" way to do this, but I am new to this library and would be happy to take a steer first.
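To make the idea concrete, here is a minimal sketch of the kind of payload-driven grouper described above. All names here (`PayloadRecord`, `PayloadFieldGrouper`, the `tenant` and `event_time` fields) are illustrative assumptions, not the connector's actual API; the real connector would work with `SinkRecord` and its `RecordGrouper` machinery.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a sink record; a real implementation would use
// org.apache.kafka.connect.sink.SinkRecord with an Avro/Struct payload.
record PayloadRecord(Map<String, Object> payload) {}

// Sketch of a payload-driven grouper: the group key combines a custom
// payload field with an hour bucket derived from a timestamp *inside*
// the payload, not the Kafka record timestamp.
class PayloadFieldGrouper {
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    private final Map<String, List<PayloadRecord>> groups = new LinkedHashMap<>();

    void put(final PayloadRecord rec) {
        // Read the timestamp and grouping field from the payload itself.
        final long tsMillis = ((Number) rec.payload().get("event_time")).longValue();
        final String tenant = String.valueOf(rec.payload().get("tenant"));
        final String key = tenant + "/" + HOUR.format(Instant.ofEpochMilli(tsMillis));
        groups.computeIfAbsent(key, k -> new ArrayList<>()).add(rec);
    }

    Map<String, List<PayloadRecord>> records() {
        return groups;
    }
}

public class GrouperDemo {
    public static void main(final String[] args) {
        final PayloadFieldGrouper grouper = new PayloadFieldGrouper();
        grouper.put(new PayloadRecord(Map.of("tenant", "acme", "event_time", 0L)));
        grouper.put(new PayloadRecord(Map.of("tenant", "acme", "event_time", 1L)));
        System.out.println(grouper.records().keySet());
    }
}
```

Both records land in the same `acme/1970/01/01/00` group because their payload timestamps fall in the same UTC hour, regardless of when Kafka received them.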

@mkeskells (Contributor, Author)

For the timestamp source, something like #298.

It's missing tests, documentation etc., and the extractor for a data path, but it should give an indication of the way that I am thinking.

I added several timestamp sources, maybe more than we need, but once the door is opened ...
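A selectable-timestamp-source design along those lines might look like the sketch below. This is an assumption about the shape of the idea, not the API in #298; the enum constants, `TimestampExtractor`, and the `event_time` field name are all hypothetical.

```java
import java.time.Instant;
import java.util.Map;

// Hypothetical set of selectable timestamp sources: wall clock, the Kafka
// record timestamp, or a field inside the record's payload.
enum TimestampSource { WALLCLOCK, KAFKA_RECORD, DATA_FIELD }

class TimestampExtractor {
    // Resolve the timestamp for a record according to the configured source.
    static Instant extract(final TimestampSource source, final long kafkaTimestampMillis,
                           final Map<String, Object> payload, final String fieldName) {
        return switch (source) {
            case WALLCLOCK -> Instant.now();
            case KAFKA_RECORD -> Instant.ofEpochMilli(kafkaTimestampMillis);
            case DATA_FIELD ->
                // Pull the timestamp out of the payload rather than the envelope.
                Instant.ofEpochMilli(((Number) payload.get(fieldName)).longValue());
        };
    }
}

public class TimestampDemo {
    public static void main(final String[] args) {
        final Instant fromPayload = TimestampExtractor.extract(
            TimestampSource.DATA_FIELD, 123L,
            Map.of("event_time", 1_700_000_000_000L), "event_time");
        System.out.println(fromPayload); // payload time wins over the Kafka timestamp
    }
}
```

The point of the enum is exactly the "once the door is opened" concern above: each new source is one more constant and one more switch arm, so the cost of adding (or pruning) sources stays small.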

@mkeskells (Contributor, Author)

mkeskells commented Sep 27, 2024

For the custom grouper, the draft PR is #300.
I have added some tests and will check it works for our use case next week, but I would appreciate any feedback or a re-steer.

@Claudenw (Contributor)

@mkeskells we have recently completed a source framework and an implementation on S3. I would like to apply the same approach to the sink frameworks: we create simple components that we can string together to take data from Kafka, stream it through a series of transforms, and write the results to a backend.

I have not spent a lot of time looking at the sink process yet, but it feels to me like your timestamp and grouping work would fall into commons, and the GCS sink would then make use of it. If properly structured, the timestamp and grouping should be applicable to S3, Azure Blob, and potentially various DBs.

I believe that you have spent more time looking at sink than I have. Does this assessment sound reasonable to you? Is there something that stands out as not correct?
