Skip to content

Conversation

@tscottcoombes1
Copy link
Contributor

@tscottcoombes1 tscottcoombes1 commented Nov 26, 2024

Rationale for this change

extra support for protobuf common types, converting to parquet

What changes are included in this PR?

support for protobuf struct and tests

Are these changes tested?

yes

Are there any user-facing changes?

no, just support for new type

@tscottcoombes1 tscottcoombes1 marked this pull request as draft November 26, 2024 15:12
@tscottcoombes1 tscottcoombes1 changed the title break the tests with struct type support protobuf struct type Nov 26, 2024
@tscottcoombes1
Copy link
Contributor Author

@zeroshade can I get your thoughts on this?

spark follows this pattern: https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html#handling-circular-references-protobuf-fields

syntax = "proto3"
message Person {
  string name = 1;
  Person bff = 2
}

// The protobuf schema defined above, would be converted into a Spark SQL columns with the following
// structure based on `recursive.fields.max.depth` value.

0: struct<name: string, bff: null>
1: struct<name string, bff: <name: string, bff: null>>
2: struct<name string, bff: <name: string, bff: struct<name: string, bff: null>>> ...

but parquet can't do null types

my idea, is if we get to depth = x, then we drop the field? e.g. for 2

2: struct<name string, bff: <name: string, bff: struct<name: string>>>

@tscottcoombes1
Copy link
Contributor Author

@zeroshade otherwise, as this is used for json type fields, we could convert to a json string. and not need to think about recursive depths

@zeroshade
Copy link
Member

if the intent here is to use it for specifically JSON fields, then i would say we just use a string and utilize the Canonical JSON extension type so that it gets stored in parquet as a JSON typed field.

@tscottcoombes1 tscottcoombes1 marked this pull request as ready for review November 27, 2024 18:04
Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a few nit picks

Comment on lines +21 to +22
"google.golang.org/protobuf/encoding/protojson"
"google.golang.org/protobuf/types/known/structpb"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you shift these includes to be grouped with the others of the same prefix below?

Comment on lines 21 to +22
"fmt"
"google.golang.org/protobuf/types/known/structpb"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, please place this with the sorted imports below

@zeroshade
Copy link
Member

@tscottcoombes1 Running pre-commit should be sufficient to fix the linting issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants