-
Notifications
You must be signed in to change notification settings - Fork 976
[Variant] Add variant_get
compute kernel
#7919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Samyak2 -- this is looking like a great start. I left some ideas on how to break it up / make progress with smaller PRs.
fyi @scovich and @friendlymatthew
use crate::utils::variant_from_struct_array; | ||
|
||
/// Returns an array with the specified path extracted from the variant values. | ||
pub fn variant_get(input: &ArrayRef, options: GetOptions) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While trying to review this PR, one thing that I thought might help could be to separate out the Variant traversal and the construction of the output
For example, perhaps you could implement a function to find the appropriate sub Variant like
impl Variant {
fn get_path(&self, path: &VariantPath) -> Option<Variant> {
...
}
}
With that building block, you could implement the basic "extract a variant" kernel to a new VariantArray using the newly introduced VariantArrayBuilder
let mut output_builder = VariantArrayBuilder::new();
// loop over each input row
for input_variant in input_variant_array.iter() {
// copy the input variant to the output builder
let mut vb = VariantBuilder::new()
vb.append(input_variant)
let (metadata, value) = vb.build();
output_builder.append_buffers()
}
I think once we get #7914 from @friendlymatthew in that should work
One downside of this approach is that it will copy the variant over and if the source and destinations are both BinaryViewArray we can probably be much more clever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like the "row-based" approach I mentioned in the description. The reason I did it this way: it looks more like a "columnar" approach. I assumed perf would be better. Perhaps we should benchmark both the approaches? I'll see what I can do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the row based approach will likely be slower for non-shredded variants, but it will always potentially be needed in some cases (for example when the source arrays are not BinaryView)
If we have the get_path
method, I think we can potentially implement fast copies of variants by playing games with pointers -- basically by checking if the return variant has a pointer into the same buffer of the BinaryViewArray we can make a view that points there.
However, i think that will be somewhat tricky and require some unsafe
so I suggest we get the first version in plae that does the copy, and once we have it working (and tests written, etc) then I think we'll be in a better position to optimize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. That makes sense. I'll clean up the PR.
but it will always potentially be needed in some cases (for example when the source arrays are not BinaryView)
Not sure I follow. Perhaps worth a separate discussion in a follow-up issue.
0925a6a
to
184e1bf
Compare
I have made the changes, rebased and squashed. Also updated the description to reflect the current implementation. I think it is ready for review. Please take a look! My next steps: I will add some more unit tests if I get time over the next couple of days. In case the PR is merged before that, I'll open a separate PR for the tests. |
variant_get
compute kernel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Samyak2 -- I think this is a great first step. I agree there are a bunch of TODOs but I also think we could merge this as is and then iterate on the output
I am curious about your thoughts on the API design @scovich and @friendlymatthew
|
||
#[test] | ||
#[should_panic( | ||
expected = "Nested values are handled specially by ObjectBuilder and ListBuilder" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 yeah I think once we get this PR in things will be much easier
variant_get
compute kernelvariant_get
compute kernel
dc5d138
to
4a22812
Compare
Made the following changes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good step forward.
I'm still on the fence whether it's better to store (and traverse) a separate path for each desired leaf (e.g. when shredding to a specific schema), but for a single-field "get" this seems like the obvious right approach.
let new_variant = new_variant.get_path(&options.path); | ||
if let Some(new_variant) = new_variant { | ||
// TODO: we're decoding the value and doing a copy into a variant value again. This | ||
// copy can be much smarter. | ||
builder.append_variant(new_variant); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just do a directly byte-slice copy? The input is either fully validated or the caller has asserted that the bytes are correct. Maybe a VariantArray::value_bytes
method and then VariantArrayBuilder::append_variant_bytes
? Hopefully, that would also (eventually) map more nicely to a view-based approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a better way, but I don't think that's possible here? new_variant
here is of type Variant
, which is a decoded variant value. So it doesn't really retain the raw bytes of the metadata/value from which it was constructed. It does retain the metadata and value bytes in case of objects/lists, but that case isn't working now anyway.
One way would be to operate on metadata/value byte slices directly, even for the get_path
operation. That way, we'll be left with a value
byte slice that we can copy directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I tried this out for copying the buffers directly in case of list/object. We're still decoding and encoding in case of top-level primitives though.
It's in a separate branch for now. The test is failing because equality on variant is also checking for metadata buffer equality. Here it is: Samyak2@00bd07d
If this is what you meant, I'll find a way to make the tests pass and add it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that didn't really change anything, unless the builder code to append a Variant::Object is somehow inefficient.
As you said, we'd need to rework the whole path to not instantiate the Variant
instance get_path
ends on, which would add significant complexity. Probably not worth the trouble at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless the builder code to append a Variant::Object is somehow inefficient
It doesn't exist at this point :)
But yeah, I'm guessing it would do the same thing. I have not reviewed that PR yet.
In parquet-variant: - Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it). - Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply. In parquet-variant-compute: - Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type. - Includes some basic unit tests. - Includes a simple micro-benchmark for reference. Current limitations: - It can only return another VariantArray. Casts are not implemented yet. - Only top-level object/list access is supported. It panics on finding a nested object/list. Needs apache#7914 to fix this. - Perf is a TODO.
I think I have incorporated all the changes. Let me know if I missed anything. |
|
||
let mut builder = VariantArrayBuilder::new(variant_array.len()); | ||
for i in 0..variant_array.len() { | ||
let new_variant = variant_array.value(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized something -- when dealing with shredded variants, this value
method will do a lot of work to unshred and encode the whole thing (see e.g. #7915 (comment)). And that work is not memoized anywhere unless the caller is able to do so. For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray
itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to understand this -- is this what you're suggesting?
- We add a new
get_path
method inVariantArray
that does this:- For each row, look up the type of the variant and perform the pathing without decoding. So, for example, if it's a
VariantObject
and there's aVariantPathElement::Field
path, it would get the offset for the given field (not sure how yet) and advance thevalue
slice by that much. - We would then create a new
VariantArray
with the metadata directly copied over and the value copied starting from the advanced slice. - For shredded variants, if the path ends up on a shredded value, what would be the expected behavior? I'm guessing that the shredded fields will be represented as an Array of the concrete type (an
Int32Array
for example) and not aVariantArray
. Will we wrap these in aVariantArray
and send it back? This is one case where having the path + cast in the same operation would help.
- For each row, look up the type of the variant and perform the pathing without decoding. So, for example, if it's a
- This
variant_get
would then simply re-useVariantArray::get_path
and perform the appropriate cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.
I think @scovich is saying that the variant_get kernel (on VariantArray
should have a special case that knows how to look for a shredded sub field -- and if for example it is asking for a
and the the typed_value.a
column exists, variant_get could simply return that a
column (already as an arrow array, no actual Variant manipulation required)
- I filed [Variant] Support
variant_get
kernel for shredded variants #7941 to track this
/// if `as_type` is `Some(type)` the field is returned as the specified type if possible. To specify returning | ||
/// a Variant, pass a Field with variant type in the metadata. | ||
pub as_type: Option<Field>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought None
was the way to request a variant result?
But meanwhile -- I tend to agree with @alamb that we should separate out casting from pathing. If get_path
anyway has to return a Variant
, somebody (ie arrow array builder) who actually cares about the type will still have to match on the result even if they just panic in the "impossible" case of an unexpected enum variant. Seems better to let the caller do the casting themselves since they probably have a strongly-typed place to put the result.
As a bonus, that would also mean we don't need to define GetOptions
at this low level -- get_path
only needs to know about VariantPath
. Arrow variant compute and the array builders could worry about the rest of the options space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But meanwhile -- I tend to agree with @alamb that we should separate out casting from pathing. If get_path anyway has to return a Variant, somebody (ie arrow array builder) who actually cares about the type will still have to match on the result even if they just panic in the "impossible" case of an unexpected enum variant. Seems better to let the caller do the casting themselves since they probably have a strongly-typed place to put the result.
This makes sense. But I'm not sure what you're suggesting here? Should we have a version of variant_get
here that doesn't do casting and always returns a VariantArray
?
As a bonus, that would also mean we don't need to define GetOptions at this low level -- get_path only needs to know about VariantPath. Arrow variant compute and the array builders could worry about the rest of the options space.
Variant::get_path
already only knows about VariantPath
. GetOptions
is defined in this parquet-variant-compute
crate. Or are you suggesting that we add a get_path
method to VariantArray
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought None was the way to request a variant result?
Right. I had copied that comment from the issue and forgot to update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
let mut builder = VariantArrayBuilder::new(variant_array.len()); | ||
for i in 0..variant_array.len() { | ||
let new_variant = variant_array.value(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.
I think @scovich is saying that the variant_get kernel (on VariantArray
should have a special case that knows how to look for a shredded sub field -- and if for example it is asking for a
and the the typed_value.a
column exists, variant_get could simply return that a
column (already as an arrow array, no actual Variant manipulation required)
- I filed [Variant] Support
variant_get
kernel for shredded variants #7941 to track this
} | ||
} | ||
|
||
/// Element of a path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is lovely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I... don't remember writing this lol
I'll fix this
/// Return a new Variant with the path followed. | ||
/// | ||
/// If the path is not found, `None` is returned. | ||
pub fn get_path(&self, path: &VariantPath) -> Option<Variant> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I merged up from main and fixed a logical conflict. Thanks again @Samyak2 -- once this PR passes CI I'll merge it in |
Close/open to trigger CI |
🚀 |
Thanks again @Samyak2 -- this is a great initial PR and a good start to variant_get |
I made a small follow up PR here: |
And it seems @carpecodeum is already using this PR as a base 🥳 |
…ath (#7942) # Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Follow on to #7919 # Rationale for this change While reviewing #7919 from @Samyak2 I found I wanted to write some additional tests directly for `Variant::get_path` When I started doing that I found it was somewhat awkward to write examples, so I added some new conversion routines to make it easier. # What changes are included in this PR? 1. Add doc examples (and thus tests) of `VaraintGet` and `VariantPath` 2. Add more documentation # Are these changes tested? Yes, by doc examples which run in CI # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out.
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Part of #7911 - Part of #6736 - Follow on to #7905 # Rationale for this change I wrote benchmark some changes to the json decoder in #7911 but they are non trivial. To keep #7911 easier to review I have pulled the benchmarks out to their own PR # What changes are included in this PR? 1. Add new json benchmark 2. Include the `variant_get` benchmark added in #7919 by @Samyak2 # Are these changes tested? I tested them manually and clippy CI coverage ensures they compile # Are there any user-facing changes? No these are only benchmarks
Which issue does this PR close?
variant_get
kernel #7893What changes are included in this PR?
In parquet-variant:
Variant::get_path
: this traverses the path to create a new Variant (does not cast any of it).parquet_variant::path
: adds structs/enums to define a path to access a variant value deeply.In parquet-variant-compute:
variant_get
: does the path traversal over aVariantArray
. In the future, this would also cast the values to a specified type.Current limitations:
VariantBuilder
#7914 to fix this.Are these changes tested?
Some basic unit tests are added.
Are there any user-facing changes?
Yes