Skip to content

[Variant] Add variant_get compute kernel #7919

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 16, 2025
Merged

Conversation

Samyak2
Copy link
Contributor

@Samyak2 Samyak2 commented Jul 12, 2025

Which issue does this PR close?

What changes are included in this PR?

In parquet-variant:

  • Add a new function Variant::get_path: this traverses the path to create a new Variant (does not cast any of it).
  • Add a new module parquet_variant::path: adds structs/enums to define a path to access a variant value deeply.

In parquet-variant-compute:

  • Add a new compute kernel variant_get: does the path traversal over a VariantArray. In the future, this would also cast the values to a specified type.
  • Includes some basic unit tests. Not comprehensive.
  • Includes a simple micro-benchmark for reference.

Current limitations:

Are these changes tested?

Some basic unit tests are added.

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 12, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Samyak2 -- this is looking like a great start. I left some ideas on how to break it up / make progress with smaller PRs.

fyi @scovich and @friendlymatthew

use crate::utils::variant_from_struct_array;

/// Returns an array with the specified path extracted from the variant values.
pub fn variant_get(input: &ArrayRef, options: GetOptions) -> Result<ArrayRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While trying to review this PR, one thing that I thought might help could be to separate out the Variant traversal and the construction of the output

For example, perhaps you could implement a function to find the appropriate sub Variant like

impl Variant {
  fn get_path(&self, path: &VariantPath) -> Option<Variant> {
    ...
  }
}

With that building block, you could implement the basic "extract a variant" kernel to a new VariantArray using the newly introduced VariantArrayBuilder

let mut output_builder = VariantArrayBuilder::new();
// loop over each input row
for input_variant in input_variant_array.iter() {
  // copy the input variant to the output builder
  let mut vb = VariantBuilder::new()
  vb.append(input_variant)
  let (metadata, value) = vb.build();
  output_builder.append_buffers()
}

I think once we get #7914 from @friendlymatthew in that should work

One downside of this approach is that it will copy the variant over and if the source and destinations are both BinaryViewArray we can probably be much more clever

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like the "row-based" approach I mentioned in the description. The reason I did it this way: it looks more like a "columnar" approach. I assumed perf would be better. Perhaps we should benchmark both the approaches? I'll see what I can do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the row based approach will likely be slower for non-shredded variants, but it will always potentially be needed in some cases (for example when the source arrays are not BinaryView)

If we have the get_path method, I think we can potentially implement fast copies of variants by playing games with pointers -- basically by checking if the return variant has a pointer into the same buffer of the BinaryViewArray we can make a view that points there.

However, i think that will be somewhat tricky and require some unsafe so I suggest we get the first version in plae that does the copy, and once we have it working (and tests written, etc) then I think we'll be in a better position to optimize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. That makes sense. I'll clean up the PR.

but it will always potentially be needed in some cases (for example when the source arrays are not BinaryView)

Not sure I follow. Perhaps worth a separate discussion in a follow-up issue.

@Samyak2 Samyak2 force-pushed the variant-get branch 2 times, most recently from 0925a6a to 184e1bf Compare July 13, 2025 19:00
@Samyak2 Samyak2 marked this pull request as ready for review July 13, 2025 19:03
@Samyak2
Copy link
Contributor Author

Samyak2 commented Jul 13, 2025

I have made the changes, rebased and squashed. Also updated the description to reflect the current implementation. I think it is ready for review. Please take a look!

My next steps: I will add some more unit tests if I get time over the next couple of days. In case the PR is merged before that, I'll open a separate PR for the tests.

@Samyak2 Samyak2 changed the title <WIP> variant_get compute kernel Add variant_get compute kernel Jul 13, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Samyak2 -- I think this is a great first step. I agree there are a bunch of TODOs but I also think we could merge this as is and then iterate on the output

I am curious about your thoughts on the API design @scovich and @friendlymatthew

cc @PinkCrow007 @mprammer


#[test]
#[should_panic(
expected = "Nested values are handled specially by ObjectBuilder and ListBuilder"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 yeah I think once we get this PR in things will be much easier

@alamb alamb changed the title Add variant_get compute kernel [Variant] Add variant_get compute kernel Jul 14, 2025
@Samyak2 Samyak2 force-pushed the variant-get branch 2 times, most recently from dc5d138 to 4a22812 Compare July 14, 2025 17:15
@Samyak2
Copy link
Contributor Author

Samyak2 commented Jul 14, 2025

Made the following changes:

  • Moved to a non-owned Cow string like @alamb originally suggested in the issue
  • Added more tests
  • Fixed formatting and missing license header issues

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good step forward.

I'm still on the fence whether it's better to store (and traverse) a separate path for each desired leaf (e.g. when shredding to a specific schema), but for a single-field "get" this seems like the obvious right approach.

Comment on lines 53 to 57
let new_variant = new_variant.get_path(&options.path);
if let Some(new_variant) = new_variant {
// TODO: we're decoding the value and doing a copy into a variant value again. This
// copy can be much smarter.
builder.append_variant(new_variant);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just do a directly byte-slice copy? The input is either fully validated or the caller has asserted that the bytes are correct. Maybe a VariantArray::value_bytes method and then VariantArrayBuilder::append_variant_bytes? Hopefully, that would also (eventually) map more nicely to a view-based approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a better way, but I don't think that's possible here? new_variant here is of type Variant, which is a decoded variant value. So it doesn't really retain the raw bytes of the metadata/value from which it was constructed. It does retain the metadata and value bytes in case of objects/lists, but that case isn't working now anyway.

One way would be to operate on metadata/value byte slices directly, even for the get_path operation. That way, we'll be left with a value byte slice that we can copy directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I tried this out for copying the buffers directly in case of list/object. We're still decoding and encoding in case of top-level primitives though.

It's in a separate branch for now. The test is failing because equality on variant is also checking for metadata buffer equality. Here it is: Samyak2@00bd07d

If this is what you meant, I'll find a way to make the tests pass and add it here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that didn't really change anything, unless the builder code to append a Variant::Object is somehow inefficient.
As you said, we'd need to rework the whole path to not instantiate the Variant instance get_path ends on, which would add significant complexity. Probably not worth the trouble at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless the builder code to append a Variant::Object is somehow inefficient

It doesn't exist at this point :)

But yeah, I'm guessing it would do the same thing. I have not reviewed that PR yet.

In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply.

In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type.
- Includes some basic unit tests.
- Includes a simple micro-benchmark for reference.

Current limitations:
- It can only return another VariantArray. Casts are not implemented yet.
- Only top-level object/list access is supported. It panics on finding a nested object/list. Needs apache#7914 to fix this.
- Perf is a TODO.
@Samyak2
Copy link
Contributor Author

Samyak2 commented Jul 15, 2025

I think I have incorporated all the changes. Let me know if I missed anything.


let mut builder = VariantArrayBuilder::new(variant_array.len());
for i in 0..variant_array.len() {
let new_variant = variant_array.value(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized something -- when dealing with shredded variants, this value method will do a lot of work to unshred and encode the whole thing (see e.g. #7915 (comment)). And that work is not memoized anywhere unless the caller is able to do so. For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to understand this -- is this what you're suggesting?

  • We add a new get_path method in VariantArray that does this:
    • For each row, look up the type of the variant and perform the pathing without decoding. So, for example, if it's a VariantObject and there's a VariantPathElement::Field path, it would get the offset for the given field (not sure how yet) and advance the value slice by that much.
    • We would then create a new VariantArray with the metadata directly copied over and the value copied starting from the advanced slice.
    • For shredded variants, if the path ends up on a shredded value, what would be the expected behavior? I'm guessing that the shredded fields will be represented as an Array of the concrete type (an Int32Array for example) and not a VariantArray. Will we wrap these in a VariantArray and send it back? This is one case where having the path + cast in the same operation would help.
  • This variant_get would then simply re-use VariantArray::get_path and perform the appropriate cast.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.

I think @scovich is saying that the variant_get kernel (on VariantArray should have a special case that knows how to look for a shredded sub field -- and if for example it is asking for a and the the typed_value.a column exists, variant_get could simply return that a column (already as an arrow array, no actual Variant manipulation required)

Comment on lines 73 to 75
/// if `as_type` is `Some(type)` the field is returned as the specified type if possible. To specify returning
/// a Variant, pass a Field with variant type in the metadata.
pub as_type: Option<Field>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought None was the way to request a variant result?

But meanwhile -- I tend to agree with @alamb that we should separate out casting from pathing. If get_path anyway has to return a Variant, somebody (ie arrow array builder) who actually cares about the type will still have to match on the result even if they just panic in the "impossible" case of an unexpected enum variant. Seems better to let the caller do the casting themselves since they probably have a strongly-typed place to put the result.

As a bonus, that would also mean we don't need to define GetOptions at this low level -- get_path only needs to know about VariantPath. Arrow variant compute and the array builders could worry about the rest of the options space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But meanwhile -- I tend to agree with @alamb that we should separate out casting from pathing. If get_path anyway has to return a Variant, somebody (ie arrow array builder) who actually cares about the type will still have to match on the result even if they just panic in the "impossible" case of an unexpected enum variant. Seems better to let the caller do the casting themselves since they probably have a strongly-typed place to put the result.

This makes sense. But I'm not sure what you're suggesting here? Should we have a version of variant_get here that doesn't do casting and always returns a VariantArray?

As a bonus, that would also mean we don't need to define GetOptions at this low level -- get_path only needs to know about VariantPath. Arrow variant compute and the array builders could worry about the rest of the options space.

Variant::get_path already only knows about VariantPath. GetOptions is defined in this parquet-variant-compute crate. Or are you suggesting that we add a get_path method to VariantArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought None was the way to request a variant result?

Right. I had copied that comment from the issue and forgot to update it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Samyak2 -- I am going to try and get #7914 merged and then will get this one merged too


let mut builder = VariantArrayBuilder::new(variant_array.len());
for i in 0..variant_array.len() {
let new_variant = variant_array.value(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.

I think @scovich is saying that the variant_get kernel (on VariantArray should have a special case that knows how to look for a shredded sub field -- and if for example it is asking for a and the the typed_value.a column exists, variant_get could simply return that a column (already as an arrow array, no actual Variant manipulation required)

}
}

/// Element of a path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is lovely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I... don't remember writing this lol
I'll fix this

/// Return a new Variant with the path followed.
///
/// If the path is not found, `None` is returned.
pub fn get_path(&self, path: &VariantPath) -> Option<Variant> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Jul 16, 2025

I merged up from main and fixed a logical conflict. Thanks again @Samyak2 -- once this PR passes CI I'll merge it in

@alamb
Copy link
Contributor

alamb commented Jul 16, 2025

Close/open to trigger CI

@alamb alamb closed this Jul 16, 2025
@alamb alamb reopened this Jul 16, 2025
@alamb
Copy link
Contributor

alamb commented Jul 16, 2025

🚀

@alamb alamb merged commit d4c0a32 into apache:main Jul 16, 2025
17 of 25 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 16, 2025

Thanks again @Samyak2 -- this is a great initial PR and a good start to variant_get

@alamb
Copy link
Contributor

alamb commented Jul 16, 2025

I made a small follow up PR here:
-#7942

@alamb
Copy link
Contributor

alamb commented Jul 16, 2025

And it seems @carpecodeum is already using this PR as a base 🥳

alamb added a commit that referenced this pull request Jul 17, 2025
…ath (#7942)

# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Follow on to #7919

# Rationale for this change

While reviewing #7919 from
@Samyak2 I found I wanted to write some additional tests directly for
`Variant::get_path`

When I started doing that I found it was somewhat awkward to write
examples, so I added some new conversion routines to make it easier.

# What changes are included in this PR?

1. Add doc examples (and thus tests) of `VaraintGet` and `VariantPath`
2. Add more documentation

# Are these changes tested?
Yes, by doc examples which run in CI
# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
alamb added a commit that referenced this pull request Jul 18, 2025
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Part of #7911
- Part of  #6736
- Follow on to #7905


# Rationale for this change

I wrote benchmark some changes to the json decoder in
#7911 but they are non trivial.
To keep #7911 easier to review I
have pulled the benchmarks out to their own PR

# What changes are included in this PR?

1. Add new json  benchmark
2. Include the `variant_get` benchmark added in
#7919 by @Samyak2
# Are these changes tested?

I tested them manually and clippy CI coverage ensures they compile

# Are there any user-facing changes?

No these are only benchmarks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant][Compute] variant_get kernel
3 participants