[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915

scovich · 2025-07-11T23:03:43Z

Which issue does this PR close?

Pathfinding related to the unshredding aspects of

[Variant] Add low level support for shredding and unshredding #7715 (comment)

Other pathfinding relevant to unshredding, that this PR does not attempt:

Ability for a builder to wrap an existing buffer (which a variant array builder could use to pack multiple variants into the same slice of memory)
Ability to insert raw bytes of a variant as an object field or array element. For uniformity, the top-level builder may as well support the same.

Rationale for this change

The variant shredding spec requires that the top-level variant metadata already contains all field names, whether those were shredded or not. That means an unshredding operation must use the existing VariantMetadata for the column, and there's no point paying to build a new one.

What changes are included in this PR?

Pathfinding change, where MetadataBuilder is renamed as DefaultMetadataBuilder that implements a new MetadataBuilder trait, which VariantMetadata can also implement. VariantBuilder then takes a type parameter for the specific M: MetadataBuilder that it uses.

For convenience, VariantBuilder becomes a type alias for GenericVariantBuilder<DefaultMetadataBuilder>, so most callers don't need to care.

This change also requires several previously-infallible methods to become fallible, because attempting to insert a non-existent field name must fail.

Are these changes tested?

Existing tests should cover the refactoring.

A couple new tests lightly exercise the new read-only metadata.

Are there any user-facing changes?

A whole bunch:

Methods become fallible
MetadataBuilder is now generic and may not impl Default any more
etc.

…ment

scovich · 2025-07-11T23:04:35Z

Attn @alamb @friendlymatthew - I'd love to know your initial reactions to this change

scovich · 2025-07-12T17:09:38Z

parquet-variant/src/builder.rs

        if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields {
            self.duplicate_fields.insert(field_id);
        }


Now that insert is fallible, we can fail-fast instead of tracking a set of duplicate fields for finish to worry about:

Suggested change

if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields {

self.duplicate_fields.insert(field_id);

}

if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields {

return Err(... duplicate field {key} ...);

}

alamb · 2025-07-12T19:14:13Z

Ability for a builder to wrap an existing buffer (which a variant array builder could use to pack multiple variants into the same slice of memory)

I have some proposal of that API in this PR

[Variant] Add VariantBuilder::new_with_buffers to write to existing buffers #7912

alamb

Thank you @scovich -- it is neat to see progress towards shredding

I don't fully understand where you are heading with this PR -- specifically it seems like the assumption in this PR is that we would use a VariantBuilder to recreate a shredded Variant.

What I was imagining is that we are able to return a Variant that is a read only view into the underlying buffer(s) from VariantArray::value() -- we may have to add to the Variant enum like Variant::ShreddedObject or something

However, if we can figure that out, then I think the "unshredding" operaton simply looks like calling VariantArray::value() and then making a builder to write into a new VariantArray (without shredding)

Now that @friendlymatthew is close to supporting appending complex variants directly in the builder, I think it could be quite elegant

#7914

Maybe once we merge #7905 we could work on figuring out how to access shredded variants (aka manually construct StructArrays) -- and then once we had that worked out, we can work out how to unshred them

alamb · 2025-07-12T19:18:21Z

parquet-variant/src/builder.rs

    fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) {
        for field_name in iter {
            self.upsert_field_name(field_name.as_ref());
        }
    }
 }

+/// Read-only metadata builder that validates field names against an existing metadata dictionary


I don't really understand the usecase for ReadOnlyMetadataBuilder? If a shredded variant is already guaranteed to have the correct field names, it means we don't need to update the variant or build a new one so I am not sure even how a VariantBuilder would be involved 🤔

An unshredding operation requires two kinds of variant "building":

Shredded fields need a full blown variant builder, because they're strongly typed and we need to encode them as variant. But the existing VariantMetadata already contains all their field names, so the builder only needs to look up their field ids. The spec requires the ability to unshred using a read-only metadata dictionary, so why go copying it?

Unshredded fields can be copied as-is (I would honestly prefer copying raw bytes instead of parsing and then unparsing them the way [Variant] Support appending complex variants in VariantBuilder #7914 seems to favor). We need a builder to keep track of the updated offsets as we interleave existing variant value bytes with newly-unshredded value bytes.

Imagine a partially shredded variant column v:

Fields a, m and x live in the typed_value column as a perfectly shredded struct

Furthermore, x is itself a shredded struct with its own fields i and j

Furthermore, i is a variant column (also using the same top-level metadata)

Furthermore, j is a partially shredded struct

Fields b, n, and y live in the value column as a variant object

When unshredding, we need to turn a, b, and x from strongly typed values into variant objects -- the latter recursively so -- which requires a variant builder. The recursive unshredding of j also requires a builder of its own.

Once we have all the field values in variant form, we need to create a new variant object for v. We can copy the bytes of b, n and y as a contiguous block if we want -- relying on field id sorting to give the required ordering a, b, m, n, x, y -- but we do have to copy those bytes in order to inject the newly created variant value bytes for a, m, and x. Plus, it's entirely possible other newly-unshredded sibling values before and after v would have required copying its value bytes even if v were completely unshredded.

Shredded fields need a full blown variant builder, because they're strongly typed and we need to encode them as variant.

When unshredding, we need to turn a, b, and x from strongly typed values into variant objects -- the latter recursively so -- which requires a variant builder. The recursive unshredding of j also requires a builder of its own.

I see -- you are imagining unshredding the varaint phiscally (aka forming the bytes for a new unshredded variant value using a VariantBuilder) to access the shredded fields.

What if we made a view into the variant so we didn't have to copy anything to read the Variant? If we could get a Variant that accessed the shredded fields, physically unshredding becomes an exercise in creating a new builder and calling append_value:

Hypothetical unshred kernel

// Given a shredded variant that is itself a Variant and a view on the shredded columns fn unshred(shredded_variant: Varaint) -> (Vec<u8>, Vec<u8>) { // deep copy the shredded variant let mut variant_builder = VariantBuilder::new() variant_builder.append_value(shredded_variant) }

Shredded Variants as "Views"

The question then is how to implement a view. Accessing primitive types as a Variant is straightforward

fn shredded_int_as_variant(value: i64) -> Variant { Varant::from(value) } // ... other primitive types here .. fn shredded_str_as_variant(value: &str) -> Variant { Variant::from(value) }

The tricky bit is accessing partially shredded objects. Maybe we could add the shredded fields to VariantObject, something like

struct VariantObject { // ... existing fields .. // fields that were shredded and live elsewhere shredded_fields: IndexMap<&str, Variant> }

Then we can update all the accessor methods, etc.

An alternate mechanism that might be more explicit would a new explicit enumeration in Variant like

struct ShreddedObject { object: VariantObject, shredded_fields: IndexMap<&str, Variant> } enum Variant { ShreddedObject(..) }

We could Box the shredded fields to keep the size of Variant down if needed

Hows does this handle the example?

Imagine a partially shredded variant column v:

Fields a, m and x live in the typed_value column as a perfectly shredded struct

Furthermore, x is itself a shredded struct with its own fields i and j

Furthermore, i is a variant column (also using the same top-level metadata)

Furthermore, j is a partially shredded struct

Fields b, n, and y live in the value column as a variant object

I think the view approach could handle this example. For example

v is a Variant::Object(..) with shredded_fields

v 's fields in the value_column contain b, n and y

v::shredded_fields contains the Variants for a, m, and x

`x` is a `VariantObject()` ...

Hi, how do we plan on storing typed_values? Do we plan on encoding it as a Variant and storing the binary representation of it? Or since it is supposed to be strongly typed, we want to map to specific DataTypes, which would require something like #7921

I've been playing around with the unshredding logic and the most naive version would look something like this. I don't mean this to be the actual implementation, but I am curious how we plan on representing typed_value.

naive reconstruct logic

pub fn reconstruct_variant( metadata: &[u8], value: Option<&[u8]>, typed_value: Option<(&[u8], &[u8])>, // this is itself a variant ) -> Result<(Vec<u8>, Vec<u8>), ArrowError> { match typed_value { Some((typed_metadata, typed_value)) => { let variant = Variant::try_new(typed_metadata, typed_value)?; match variant { Variant::Null | Variant::Int8(_) | Variant::Int16(_) | Variant::Int32(_) | Variant::Int64(_) | Variant::Date(_) | Variant::TimestampMicros(_) | Variant::TimestampNtzMicros(_) | Variant::Decimal4(_) | Variant::Decimal8(_) | Variant::Decimal16(_) | Variant::Float(_) | Variant::Double(_) | Variant::BooleanTrue | Variant::BooleanFalse | Variant::Binary(_) | Variant::String(_) | Variant::ShortString(_) => Ok((typed_metadata.to_vec(), typed_value.to_vec())), Variant::Object(shredded_object) => { if let Some(value) = value { let variant = Variant::try_new(metadata, value)?; let partially_shredded_object = variant.as_object().ok_or(ArrowError::InvalidArgumentError( "partially shredded value must be an object".to_string(), ))?; let shredded_keys: HashSet<&str> = HashSet::from_iter(shredded_object.iter().map(|(k, _)| k)); let partially_shredded_keys: HashSet<&str> = HashSet::from_iter(partially_shredded_object.iter().map(|(k, _)| k)); if !shredded_keys.is_disjoint(&partially_shredded_keys) { return Err(ArrowError::InvalidArgumentError( "object keys must be disjoint".to_string(), )); } // union the two objects together let mut variant_builder = VariantBuilder::new(); let mut object_builder = variant_builder.new_object(); for (f, v) in shredded_object.iter() { object_builder.insert(f, v); } for (f, v) in partially_shredded_object.iter() { object_builder.insert(f, v); } object_builder.finish()?; return Ok(variant_builder.finish()); } Ok((typed_metadata.to_vec(), typed_value.to_vec())) } Variant::List(_) => { if value.is_some() { return Err(ArrowError::InvalidArgumentError( "shredded array must not conflict with variant value".to_string(), )); } return Ok((typed_metadata.to_vec(), typed_value.to_vec())); } } } None => match value { Some(value) => Ok((metadata.to_vec(), value.to_vec())), None => Err(ArrowError::InvalidArgumentError( "No value or typed value provided".to_string(), )), }, } }

If we strongly type typed_value to whatever the variant type is, the primitive variants work nicely, since we'd be storing a PrimitiveArray<T> and we can easily index into it.

For complex variants like objects/lists, I'm having a hard time figuring out how to strongly type these values. Maybe for these cases we should just store the binary encoded version of a Variant? (metadata, value).

how do we plan on storing typed_values? Do we plan on encoding it as a Variant and storing the binary representation of it?

As I understand it, we don't need to store typed_value in the first place.

This is low-level variant code here, and a given variant builder only knows about the unshredded bits (it's trying to build up a value column). Whoever manages the builder will be in charge of the shredded bits, and is also in charge of providing an appropriate metadata "builder" for the metadata column. I think @alamb is proposing in this specific case to convert the typed_value into a proper variant, but then "merge" it into the existing variant using a view, instead of copying (some of my thoughts on that below).

Aside: I think there are actually three kinds of metadata builders

Today's traditional MetadataBuilder (renamed here as DefaultMetadataBuilder)

Used to build up a top-level variant. Every newly encountered path is an insertion into that metadata dictionary

A metadata builder used only for unshredding a typed_value column (ReadOnlyMetadataBuilder here)

A "borrowed" parent metadata builder (used during shredding operations, not present in this PR)

See [Variant] Define shredding schema for VariantArrayBuilder #7921 (comment)

Used for building up variant columns that happen to live inside the typed_value column of some shredded parent variant ancestor column, since the spec requires that outer metadata dictionary to contain all paths reachable through that ancestor variant column.

What if we made a view into the variant so we didn't have to copy anything to read the Variant?

Interesting. Trying to wrap my head around this idea, so what follows is likely a bit rambly:

What do we gain by using views at this low level?

At the high level, the VariantArray just stores an actual (strongly-typed) typed_value columns. It doesn't need to unshred at all, unless/until somebody executes a variant_get kernel with a path that ends inside the typed_value column.

In that case, we have no choice but to unshred that path, which is a plain old ordinary variant building exercise because the spec forbids both value and typed_value to contain the same path -- except the metadata column is already fully built and we can just reuse it as ReadOnlyMetadata.

If somebody does a variant_get on any path that ends inside the value column, we just extract that variant as normal (the shredded bits provably don't matter).

If somebody actually asks to unshred the entire top-level column (e.g. invokes unshred_variant in preparation for a write), the result is a simple struct-of-binary columns. No (low level) Variant anywhere in sight -- they're an internal detail of that kernel.

If we're anyway forced to write out normal variant bytes, where/how does the view help along the way?

Nesting will be "fun"

In the example from my comment on the other PR (linked above), one could have a (horrible) nested scenario like: v.typed_value.a.b.c.w.typed_value.x.y.z.u: VARIANT. If we tried to unshred that using views, I'm pretty sure it would quickly fall apart, because Variant is ultimately already a view on top of the (one) underlying &[u8] it was built from. So there's no way a field of a Variant::ShreddedObject could itself contain other Variant::ShreddedObject.

Intuitive claim (unsubstantiated):

If we could get away with "views" here on the unshredding path, why can't the native builder get away with "views" internally to avoid all the repeated copying of bytes as we unwind nested builders?

But every time I've tried to get fancy with copy-free nested builders, it gets ugly fast.

So (again, intuitively), I would conclude that any attempt at views here would also get ugly fast.

Or, if it turns out we can elegantly solve it here, let's be sure to apply the solution there!

Isolated reaction: I would favor Variant::ShreddedObject as its own enum variant, rather than making Variant::Object more complex. They're pretty fundamentally different cases.

... and then the VariantArray's variant builder accepts the resulting Variant::ShreddedObject and actually turns it to bytes?

... at which point the ReadOnlyMetadataBuilder becomes useful, because that final operation shouldn't allocate any new field ids.

Observation: I think the Variant::ShreddedObject is mostly (only?) useful on the read (unshredding) path; a writer who wants to produce shredded variant almost certainly doesn't want to work with this API because it's ultimately just a collection of variant values that were probably strongly typed at first and now have to pay a round trip through a Variant? Shredding would be able to consume these objects, e.g. if re-shredding something that was previously shredded, but it's not the preferred approach for turning e.g. a StructArray directly into VariantArray.

Second observation: How helpful is it, to have the Variant::ShreddedObject::value member? Should we just extract its fields directly into the map instead? Then we just have to sort the map and write them out again. We might also want to have the map key off field ids instead of &str, using lookups into the top-level VariantMetadata object. That way, the final to-bytes transformation doesn't need a metadata builder at all.

Observation: I think the Variant::ShreddedObject is mostly (only?) useful on the read (unshredding) path; a writer who wants to produce shredded variant almost certainly doesn't want to work with this API because it's ultimately just a collection of variant values that were probably strongly typed at first and now have to pay a round trip through a Variant?

Yes I agree this would be mostly for read (rather than on the write path)

Shredding would be able to consume these objects, e.g. if re-shredding something that was previously shredded, but it's not the preferred approach for turning e.g. a StructArray directly into VariantArray.

Right

scovich added 2 commits July 11, 2025 09:31

[Variant] Make MetadataBuilder a trait that VariantMetadata can imple…

8f5133a

…ment

Add the actual ReadOnlyMetadataBuilder

0fddd42

github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2025

clippy + fmt

b1d8c53

scovich commented Jul 12, 2025

View reviewed changes

alamb reviewed Jul 12, 2025

View reviewed changes

alamb mentioned this pull request Jul 13, 2025

[Variant] Add low level support for shredding and unshredding #7715

Open

scovich mentioned this pull request Jul 15, 2025

[Variant] Add variant_get compute kernel #7919

Merged

alamb mentioned this pull request Jul 16, 2025

[Variant] Support variant_get kernel for shredded variants #7941

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915

[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915

Uh oh!

scovich commented Jul 11, 2025 •

edited

Loading

Uh oh!

scovich commented Jul 11, 2025

Uh oh!

scovich Jul 12, 2025

Uh oh!

alamb commented Jul 12, 2025

Uh oh!

alamb left a comment •

edited

Loading

Uh oh!

alamb Jul 12, 2025

Uh oh!

scovich Jul 12, 2025 •

edited

Loading

Uh oh!

alamb Jul 13, 2025

Uh oh!

friendlymatthew Jul 13, 2025

Uh oh!

scovich Jul 14, 2025 •

edited

Loading

Uh oh!

scovich Jul 14, 2025

Uh oh!

scovich Jul 14, 2025

Uh oh!

scovich Jul 14, 2025 •

edited

Loading

Uh oh!

scovich Jul 14, 2025 •

edited

Loading

Uh oh!

alamb Jul 15, 2025

Uh oh!

Uh oh!

[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915

Are you sure you want to change the base?

[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915

Uh oh!

Conversation

scovich commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

scovich commented Jul 11, 2025

Uh oh!

scovich Jul 12, 2025

Choose a reason for hiding this comment

Uh oh!

alamb commented Jul 12, 2025

Uh oh!

alamb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb Jul 12, 2025

Choose a reason for hiding this comment

Uh oh!

scovich Jul 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb Jul 13, 2025

Choose a reason for hiding this comment

Hypothetical unshred kernel

Shredded Variants as "Views"

Hows does this handle the example?

Uh oh!

friendlymatthew Jul 13, 2025

Choose a reason for hiding this comment

Uh oh!

scovich Jul 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scovich Jul 14, 2025

Choose a reason for hiding this comment

Uh oh!

scovich Jul 14, 2025

Choose a reason for hiding this comment

Uh oh!

scovich Jul 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scovich Jul 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb Jul 15, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

scovich commented Jul 11, 2025 •

edited

Loading

alamb left a comment •

edited

Loading

scovich Jul 12, 2025 •

edited

Loading

Hypothetical `unshred` kernel

scovich Jul 14, 2025 •

edited

Loading

scovich Jul 14, 2025 •

edited

Loading

scovich Jul 14, 2025 •

edited

Loading