Skip to content

[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Jul 11, 2025

Which issue does this PR close?

Pathfinding related to the unshredding aspects of

Other pathfinding relevant to unshredding, that this PR does not attempt:

  • Ability for a builder to wrap an existing buffer (which a variant array builder could use to pack multiple variants into the same slice of memory)
  • Ability to insert raw bytes of a variant as an object field or array element. For uniformity, the top-level builder may as well support the same.

Rationale for this change

The variant shredding spec requires that the top-level variant metadata already contains all field names, whether those were shredded or not. That means an unshredding operation must use the existing VariantMetadata for the column, and there's no point paying to build a new one.

What changes are included in this PR?

Pathfinding change, where MetadataBuilder is renamed as DefaultMetadataBuilder that implements a new MetadataBuilder trait, which VariantMetadata can also implement. VariantBuilder then takes a type parameter for the specific M: MetadataBuilder that it uses.

For convenience, VariantBuilder becomes a type alias for GenericVariantBuilder<DefaultMetadataBuilder>, so most callers don't need to care.

This change also requires several previously-infallible methods to become fallible, because attempting to insert a non-existent field name must fail.

Are these changes tested?

Existing tests should cover the refactoring.

A couple new tests lightly exercise the new read-only metadata.

Are there any user-facing changes?

A whole bunch:

  • Methods become fallible
  • MetadataBuilder is now generic and may not impl Default any more
  • etc.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2025
@scovich
Copy link
Contributor Author

scovich commented Jul 11, 2025

Attn @alamb @friendlymatthew - I'd love to know your initial reactions to this change

Comment on lines 950 to 952
if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields {
self.duplicate_fields.insert(field_id);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that insert is fallible, we can fail-fast instead of tracking a set of duplicate fields for finish to worry about:

Suggested change
if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields {
self.duplicate_fields.insert(field_id);
}
if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields {
return Err(... duplicate field {key} ...);
}

@alamb
Copy link
Contributor

alamb commented Jul 12, 2025

Ability for a builder to wrap an existing buffer (which a variant array builder could use to pack multiple variants into the same slice of memory)

I have some proposal of that API in this PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich -- it is neat to see progress towards shredding

I don't fully understand where you are heading with this PR -- specifically it seems like the assumption in this PR is that we would use a VariantBuilder to recreate a shredded Variant.

What I was imagining is that we are able to return a Variant that is a read only view into the underlying buffer(s) from VariantArray::value() -- we may have to add to the Variant enum like Variant::ShreddedObject or something

However, if we can figure that out, then I think the "unshredding" operaton simply looks like calling VariantArray::value() and then making a builder to write into a new VariantArray (without shredding)

Now that @friendlymatthew is close to supporting appending complex variants directly in the builder, I think it could be quite elegant

Maybe once we merge #7905 we could work on figuring out how to access shredded variants (aka manually construct StructArrays) -- and then once we had that worked out, we can work out how to unshred them

fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) {
for field_name in iter {
self.upsert_field_name(field_name.as_ref());
}
}
}

/// Read-only metadata builder that validates field names against an existing metadata dictionary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the usecase for ReadOnlyMetadataBuilder? If a shredded variant is already guaranteed to have the correct field names, it means we don't need to update the variant or build a new one so I am not sure even how a VariantBuilder would be involved 🤔

Copy link
Contributor Author

@scovich scovich Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An unshredding operation requires two kinds of variant "building":

  • Shredded fields need a full blown variant builder, because they're strongly typed and we need to encode them as variant. But the existing VariantMetadata already contains all their field names, so the builder only needs to look up their field ids. The spec requires the ability to unshred using a read-only metadata dictionary, so why go copying it?
  • Unshredded fields can be copied as-is (I would honestly prefer copying raw bytes instead of parsing and then unparsing them the way [Variant] Support appending complex variants in VariantBuilder #7914 seems to favor). We need a builder to keep track of the updated offsets as we interleave existing variant value bytes with newly-unshredded value bytes.

Imagine a partially shredded variant column v:

  • Fields a, m and x live in the typed_value column as a perfectly shredded struct
    • Furthermore, x is itself a shredded struct with its own fields i and j
      • Furthermore, i is a variant column (also using the same top-level metadata)
      • Furthermore, j is a partially shredded struct
  • Fields b, n, and y live in the value column as a variant object

When unshredding, we need to turn a, b, and x from strongly typed values into variant objects -- the latter recursively so -- which requires a variant builder. The recursive unshredding of j also requires a builder of its own.

Once we have all the field values in variant form, we need to create a new variant object for v. We can copy the bytes of b, n and y as a contiguous block if we want -- relying on field id sorting to give the required ordering a, b, m, n, x, y -- but we do have to copy those bytes in order to inject the newly created variant value bytes for a, m, and x. Plus, it's entirely possible other newly-unshredded sibling values before and after v would have required copying its value bytes even if v were completely unshredded.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shredded fields need a full blown variant builder, because they're strongly typed and we need to encode them as variant.

When unshredding, we need to turn a, b, and x from strongly typed values into variant objects -- the latter recursively so -- which requires a variant builder. The recursive unshredding of j also requires a builder of its own.

I see -- you are imagining unshredding the varaint phiscally (aka forming the bytes for a new unshredded variant value using a VariantBuilder) to access the shredded fields.

What if we made a view into the variant so we didn't have to copy anything to read the Variant? If we could get a Variant that accessed the shredded fields, physically unshredding becomes an exercise in creating a new builder and calling append_value:

Hypothetical unshred kernel

// Given a shredded variant that is itself a Variant and a view on the shredded columns 
fn unshred(shredded_variant: Varaint) -> (Vec<u8>, Vec<u8>) {
  // deep copy the shredded variant
  let mut variant_builder = VariantBuilder::new()
  variant_builder.append_value(shredded_variant)
}

Shredded Variants as "Views"

The question then is how to implement a view. Accessing primitive types as a Variant is straightforward

fn shredded_int_as_variant(value: i64) -> Variant {
  Varant::from(value)
}
// ... other primitive types here ..
fn shredded_str_as_variant(value: &str) -> Variant {
  Variant::from(value)
}

The tricky bit is accessing partially shredded objects. Maybe we could add the shredded fields to VariantObject, something like

struct VariantObject {
 // ... existing fields ..
 // fields that were shredded and live elsewhere
 shredded_fields: IndexMap<&str, Variant>
}

Then we can update all the accessor methods, etc.

An alternate mechanism that might be more explicit would a new explicit enumeration in Variant like

struct ShreddedObject {
  object: VariantObject,
  shredded_fields: IndexMap<&str, Variant>
}

enum Variant {
  ShreddedObject(..)
}

We could Box the shredded fields to keep the size of Variant down if needed

Hows does this handle the example?

Imagine a partially shredded variant column v:

  • Fields a, m and x live in the typed_value column as a perfectly shredded struct

    • Furthermore, x is itself a shredded struct with its own fields i and j

      • Furthermore, i is a variant column (also using the same top-level metadata)
      • Furthermore, j is a partially shredded struct
  • Fields b, n, and y live in the value column as a variant object

I think the view approach could handle this example. For example

  • v is a Variant::Object(..) with shredded_fields
  • v 's fields in the value_column contain b, n and y
  • v::shredded_fields contains the Variants for a, m, and x
  • `x` is a `VariantObject()` ...
    

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, how do we plan on storing typed_values? Do we plan on encoding it as a Variant and storing the binary representation of it? Or since it is supposed to be strongly typed, we want to map to specific DataTypes, which would require something like #7921

I've been playing around with the unshredding logic and the most naive version would look something like this. I don't mean this to be the actual implementation, but I am curious how we plan on representing typed_value.

naive reconstruct logic
pub fn reconstruct_variant(
    metadata: &[u8],
    value: Option<&[u8]>,
    typed_value: Option<(&[u8], &[u8])>, // this is itself a variant
) -> Result<(Vec<u8>, Vec<u8>), ArrowError> {
    match typed_value {
        Some((typed_metadata, typed_value)) => {
            let variant = Variant::try_new(typed_metadata, typed_value)?;

            match variant {
                Variant::Null
                | Variant::Int8(_)
                | Variant::Int16(_)
                | Variant::Int32(_)
                | Variant::Int64(_)
                | Variant::Date(_)
                | Variant::TimestampMicros(_)
                | Variant::TimestampNtzMicros(_)
                | Variant::Decimal4(_)
                | Variant::Decimal8(_)
                | Variant::Decimal16(_)
                | Variant::Float(_)
                | Variant::Double(_)
                | Variant::BooleanTrue
                | Variant::BooleanFalse
                | Variant::Binary(_)
                | Variant::String(_)
                | Variant::ShortString(_) => Ok((typed_metadata.to_vec(), typed_value.to_vec())),
                Variant::Object(shredded_object) => {
                    if let Some(value) = value {
                        let variant = Variant::try_new(metadata, value)?;

                        let partially_shredded_object =
                            variant.as_object().ok_or(ArrowError::InvalidArgumentError(
                                "partially shredded value must be an object".to_string(),
                            ))?;

                        let shredded_keys: HashSet<&str> =
                            HashSet::from_iter(shredded_object.iter().map(|(k, _)| k));

                        let partially_shredded_keys: HashSet<&str> =
                            HashSet::from_iter(partially_shredded_object.iter().map(|(k, _)| k));

                        if !shredded_keys.is_disjoint(&partially_shredded_keys) {
                            return Err(ArrowError::InvalidArgumentError(
                                "object keys must be disjoint".to_string(),
                            ));
                        }

                        // union the two objects together

                        let mut variant_builder = VariantBuilder::new();
                        let mut object_builder = variant_builder.new_object();

                        for (f, v) in shredded_object.iter() {
                            object_builder.insert(f, v);
                        }

                        for (f, v) in partially_shredded_object.iter() {
                            object_builder.insert(f, v);
                        }

                        object_builder.finish()?;

                        return Ok(variant_builder.finish());
                    }

                    Ok((typed_metadata.to_vec(), typed_value.to_vec()))
                }
                Variant::List(_) => {
                    if value.is_some() {
                        return Err(ArrowError::InvalidArgumentError(
                            "shredded array must not conflict with variant value".to_string(),
                        ));
                    }

                    return Ok((typed_metadata.to_vec(), typed_value.to_vec()));
                }
            }
        }
        None => match value {
            Some(value) => Ok((metadata.to_vec(), value.to_vec())),
            None => Err(ArrowError::InvalidArgumentError(
                "No value or typed value provided".to_string(),
            )),
        },
    }
}

If we strongly type typed_value to whatever the variant type is, the primitive variants work nicely, since we'd be storing a PrimitiveArray<T> and we can easily index into it.

For complex variants like objects/lists, I'm having a hard time figuring out how to strongly type these values. Maybe for these cases we should just store the binary encoded version of a Variant? (metadata, value).

Copy link
Contributor Author

@scovich scovich Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we plan on storing typed_values? Do we plan on encoding it as a Variant and storing the binary representation of it?

As I understand it, we don't need to store typed_value in the first place.

This is low-level variant code here, and a given variant builder only knows about the unshredded bits (it's trying to build up a value column). Whoever manages the builder will be in charge of the shredded bits, and is also in charge of providing an appropriate metadata "builder" for the metadata column. I think @alamb is proposing in this specific case to convert the typed_value into a proper variant, but then "merge" it into the existing variant using a view, instead of copying (some of my thoughts on that below).

Aside: I think there are actually three kinds of metadata builders

  • Today's traditional MetadataBuilder (renamed here as DefaultMetadataBuilder)
    • Used to build up a top-level variant. Every newly encountered path is an insertion into that metadata dictionary
  • A metadata builder used only for unshredding a typed_value column (ReadOnlyMetadataBuilder here)
  • A "borrowed" parent metadata builder (used during shredding operations, not present in this PR)

What if we made a view into the variant so we didn't have to copy anything to read the Variant?

Interesting. Trying to wrap my head around this idea, so what follows is likely a bit rambly:

  • What do we gain by using views at this low level?
    • At the high level, the VariantArray just stores an actual (strongly-typed) typed_value columns. It doesn't need to unshred at all, unless/until somebody executes a variant_get kernel with a path that ends inside the typed_value column.
      • In that case, we have no choice but to unshred that path, which is a plain old ordinary variant building exercise because the spec forbids both value and typed_value to contain the same path -- except the metadata column is already fully built and we can just reuse it as ReadOnlyMetadata.
    • If somebody does a variant_get on any path that ends inside the value column, we just extract that variant as normal (the shredded bits provably don't matter).
    • If somebody actually asks to unshred the entire top-level column (e.g. invokes unshred_variant in preparation for a write), the result is a simple struct-of-binary columns. No (low level) Variant anywhere in sight -- they're an internal detail of that kernel.
      • If we're anyway forced to write out normal variant bytes, where/how does the view help along the way?
  • Nesting will be "fun"
    • In the example from my comment on the other PR (linked above), one could have a (horrible) nested scenario like: v.typed_value.a.b.c.w.typed_value.x.y.z.u: VARIANT. If we tried to unshred that using views, I'm pretty sure it would quickly fall apart, because Variant is ultimately already a view on top of the (one) underlying &[u8] it was built from. So there's no way a field of a Variant::ShreddedObject could itself contain other Variant::ShreddedObject.
  • Intuitive claim (unsubstantiated):
    • If we could get away with "views" here on the unshredding path, why can't the native builder get away with "views" internally to avoid all the repeated copying of bytes as we unwind nested builders?
    • But every time I've tried to get fancy with copy-free nested builders, it gets ugly fast.
    • So (again, intuitively), I would conclude that any attempt at views here would also get ugly fast.
    • Or, if it turns out we can elegantly solve it here, let's be sure to apply the solution there!
  • Isolated reaction: I would favor Variant::ShreddedObject as its own enum variant, rather than making Variant::Object more complex. They're pretty fundamentally different cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and then the VariantArray's variant builder accepts the resulting Variant::ShreddedObject and actually turns it to bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... at which point the ReadOnlyMetadataBuilder becomes useful, because that final operation shouldn't allocate any new field ids.

Copy link
Contributor Author

@scovich scovich Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: I think the Variant::ShreddedObject is mostly (only?) useful on the read (unshredding) path; a writer who wants to produce shredded variant almost certainly doesn't want to work with this API because it's ultimately just a collection of variant values that were probably strongly typed at first and now have to pay a round trip through a Variant? Shredding would be able to consume these objects, e.g. if re-shredding something that was previously shredded, but it's not the preferred approach for turning e.g. a StructArray directly into VariantArray.

Copy link
Contributor Author

@scovich scovich Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second observation: How helpful is it, to have the Variant::ShreddedObject::value member? Should we just extract its fields directly into the map instead? Then we just have to sort the map and write them out again. We might also want to have the map key off field ids instead of &str, using lookups into the top-level VariantMetadata object. That way, the final to-bytes transformation doesn't need a metadata builder at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: I think the Variant::ShreddedObject is mostly (only?) useful on the read (unshredding) path; a writer who wants to produce shredded variant almost certainly doesn't want to work with this API because it's ultimately just a collection of variant values that were probably strongly typed at first and now have to pay a round trip through a Variant?

Yes I agree this would be mostly for read (rather than on the write path)

Shredding would be able to consume these objects, e.g. if re-shredding something that was previously shredded, but it's not the preferred approach for turning e.g. a StructArray directly into VariantArray.

Right

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants