Variant to arrow utf8 #8600
Conversation
```rust
perfectly_shredded_to_arrow_primitive_test!(
    get_variant_perfectly_shredded_utf8_as_utf8,
    DataType::Utf8,
```
Do we need to add tests for other types (LargeUtf8/Utf8View) here? The test here is meant to cover the variant_get logic, and the tests added in variant_to_arrow.rs were to cover the logic of the builder?
Shredding is not supported for LargeUtf8/Utf8View per the specification. I originally added the tests for them inside variant_get but got an error saying these types do not support shredding.
That would be from the VariantArray constructor, which invokes this code:

```rust
fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
    ...
    let new_data_type = match data_type {
        ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }
```
I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:

> Shredded values must use the following Parquet types:
>
> | Variant Type | Parquet Physical Type | Parquet Logical Type |
> |---|---|---|
> | ... | | |
> | binary | BINARY | |
> | string | BINARY | STRING |
> | ... | | |
> | array | GROUP; see Arrays below | LIST |

But I'm pretty sure that doesn't need to constrain the use of DataType::Utf8 vs. DataType::LargeUtf8 vs. DataType::Utf8View? (Similar story for the various in-memory layouts of lists and binary values.)
A similar dilemma is that the metadata column is supposed to be the parquet BINARY type, but arrow-parquet produces BinaryViewArray by default. Right now we replace DataType::Binary with DataType::BinaryView and force a cast as needed.

If we think the shredding spec forbids LargeUtf8 or Utf8View, then we probably need to cast binary views back to normal binary as well. If we don't think the shredding spec forbids those types, then we should probably support metadata: LargeBinaryArray (tho the narrowing cast to BinaryArray might fail if the offsets really don't fit in 32 bits).
```rust
TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
StringView(VariantToUtf8ViewArrowBuilder<'a>),
```
We added StringView to the PrimitiveVariantToArrowRowBuilder and the other two to StringVariantToArrowRowBuilder; is there a particular reason for this?
Allocating memory for primitive builders only requires a capacity field for the number of items to pre-allocate. For Utf8/LargeUtf8 builders, capacity and another field are required: data_capacity, the total number of (utf8) bytes to allocate.
I don't see any meaningful call sites that pass a data capacity -- only some unit tests.

Ultimately, variant_get will call make_variant_to_arrow_row_builder, and I don't think that code has any way to predict what the correct data capacity might be? How could one even define "correct" when a single value would be applied to each of potentially many string row builders that will be created, when each of those builders could see a completely different distribution of string sizes and null values?

This is very different from the row capacity value, which IS precisely known and applies equally to all builders variant_get might need to create.

Also -- these capacities are just pre-allocation hints; passing too large a hint temporarily wastes a bit of memory, and passing too small a hint just means one or more internal reallocations.

I would vote to just choose a reasonable default "average string size" and multiply that by the row count to obtain a data capacity hint when needed. TBD whether that average string size should be a parameter that originates with the caller of variant_get and gets plumbed all the way through -- but that seems like a really invasive API change for very little benefit. Seems like a simple const would be much better.
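The suggested default could look like the following sketch. The names AVERAGE_STRING_LENGTH and data_capacity_hint are illustrative assumptions, not the PR's actual API; the point is just that the known row count is scaled by a fixed guess.

```rust
// Assumed default average string size; any reasonable small constant works,
// since the result is only a pre-allocation hint, not a correctness bound.
const AVERAGE_STRING_LENGTH: usize = 16;

// Hypothetical helper: derive a data-capacity hint from the known row count.
fn data_capacity_hint(row_capacity: usize) -> usize {
    row_capacity.saturating_mul(AVERAGE_STRING_LENGTH)
}

fn main() {
    // Too-large or too-small hints only affect allocation, never correctness.
    println!("{}", data_capacity_hint(1024)); // 16384
}
```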
A big benefit of the simpler approach to data capacity: All the string builders are, in fact, primitive builders (see the macro invocations below) -- so we can just add three new enum variants to the primitive row builder enum and call it done.
Co-authored-by: Congxian Qiu <qcx978132955@gmail.com>
Thanks for tackling this! It seems to uncover a couple of issues that might need some guidance from experts; see comments.
```rust
impl<'a> StringVariantToArrowRowBuilder<'a> {
    pub fn append_null(&mut self) -> Result<()> {
        use StringVariantToArrowRowBuilder::*;
        match self {
            Utf8(b) => b.append_null(),
            LargeUtf8(b) => b.append_null(),
        }
    }
```
I don't see any string-specific logic that would merit a nested enum like this?
Can we make this builder generic and use it in two new variants of the top-level enum?
```rust
define_variant_to_primitive_builder!(
    struct VariantToUtf8ArrowRowBuilder<'a>
    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
    |value| value.as_string(),
    type_name: "Utf8"
);

define_variant_to_primitive_builder!(
    struct VariantToLargeUtf8ArrowBuilder<'a>
    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
    |value| value.as_string(),
    type_name: "LargeUtf8"
);

define_variant_to_primitive_builder!(
    struct VariantToUtf8ViewArrowBuilder<'a>
    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
    |value| value.as_string(),
    type_name: "Utf8View"
);
```
Check out the ListLikeArray trait in arrow_to_variant.rs -- I suspect a StringLikeArrayBuilder trait would be very helpful here, because the "shape" of all the string arrays is very similar (even tho the specific array types and possibly some method names might differ).
Suggested change:

```diff
-define_variant_to_primitive_builder!(
-    struct VariantToUtf8ArrowRowBuilder<'a>
-    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
-    |value| value.as_string(),
-    type_name: "Utf8"
-);
-define_variant_to_primitive_builder!(
-    struct VariantToLargeUtf8ArrowBuilder<'a>
-    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
-    |value| value.as_string(),
-    type_name: "LargeUtf8"
-);
-define_variant_to_primitive_builder!(
-    struct VariantToUtf8ViewArrowBuilder<'a>
-    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
-    |value| value.as_string(),
-    type_name: "Utf8View"
-);
+define_variant_to_primitive_builder!(
+    struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
+    |capacity| -> B { B::with_capacity(capacity) },
+    |value| value.as_string(),
+    type_name: B::type_name()
+);
```
where

```rust
trait StringLikeArrayBuilder: ArrayBuilder {
    fn type_name() -> &'static str;
    fn with_capacity(capacity: usize) -> Self;
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
}

impl StringLikeArrayBuilder for StringViewBuilder {
    ...
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
    ...
    fn with_capacity(capacity: usize) -> Self {
        Self::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH)
    }
    ...
}
```
As noted in a different comment, we don't have any meaningful way to predict the needed data_capacity of a string builder, so IMO <GenericStringBuilder as StringLikeArrayBuilder>::with_capacity should just scale the item capacity by some reasonable guess at the average string size, and call that the data capacity.
Which issue does this PR close?
Rationale for this change

Add support for converting Variant to Utf8, LargeUtf8, and Utf8View. This needs a new builder, VariantToStringArrowRowBuilder, because LargeUtf8 and Utf8View are not ArrowPrimitiveTypes.
What changes are included in this PR?

- Added a data_capacity parameter to make_string_variant_to_arrow_row_builder to support string types.
- Updated the call to make_string_variant_to_arrow_row_builder in variant_get to include the variable.

Are these changes tested?
Added a variant_get test for the Utf8 type and created two separate tests for LargeUtf8 and Utf8View, because these types can't be shredded.
Are there any user-facing changes?
No