12 changes: 12 additions & 0 deletions parquet-variant-compute/src/variant_get.rs
@@ -138,6 +138,7 @@ fn shredded_get_path(
as_type,
cast_options,
target.len(),
target.inner().get_buffer_memory_size(),
)?;
for i in 0..target.len() {
if target.is_null(i) {
@@ -763,6 +764,13 @@ mod test {
BooleanArray::from(vec![Some(true), Some(false), Some(true)])
);

perfectly_shredded_to_arrow_primitive_test!(
get_variant_perfectly_shredded_utf8_as_utf8,
DataType::Utf8,
Member commented:

Do we need to add tests for other types (LargeUtf8/Utf8View) here?

The test here wants to cover the variant_get logic, and the tests added in variant_to_arrow.rs were to cover the logic of the builder?

Contributor Author commented:

Shredding is not supported for `LargeUtf8`/`Utf8View` per the specification.

I originally added the tests for them inside variant_get but got the error saying these types do not support shredding.

Contributor commented:

That would be from the VariantArray constructor, which invokes this code:

```rust
fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
      ...
    let new_data_type = match data_type {
          ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }
```

I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:

Shredded values must use the following Parquet types:

| Variant Type | Parquet Physical Type | Parquet Logical Type |
|---|---|---|
| ... | | |
| binary | BINARY | |
| string | BINARY | STRING |
| ... | | |
| array | GROUP; see Arrays below | LIST |

But I'm pretty sure that doesn't need to constrain the use of DataType::Utf8 vs. DataType::LargeUtf8 vs. DataType::Utf8View? (similar story for the various in-memory layouts of lists and binary values)

A similar dilemma is that the metadata column is supposed to be parquet BINARY type, but arrow-parquet produces BinaryViewArray by default. Right now we replace DataType::Binary with DataType::BinaryView and force a cast as needed.

If we think the shredding spec forbids LargeUtf8 or Utf8View then we probably need to cast binary views back to normal binary as well.

If we don't think the shredding spec forbids those types, then we should probably support metadata: LargeBinaryArray (though the narrowing cast to BinaryArray might fail if the offsets really don't fit in 32 bits).

@alamb @cashmand, any advice here?
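The narrowing concern above can be sketched without arrow at all: a LargeBinary-to-Binary conversion is only safe when the 64-bit offsets all fit in `i32`, and since offsets are monotonically non-decreasing it suffices to check the last one. This helper is purely illustrative; the name and signature are assumptions, not code from this PR:

```rust
/// Returns true if every 64-bit offset would survive a narrowing cast
/// to i32. Offsets are non-decreasing, so checking the last suffices.
fn offsets_fit_i32(offsets: &[i64]) -> bool {
    offsets.last().map_or(true, |&last| last <= i32::MAX as i64)
}

fn main() {
    // Small offsets narrow safely; anything past i32::MAX does not.
    assert!(offsets_fit_i32(&[0, 3, 9]));
    assert!(!offsets_fit_i32(&[0, i32::MAX as i64 + 1]));
    println!("ok");
}
```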

perfectly_shredded_utf8_variant_array,
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
);

macro_rules! perfectly_shredded_variant_array_fn {
($func:ident, $typed_value_gen:expr) => {
fn $func() -> ArrayRef {
@@ -786,6 +794,10 @@ mod test {
};
}

perfectly_shredded_variant_array_fn!(perfectly_shredded_utf8_variant_array, || {
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
});

perfectly_shredded_variant_array_fn!(perfectly_shredded_bool_variant_array, || {
BooleanArray::from(vec![Some(true), Some(false), Some(true)])
});
186 changes: 184 additions & 2 deletions parquet-variant-compute/src/variant_to_arrow.rs
@@ -16,7 +16,8 @@
// under the License.

use arrow::array::{
ArrayRef, BinaryViewArray, NullBufferBuilder, PrimitiveBuilder, builder::BooleanBuilder,
ArrayRef, BinaryViewArray, LargeStringBuilder, NullBufferBuilder, PrimitiveBuilder,
StringBuilder, StringViewBuilder, builder::BooleanBuilder,
};
use arrow::compute::CastOptions;
use arrow::datatypes::{self, ArrowPrimitiveType, DataType};
Expand Down Expand Up @@ -52,6 +53,12 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
StringView(VariantToUtf8ViewArrowBuilder<'a>),
Member commented:

We added `StringView` to `PrimitiveVariantToArrowRowBuilder` but the other two to `StringVariantToArrowRowBuilder`; is there a particular reason for this?

Contributor Author commented:

Allocating memory for primitive builders only requires a capacity field for the number of items to pre-allocate.

For Utf8/LargeUtf8 builders, a second field is required in addition to capacity: data_capacity, the total number of UTF-8 bytes to pre-allocate.

Contributor commented:

I don't see any meaningful call sites that pass a data capacity -- only some unit tests.

Ultimately, variant_get will call make_variant_to_arrow_row_builder, and I don't think that code has any way to predict what the correct data capacity might be? How could one even define "correct" when a single value would be applied to each of potentially many string row builders that will be created, when each of those builders could see a completely different distribution of string sizes and null values?

This is very different from the row capacity value, which IS precisely known and applies equally to all builders variant_get might need to create.

Also -- these capacities are just pre-allocation hints; passing too large a hint temporarily wastes a bit of memory, and passing too small a hint just means one or more internal reallocations.

I would vote to just choose a reasonable default "average string size" and multiply that by the row count to obtain a data capacity hint when needed.

TBD whether that average string size should be a parameter that originates with the caller of variant_get and gets plumbed all the way through -- but that seems like a really invasive API change for very little benefit. Seems like a simple const would be much better.
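A minimal sketch of the "simple const" idea suggested above; the constant name, its value, and the helper function are assumptions for illustration, not code from this PR:

```rust
// Hypothetical default average string size. This is only a
// pre-allocation hint: too large wastes a little memory temporarily,
// too small just means one or more internal reallocations.
const DEFAULT_AVG_STRING_LEN: usize = 16;

/// Derive a data-capacity hint from the precisely known row capacity.
fn data_capacity_hint(item_capacity: usize) -> usize {
    item_capacity.saturating_mul(DEFAULT_AVG_STRING_LEN)
}

fn main() {
    // For 1024 rows, hint at 16 KiB of string data.
    println!("{}", data_capacity_hint(1024)); // prints 16384
}
```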

Contributor commented:

A big benefit of the simpler approach to data capacity: All the string builders are, in fact, primitive builders (see the macro invocations below) -- so we can just add three new enum variants to the primitive row builder enum and call it done.

}

pub(crate) enum StringVariantToArrowRowBuilder<'a> {
Utf8(VariantToUtf8ArrowRowBuilder<'a>),
LargeUtf8(VariantToLargeUtf8ArrowBuilder<'a>),
}

/// Builder for converting variant values into strongly typed Arrow arrays.
@@ -61,7 +68,7 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
pub(crate) enum VariantToArrowRowBuilder<'a> {
Primitive(PrimitiveVariantToArrowRowBuilder<'a>),
BinaryVariant(VariantToBinaryVariantArrowRowBuilder),

String(StringVariantToArrowRowBuilder<'a>),
// Path extraction wrapper - contains a boxed enum for any of the above
WithPath(VariantPathRowBuilder<'a>),
}
@@ -87,6 +94,7 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(b) => b.append_null(),
TimestampNanoNtz(b) => b.append_null(),
Date(b) => b.append_null(),
StringView(b) => b.append_null(),
}
}

@@ -110,6 +118,7 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(b) => b.append_value(value),
TimestampNanoNtz(b) => b.append_value(value),
Date(b) => b.append_value(value),
StringView(b) => b.append_value(value),
}
}

@@ -133,6 +142,33 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(b) => b.finish(),
TimestampNanoNtz(b) => b.finish(),
Date(b) => b.finish(),
StringView(b) => b.finish(),
}
}
}

impl<'a> StringVariantToArrowRowBuilder<'a> {
pub fn append_null(&mut self) -> Result<()> {
use StringVariantToArrowRowBuilder::*;
match self {
Utf8(b) => b.append_null(),
LargeUtf8(b) => b.append_null(),
}
}
Comment on lines +150 to +157
Contributor commented:

I don't see any string-specific logic that would merit a nested enum like this?
Can we make this builder generic and use it in two new variants of the top-level enum?


pub fn append_value(&mut self, value: &Variant<'_, '_>) -> Result<bool> {
use StringVariantToArrowRowBuilder::*;
match self {
Utf8(b) => b.append_value(value),
LargeUtf8(b) => b.append_value(value),
}
}

pub fn finish(self) -> Result<ArrayRef> {
use StringVariantToArrowRowBuilder::*;
match self {
Utf8(b) => b.finish(),
LargeUtf8(b) => b.finish(),
}
}
}
@@ -142,6 +178,7 @@ impl<'a> VariantToArrowRowBuilder<'a> {
use VariantToArrowRowBuilder::*;
match self {
Primitive(b) => b.append_null(),
String(b) => b.append_null(),
BinaryVariant(b) => b.append_null(),
WithPath(path_builder) => path_builder.append_null(),
}
@@ -151,6 +188,7 @@ impl<'a> VariantToArrowRowBuilder<'a> {
use VariantToArrowRowBuilder::*;
match self {
Primitive(b) => b.append_value(&value),
String(b) => b.append_value(&value),
BinaryVariant(b) => b.append_value(value),
WithPath(path_builder) => path_builder.append_value(value),
}
@@ -160,6 +198,7 @@ impl<'a> VariantToArrowRowBuilder<'a> {
use VariantToArrowRowBuilder::*;
match self {
Primitive(b) => b.finish(),
String(b) => b.finish(),
BinaryVariant(b) => b.finish(),
WithPath(path_builder) => path_builder.finish(),
}
@@ -236,6 +275,9 @@ pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>(
cast_options,
capacity,
)),
DataType::Utf8View => {
StringView(VariantToUtf8ViewArrowBuilder::new(cast_options, capacity))
}
_ if data_type.is_primitive() => {
return Err(ArrowError::NotYetImplemented(format!(
"Primitive data_type {data_type:?} not yet implemented"
@@ -250,12 +292,41 @@
Ok(builder)
}

pub(crate) fn make_string_variant_to_arrow_row_builder<'a>(
data_type: &'a DataType,
cast_options: &'a CastOptions,
item_capacity: usize,
data_capacity: usize,
) -> Result<StringVariantToArrowRowBuilder<'a>> {
use StringVariantToArrowRowBuilder::{LargeUtf8, Utf8};

let builder = match data_type {
DataType::Utf8 => Utf8(VariantToUtf8ArrowRowBuilder::new(
cast_options,
item_capacity,
data_capacity,
)),
DataType::LargeUtf8 => LargeUtf8(VariantToLargeUtf8ArrowBuilder::new(
cast_options,
item_capacity,
data_capacity,
)),
_ => {
return Err(ArrowError::InvalidArgumentError(format!(
"Not a string type: {data_type:?}"
)));
}
};
Ok(builder)
}

pub(crate) fn make_variant_to_arrow_row_builder<'a>(
metadata: &BinaryViewArray,
path: VariantPath<'a>,
data_type: Option<&'a DataType>,
cast_options: &'a CastOptions,
capacity: usize,
data_capacity: usize,
) -> Result<VariantToArrowRowBuilder<'a>> {
use VariantToArrowRowBuilder::*;

@@ -265,6 +336,15 @@ pub(crate) fn make_variant_to_arrow_row_builder<'a>(
metadata.clone(),
capacity,
)),
Some(dt @ (DataType::Utf8 | DataType::LargeUtf8)) => {
let builder = make_string_variant_to_arrow_row_builder(
dt,
cast_options,
capacity,
data_capacity,
)?;
String(builder)
}
Some(DataType::Struct(_)) => {
return Err(ArrowError::NotYetImplemented(
"Converting unshredded variant objects to arrow structs".to_string(),
@@ -401,6 +481,27 @@ macro_rules! define_variant_to_primitive_builder {
}
}

define_variant_to_primitive_builder!(
struct VariantToUtf8ArrowRowBuilder<'a>
|item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
|value| value.as_string(),
type_name: "Utf8"
);

define_variant_to_primitive_builder!(
struct VariantToLargeUtf8ArrowBuilder<'a>
|item_capacity, data_capacity: usize| -> LargeStringBuilder {LargeStringBuilder::with_capacity(item_capacity, data_capacity)},
|value| value.as_string(),
type_name: "LargeUtf8"
);

define_variant_to_primitive_builder!(
struct VariantToUtf8ViewArrowBuilder<'a>
|capacity| -> StringViewBuilder {StringViewBuilder::with_capacity(capacity)},
|value| value.as_string(),
type_name: "Utf8View"
);
Comment on lines +484 to +503
Contributor commented:

Check out the ListLikeArray trait in arrow_to_variant.rs -- I suspect a StringLikeArrayBuilder trait would be very helpful here, because the "shape" of all the string arrays is very similar even tho the specific array types and possibly some method names might differ).

Suggested change
define_variant_to_primitive_builder!(
struct VariantToUtf8ArrowRowBuilder<'a>
|item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
|value| value.as_string(),
type_name: "Utf8"
);
define_variant_to_primitive_builder!(
struct VariantToLargeUtf8ArrowBuilder<'a>
|item_capacity, data_capacity: usize| -> LargeStringBuilder {LargeStringBuilder::with_capacity(item_capacity, data_capacity)},
|value| value.as_string(),
type_name: "LargeUtf8"
);
define_variant_to_primitive_builder!(
struct VariantToUtf8ViewArrowBuilder<'a>
|capacity| -> StringViewBuilder {StringViewBuilder::with_capacity(capacity)},
|value| value.as_string(),
type_name: "Utf8View"
);
define_variant_to_primitive_builder!(
struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
|capacity| -> B { B::with_capacity(capacity) },
|value| value.as_string(),
type_name: B::type_name()
);

where

```rust
trait StringLikeArrayBuilder: ArrayBuilder {
    fn type_name() -> &'static str;
    fn with_capacity(capacity: usize) -> Self;
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
}

impl StringLikeArrayBuilder for StringViewBuilder {
      ...
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
      ...
    fn with_capacity(capacity: usize) -> Self {
        Self::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH)
    }
      ...
}
```

As noted in a different comment, we don't have any meaningful way to predict the needed data_capacity of a string builder, so IMO <GenericStringBuilder as StringLikeArrayBuilder>::with_capacity should just scale the item capacity by some reasonable guess at the average string size, and call that the data capacity.


define_variant_to_primitive_builder!(
struct VariantToBooleanArrowRowBuilder<'a>
|capacity| -> BooleanBuilder { BooleanBuilder::with_capacity(capacity) },
@@ -472,3 +573,84 @@ impl VariantToBinaryVariantArrowRowBuilder {
Ok(ArrayRef::from(variant_array))
}
}

#[cfg(test)]
mod test {
use crate::variant_to_arrow::make_variant_to_arrow_row_builder;
use arrow::array::{Array, BinaryViewArray, LargeStringArray, StringViewArray};
use arrow::compute::CastOptions;
use arrow::datatypes::DataType;
use parquet_variant::EMPTY_VARIANT_METADATA_BYTES;
use parquet_variant::{Variant, VariantPath};
use std::sync::Arc;

#[test]
fn test_large_utf8_variant_to_large_utf8_arrow() {
let metadata =
BinaryViewArray::from_iter_values(std::iter::repeat_n(EMPTY_VARIANT_METADATA_BYTES, 3));
let cast_options = CastOptions::default();
let path = VariantPath::default();
let capacity = 3;
let data_capacity = 9usize;
let data_type = Some(&DataType::LargeUtf8);

let mut builder = make_variant_to_arrow_row_builder(
&metadata,
path,
data_type,
&cast_options,
capacity,
data_capacity,
)
.unwrap();

builder.append_value(Variant::from("foo")).unwrap();
builder.append_value(Variant::from("bar")).unwrap();
builder.append_value(Variant::from("baz")).unwrap();

let output = builder.finish().unwrap();

let expected = Arc::new(LargeStringArray::from(vec![
Some("foo"),
Some("bar"),
Some("baz"),
]));

assert_eq!(&output.to_data(), &expected.to_data());
}

#[test]
fn test_utf8_view_variant_to_utf8_view_arrow() {
let metadata =
BinaryViewArray::from_iter_values(std::iter::repeat_n(EMPTY_VARIANT_METADATA_BYTES, 3));
let cast_options = CastOptions::default();
let path = VariantPath::default();
let capacity = 3;
let data_capacity = 9usize;
let data_type = Some(&DataType::Utf8View);

let mut builder = make_variant_to_arrow_row_builder(
&metadata,
path,
data_type,
&cast_options,
capacity,
data_capacity,
)
.unwrap();

builder.append_value(Variant::from("foo")).unwrap();
builder.append_value(Variant::from("bar")).unwrap();
builder.append_value(Variant::from("baz")).unwrap();

let output = builder.finish().unwrap();

let expected = Arc::new(StringViewArray::from(vec![
Some("foo"),
Some("bar"),
Some("baz"),
]));

assert_eq!(&output.to_data(), &expected.to_data());
}
}