-
Notifications
You must be signed in to change notification settings - Fork 1k
Variant to arrow utf8 #8600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Variant to arrow utf8 #8600
Changes from 3 commits
f6e88ef
61ed178
1fb612d
2b6d280
398b52d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,7 +16,8 @@ | |
// under the License. | ||
|
||
use arrow::array::{ | ||
ArrayRef, BinaryViewArray, NullBufferBuilder, PrimitiveBuilder, builder::BooleanBuilder, | ||
ArrayRef, BinaryViewArray, LargeStringBuilder, NullBufferBuilder, PrimitiveBuilder, | ||
StringBuilder, StringViewBuilder, builder::BooleanBuilder, | ||
}; | ||
use arrow::compute::CastOptions; | ||
use arrow::datatypes::{self, ArrowPrimitiveType, DataType}; | ||
|
@@ -52,6 +53,12 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>), | ||
TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>), | ||
Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>), | ||
StringView(VariantToUtf8ViewArrowBuilder<'a>), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We added There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Allocating memory for primitive builders only requires a For There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see any meaningful call sites that pass a data capacity -- only some unit tests. Ultimately, This is very different from the row capacity value, which IS precisely known and applies equally to all builders Also -- these capacities are just pre-allocation hints; passing too large a hint temporarily wastes a bit of memory, and passing too small a hint just means one or more internal reallocations. I would vote to just choose a reasonable default "average string size" and multiply that by the row count to obtain a data capacity hint when needed. TBD whether that average string size should be a parameter that originates with the caller of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A big benefit of the simpler approach to data capacity: All the string builders are, in fact, primitive builders (see the macro invocations below) -- so we can just add three new enum variants to the primitive row builder enum and call it done. |
||
} | ||
|
||
pub(crate) enum StringVariantToArrowRowBuilder<'a> { | ||
Utf8(VariantToUtf8ArrowRowBuilder<'a>), | ||
LargeUtf8(VariantToLargeUtf8ArrowBuilder<'a>), | ||
} | ||
|
||
/// Builder for converting variant values into strongly typed Arrow arrays. | ||
|
@@ -61,7 +68,7 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> { | |
pub(crate) enum VariantToArrowRowBuilder<'a> { | ||
Primitive(PrimitiveVariantToArrowRowBuilder<'a>), | ||
BinaryVariant(VariantToBinaryVariantArrowRowBuilder), | ||
|
||
String(StringVariantToArrowRowBuilder<'a>), | ||
// Path extraction wrapper - contains a boxed enum for any of the above | ||
WithPath(VariantPathRowBuilder<'a>), | ||
} | ||
|
@@ -87,6 +94,7 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(b) => b.append_null(), | ||
TimestampNanoNtz(b) => b.append_null(), | ||
Date(b) => b.append_null(), | ||
StringView(b) => b.append_null(), | ||
} | ||
} | ||
|
||
|
@@ -110,6 +118,7 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(b) => b.append_value(value), | ||
TimestampNanoNtz(b) => b.append_value(value), | ||
Date(b) => b.append_value(value), | ||
StringView(b) => b.append_value(value), | ||
} | ||
} | ||
|
||
|
@@ -133,6 +142,33 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(b) => b.finish(), | ||
TimestampNanoNtz(b) => b.finish(), | ||
Date(b) => b.finish(), | ||
StringView(b) => b.finish(), | ||
} | ||
} | ||
} | ||
|
||
impl<'a> StringVariantToArrowRowBuilder<'a> { | ||
pub fn append_null(&mut self) -> Result<()> { | ||
use StringVariantToArrowRowBuilder::*; | ||
match self { | ||
Utf8(b) => b.append_null(), | ||
LargeUtf8(b) => b.append_null(), | ||
} | ||
} | ||
Comment on lines
+150
to
+157
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see any string-specific logic that would merit a nested enum like this? |
||
|
||
pub fn append_value(&mut self, value: &Variant<'_, '_>) -> Result<bool> { | ||
use StringVariantToArrowRowBuilder::*; | ||
match self { | ||
Utf8(b) => b.append_value(value), | ||
LargeUtf8(b) => b.append_value(value), | ||
} | ||
} | ||
|
||
pub fn finish(self) -> Result<ArrayRef> { | ||
use StringVariantToArrowRowBuilder::*; | ||
match self { | ||
Utf8(b) => b.finish(), | ||
LargeUtf8(b) => b.finish(), | ||
} | ||
} | ||
} | ||
|
@@ -142,6 +178,7 @@ impl<'a> VariantToArrowRowBuilder<'a> { | |
use VariantToArrowRowBuilder::*; | ||
match self { | ||
Primitive(b) => b.append_null(), | ||
String(b) => b.append_null(), | ||
BinaryVariant(b) => b.append_null(), | ||
WithPath(path_builder) => path_builder.append_null(), | ||
} | ||
|
@@ -151,6 +188,7 @@ impl<'a> VariantToArrowRowBuilder<'a> { | |
use VariantToArrowRowBuilder::*; | ||
match self { | ||
Primitive(b) => b.append_value(&value), | ||
String(b) => b.append_value(&value), | ||
BinaryVariant(b) => b.append_value(value), | ||
WithPath(path_builder) => path_builder.append_value(value), | ||
} | ||
|
@@ -160,6 +198,7 @@ impl<'a> VariantToArrowRowBuilder<'a> { | |
use VariantToArrowRowBuilder::*; | ||
match self { | ||
Primitive(b) => b.finish(), | ||
String(b) => b.finish(), | ||
BinaryVariant(b) => b.finish(), | ||
WithPath(path_builder) => path_builder.finish(), | ||
} | ||
|
@@ -236,6 +275,9 @@ pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>( | |
cast_options, | ||
capacity, | ||
)), | ||
DataType::Utf8View => { | ||
StringView(VariantToUtf8ViewArrowBuilder::new(cast_options, capacity)) | ||
} | ||
_ if data_type.is_primitive() => { | ||
return Err(ArrowError::NotYetImplemented(format!( | ||
"Primitive data_type {data_type:?} not yet implemented" | ||
|
@@ -250,12 +292,41 @@ pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>( | |
Ok(builder) | ||
} | ||
|
||
pub(crate) fn make_string_variant_to_arrow_row_builder<'a>( | ||
data_type: &'a DataType, | ||
cast_options: &'a CastOptions, | ||
item_capacity: usize, | ||
data_capacity: usize, | ||
) -> Result<StringVariantToArrowRowBuilder<'a>> { | ||
use StringVariantToArrowRowBuilder::{LargeUtf8, Utf8}; | ||
|
||
let builder = match data_type { | ||
DataType::Utf8 => Utf8(VariantToUtf8ArrowRowBuilder::new( | ||
cast_options, | ||
item_capacity, | ||
data_capacity, | ||
)), | ||
DataType::LargeUtf8 => LargeUtf8(VariantToLargeUtf8ArrowBuilder::new( | ||
cast_options, | ||
item_capacity, | ||
data_capacity, | ||
)), | ||
_ => { | ||
return Err(ArrowError::InvalidArgumentError(format!( | ||
"Not a string type: {data_type:?}" | ||
))); | ||
} | ||
}; | ||
Ok(builder) | ||
} | ||
|
||
pub(crate) fn make_variant_to_arrow_row_builder<'a>( | ||
metadata: &BinaryViewArray, | ||
path: VariantPath<'a>, | ||
data_type: Option<&'a DataType>, | ||
cast_options: &'a CastOptions, | ||
capacity: usize, | ||
data_capacity: usize, | ||
) -> Result<VariantToArrowRowBuilder<'a>> { | ||
use VariantToArrowRowBuilder::*; | ||
|
||
|
@@ -265,6 +336,15 @@ pub(crate) fn make_variant_to_arrow_row_builder<'a>( | |
metadata.clone(), | ||
capacity, | ||
)), | ||
Some(dt @ (DataType::Utf8 | DataType::LargeUtf8)) => { | ||
let builder = make_string_variant_to_arrow_row_builder( | ||
dt, | ||
cast_options, | ||
capacity, | ||
data_capacity, | ||
)?; | ||
String(builder) | ||
} | ||
Some(DataType::Struct(_)) => { | ||
return Err(ArrowError::NotYetImplemented( | ||
"Converting unshredded variant objects to arrow structs".to_string(), | ||
|
@@ -401,6 +481,27 @@ macro_rules! define_variant_to_primitive_builder { | |
} | ||
} | ||
|
||
define_variant_to_primitive_builder!( | ||
struct VariantToUtf8ArrowRowBuilder<'a> | ||
|item_capacity, data_capacity: usize| -> StringBuilder {StringBuilder::with_capacity(item_capacity, data_capacity)}, | ||
sdf-jkl marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|value| value.as_string(), | ||
type_name: "Utf8" | ||
); | ||
|
||
define_variant_to_primitive_builder!( | ||
struct VariantToLargeUtf8ArrowBuilder<'a> | ||
|item_capacity, data_capacity: usize| -> LargeStringBuilder {LargeStringBuilder::with_capacity(item_capacity, data_capacity)}, | ||
|value| value.as_string(), | ||
type_name: "LargeUtf8" | ||
); | ||
|
||
define_variant_to_primitive_builder!( | ||
struct VariantToUtf8ViewArrowBuilder<'a> | ||
|capacity| -> StringViewBuilder {StringViewBuilder::with_capacity(capacity)}, | ||
|value| value.as_string(), | ||
type_name: "Utf8View" | ||
); | ||
|
||
define_variant_to_primitive_builder!( | ||
struct VariantToBooleanArrowRowBuilder<'a> | ||
|capacity| -> BooleanBuilder { BooleanBuilder::with_capacity(capacity) }, | ||
|
@@ -472,3 +573,84 @@ impl VariantToBinaryVariantArrowRowBuilder { | |
Ok(ArrayRef::from(variant_array)) | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod test { | ||
use crate::variant_to_arrow::make_variant_to_arrow_row_builder; | ||
use arrow::array::{Array, BinaryViewArray, LargeStringArray, StringViewArray}; | ||
use arrow::compute::CastOptions; | ||
use arrow::datatypes::DataType; | ||
use parquet_variant::EMPTY_VARIANT_METADATA_BYTES; | ||
use parquet_variant::{Variant, VariantPath}; | ||
use std::sync::Arc; | ||
|
||
#[test] | ||
fn test_large_utf8_variant_to_large_utf8_arrow() { | ||
let metadata = | ||
BinaryViewArray::from_iter_values(std::iter::repeat_n(EMPTY_VARIANT_METADATA_BYTES, 3)); | ||
let cast_options = CastOptions::default(); | ||
let path = VariantPath::default(); | ||
let capacity = 3; | ||
let data_capacity = 9usize; | ||
let data_type = Some(&DataType::LargeUtf8); | ||
|
||
let mut builder = make_variant_to_arrow_row_builder( | ||
&metadata, | ||
path, | ||
data_type, | ||
&cast_options, | ||
capacity, | ||
data_capacity, | ||
) | ||
.unwrap(); | ||
|
||
builder.append_value(Variant::from("foo")).unwrap(); | ||
builder.append_value(Variant::from("bar")).unwrap(); | ||
builder.append_value(Variant::from("baz")).unwrap(); | ||
|
||
let output = builder.finish().unwrap(); | ||
|
||
let expected = Arc::new(LargeStringArray::from(vec![ | ||
Some("foo"), | ||
Some("bar"), | ||
Some("baz"), | ||
])); | ||
|
||
assert_eq!(&output.to_data(), &expected.to_data()); | ||
} | ||
|
||
#[test] | ||
fn test_utf8_view_variant_to_utf8_view_arrow() { | ||
let metadata = | ||
BinaryViewArray::from_iter_values(std::iter::repeat_n(EMPTY_VARIANT_METADATA_BYTES, 3)); | ||
let cast_options = CastOptions::default(); | ||
let path = VariantPath::default(); | ||
let capacity = 3; | ||
let data_capacity = 9usize; | ||
let data_type = Some(&DataType::Utf8View); | ||
|
||
let mut builder = make_variant_to_arrow_row_builder( | ||
&metadata, | ||
path, | ||
data_type, | ||
&cast_options, | ||
capacity, | ||
data_capacity, | ||
) | ||
.unwrap(); | ||
|
||
builder.append_value(Variant::from("foo")).unwrap(); | ||
builder.append_value(Variant::from("bar")).unwrap(); | ||
builder.append_value(Variant::from("baz")).unwrap(); | ||
|
||
let output = builder.finish().unwrap(); | ||
|
||
let expected = Arc::new(StringViewArray::from(vec![ | ||
Some("foo"), | ||
Some("bar"), | ||
Some("baz"), | ||
])); | ||
|
||
assert_eq!(&output.to_data(), &expected.to_data()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add tests for other types(
LargeUtf8/Utf8View
) here?The test here wants to cover the
variant_get
logic, and the tests added invariant_to_arrow.rs
were to cover the logic of the builder?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shredding is not supported for
LargeUtf8/Utf8View'
per specification.I originally added the tests for them inside
variant_get
but got the error saying these types do not support shredding.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be from the
VariantArray
constructor, which invokes this code:I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:
But I'm pretty sure that doesn't need to constrain the use of
DataType::Utf8
vs.DataType::LargeUtf8
vsDataType::Utf8Vew
? (similar story for the various in-memory layouts of lists and binary values)?A similar dilemma is that the
metadata
column is supposed to be parquet BINARY type, but arrow-parquet producesBinaryViewArray
by default. Right now we replaceDataType::Binary
withDataType::BinaryView
and force a cast as needed.If we think the shredding spec forbids
LargeUtf8
orUtf8View
then we probably need to cast binary views back to normal binary as well.If we don't think the shredding spec forbids those types, then we should probably support
metadata: LargeBinaryArray
(tho the narrowing cast toBinaryArray
might fail if the offsets really don't fit in 32 bits).@alamb @cashmand, any advice here?