Variant to arrow utf8 #8600
Conversation
```rust
perfectly_shredded_to_arrow_primitive_test!(
    get_variant_perfectly_shredded_utf8_as_utf8,
    DataType::Utf8,
```
Do we need to add tests for other types (LargeUtf8/Utf8View) here? The test here is meant to cover the variant_get logic, and the tests added in variant_to_arrow.rs were to cover the logic of the builder?
Shredding is not supported for LargeUtf8/Utf8View per the specification. I originally added the tests for them inside variant_get but got an error saying these types do not support shredding.
That would be from the VariantArray constructor, which invokes this code:

```rust
fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
    ...
    let new_data_type = match data_type {
        ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }
```
I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:

> Shredded values must use the following Parquet types:
>
> | Variant Type | Parquet Physical Type | Parquet Logical Type |
> |---|---|---|
> | ... | | |
> | binary | BINARY | |
> | string | BINARY | STRING |
> | ... | | |
> | array | GROUP; see Arrays below | LIST |

But I'm pretty sure that doesn't need to constrain the use of DataType::Utf8 vs. DataType::LargeUtf8 vs. DataType::Utf8View? (Similar story for the various in-memory layouts of lists and binary values.)
A similar dilemma is that the metadata column is supposed to be the parquet BINARY type, but arrow-parquet produces BinaryViewArray by default. Right now we replace DataType::Binary with DataType::BinaryView and force a cast as needed.

If we think the shredding spec forbids LargeUtf8 or Utf8View, then we probably need to cast binary views back to normal binary as well. If we don't think the shredding spec forbids those types, then we should probably support metadata: LargeBinaryArray (tho the narrowing cast to BinaryArray might fail if the offsets really don't fit in 32 bits).
```rust
TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
StringView(VariantToUtf8ViewArrowBuilder<'a>),
```
We added StringView to the PrimitiveVariantToArrowRowBuilder and the other two to StringVariantToArrowRowBuilder; is there a particular reason for this?
Allocating memory for primitive builders only requires a capacity field for the number of items to pre-allocate. For Utf8/LargeUtf8 builders, capacity and another field are required: data_capacity, the total number of (utf8) bytes to allocate.
I don't see any meaningful call sites that pass a data capacity -- only some unit tests.

Ultimately, variant_get will call make_variant_to_arrow_row_builder, and I don't think that code has any way to predict what the correct data capacity might be? How could one even define "correct" when a single value would be applied to each of potentially many string row builders that will be created, when each of those builders could see a completely different distribution of string sizes and null values?

This is very different from the row capacity value, which IS precisely known and applies equally to all builders variant_get might need to create.

Also -- these capacities are just pre-allocation hints; passing too large a hint temporarily wastes a bit of memory, and passing too small a hint just means one or more internal reallocations.

I would vote to just choose a reasonable default "average string size" and multiply that by the row count to obtain a data capacity hint when needed. TBD whether that average string size should be a parameter that originates with the caller of variant_get and gets plumbed all the way through -- but that seems like a really invasive API change for very little benefit. Seems like a simple const would be much better.
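The suggested default could look like the following sketch. The names AVERAGE_STRING_LENGTH and data_capacity_hint are illustrative assumptions, not the PR's actual API; the point is just that the known row count is scaled by a fixed guess.

```rust
// Assumed default average string size; any reasonable small constant works,
// since the result is only a pre-allocation hint, not a correctness bound.
const AVERAGE_STRING_LENGTH: usize = 16;

// Hypothetical helper: derive a data-capacity hint from the known row count.
fn data_capacity_hint(row_capacity: usize) -> usize {
    row_capacity.saturating_mul(AVERAGE_STRING_LENGTH)
}

fn main() {
    // Too-large or too-small hints only affect allocation, never correctness.
    println!("{}", data_capacity_hint(1024)); // 16384
}
```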
A big benefit of the simpler approach to data capacity: All the string builders are, in fact, primitive builders (see the macro invocations below) -- so we can just add three new enum variants to the primitive row builder enum and call it done.
Co-authored-by: Congxian Qiu <qcx978132955@gmail.com>
Thanks for tackling this! It seems to uncover a couple of issues that might need some guidance from experts; see comments.
```rust
impl<'a> StringVariantToArrowRowBuilder<'a> {
    pub fn append_null(&mut self) -> Result<()> {
        use StringVariantToArrowRowBuilder::*;
        match self {
            Utf8(b) => b.append_null(),
            LargeUtf8(b) => b.append_null(),
        }
    }
```
I don't see any string-specific logic that would merit a nested enum like this?
Can we make this builder generic and use it in two new variants of the top-level enum?
```rust
define_variant_to_primitive_builder!(
    struct VariantToUtf8ArrowRowBuilder<'a>
    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
    |value| value.as_string(),
    type_name: "Utf8"
);

define_variant_to_primitive_builder!(
    struct VariantToLargeUtf8ArrowBuilder<'a>
    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
    |value| value.as_string(),
    type_name: "LargeUtf8"
);

define_variant_to_primitive_builder!(
    struct VariantToUtf8ViewArrowBuilder<'a>
    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
    |value| value.as_string(),
    type_name: "Utf8View"
);
```
Check out the ListLikeArray trait in arrow_to_variant.rs -- I suspect a StringLikeArrayBuilder trait would be very helpful here, because the "shape" of all the string arrays is very similar (even tho the specific array types and possibly some method names might differ).
Suggested change:

```diff
-define_variant_to_primitive_builder!(
-    struct VariantToUtf8ArrowRowBuilder<'a>
-    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
-    |value| value.as_string(),
-    type_name: "Utf8"
-);
-define_variant_to_primitive_builder!(
-    struct VariantToLargeUtf8ArrowBuilder<'a>
-    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
-    |value| value.as_string(),
-    type_name: "LargeUtf8"
-);
-define_variant_to_primitive_builder!(
-    struct VariantToUtf8ViewArrowBuilder<'a>
-    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
-    |value| value.as_string(),
-    type_name: "Utf8View"
-);
+define_variant_to_primitive_builder!(
+    struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
+    |capacity| -> B { B::with_capacity(capacity) },
+    |value| value.as_string(),
+    type_name: B::type_name()
+);
```
where

```rust
trait StringLikeArrayBuilder: ArrayBuilder {
    fn type_name() -> &'static str;
    fn with_capacity(capacity: usize) -> Self;
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
}

impl StringLikeArrayBuilder for StringViewBuilder {
    ...
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
    ...
    fn with_capacity(capacity: usize) -> Self {
        Self::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH)
    }
    ...
}
```
As noted in a different comment, we don't have any meaningful way to predict the needed data_capacity of a string builder, so IMO <GenericStringBuilder as StringLikeArrayBuilder>::with_capacity should just scale the item capacity by some reasonable guess at the average string size, and call that the data capacity.
Which issue does this PR close?
Rationale for this change

Add support for converting Variant to Utf8, LargeUtf8, and Utf8View. This needs a new builder, VariantToStringArrowRowBuilder, because LargeUtf8 and Utf8View are not ArrowPrimitiveTypes.
What changes are included in this PR?

- Added a data_capacity parameter to make_string_variant_to_arrow_row_builder to support string types.
- Updated the call to make_string_variant_to_arrow_row_builder in variant_get to include the variable.

Are these changes tested?
Added a variant_get test for the Utf8 type and created two separate tests for LargeUtf8 and Utf8View, because these types can't be shredded.
Are there any user-facing changes?
No