-
Notifications
You must be signed in to change notification settings - Fork 976
[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Attn @alamb @friendlymatthew - I'd love to know your initial reactions to this change |
if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields { | ||
self.duplicate_fields.insert(field_id); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that insert
is fallible, we can fail-fast instead of tracking a set of duplicate fields for finish
to worry about:
if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields { | |
self.duplicate_fields.insert(field_id); | |
} | |
if self.fields.insert(field_id, field_start).is_some() && self.validate_unique_fields { | |
return Err(... duplicate field {key} ...); | |
} |
I have some proposal of that API in this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich -- it is neat to see progress towards shredding
I don't fully understand where you are heading with this PR -- specifically it seems like the assumption in this PR is that we would use a VariantBuilder
to recreate a shredded Variant.
What I was imagining is that we are able to return a Variant
that is a read only view into the underlying buffer(s) from VariantArray::value()
-- we may have to add to the Variant
enum like Variant::ShreddedObject
or something
However, if we can figure that out, then I think the "unshredding" operaton simply looks like calling VariantArray::value()
and then making a builder to write into a new VariantArray (without shredding)
Now that @friendlymatthew is close to supporting appending complex variants directly in the builder, I think it could be quite elegant
Maybe once we merge #7905 we could work on figuring out how to access shredded variants (aka manually construct StructArrays) -- and then once we had that worked out, we can work out how to unshred them
fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) { | ||
for field_name in iter { | ||
self.upsert_field_name(field_name.as_ref()); | ||
} | ||
} | ||
} | ||
|
||
/// Read-only metadata builder that validates field names against an existing metadata dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand the usecase for ReadOnlyMetadataBuilder? If a shredded variant is already guaranteed to have the correct field names, it means we don't need to update the variant or build a new one so I am not sure even how a VariantBuilder would be involved 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An unshredding operation requires two kinds of variant "building":
- Shredded fields need a full blown variant builder, because they're strongly typed and we need to encode them as variant. But the existing
VariantMetadata
already contains all their field names, so the builder only needs to look up their field ids. The spec requires the ability to unshred using a read-only metadata dictionary, so why go copying it? - Unshredded fields can be copied as-is (I would honestly prefer copying raw bytes instead of parsing and then unparsing them the way [Variant] Support appending complex variants in
VariantBuilder
#7914 seems to favor). We need a builder to keep track of the updated offsets as we interleave existing variant value bytes with newly-unshredded value bytes.
Imagine a partially shredded variant column v
:
- Fields
a
,m
andx
live in thetyped_value
column as a perfectly shredded struct- Furthermore,
x
is itself a shredded struct with its own fieldsi
andj
- Furthermore,
i
is a variant column (also using the same top-level metadata) - Furthermore,
j
is a partially shredded struct
- Furthermore,
- Furthermore,
- Fields
b
,n
, andy
live in thevalue
column as a variant object
When unshredding, we need to turn a
, b
, and x
from strongly typed values into variant objects -- the latter recursively so -- which requires a variant builder. The recursive unshredding of j
also requires a builder of its own.
Once we have all the field values in variant form, we need to create a new variant object for v
. We can copy the bytes of b
, n
and y
as a contiguous block if we want -- relying on field id sorting to give the required ordering a
, b
, m
, n
, x
, y
-- but we do have to copy those bytes in order to inject the newly created variant value bytes for a
, m
, and x
. Plus, it's entirely possible other newly-unshredded sibling values before and after v
would have required copying its value bytes even if v
were completely unshredded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shredded fields need a full blown variant builder, because they're strongly typed and we need to encode them as variant.
When unshredding, we need to turn a, b, and x from strongly typed values into variant objects -- the latter recursively so -- which requires a variant builder. The recursive unshredding of j also requires a builder of its own.
I see -- you are imagining unshredding the varaint phiscally (aka forming the bytes for a new unshredded variant value
using a VariantBuilder
) to access the shredded fields.
What if we made a view into the variant so we didn't have to copy anything to read the Variant? If we could get a Variant
that accessed the shredded fields, physically unshredding becomes an exercise in creating a new builder
and calling append_value
:
Hypothetical unshred
kernel
// Given a shredded variant that is itself a Variant and a view on the shredded columns
fn unshred(shredded_variant: Varaint) -> (Vec<u8>, Vec<u8>) {
// deep copy the shredded variant
let mut variant_builder = VariantBuilder::new()
variant_builder.append_value(shredded_variant)
}
Shredded Variants as "Views"
The question then is how to implement a view. Accessing primitive types as a Variant is straightforward
fn shredded_int_as_variant(value: i64) -> Variant {
Varant::from(value)
}
// ... other primitive types here ..
fn shredded_str_as_variant(value: &str) -> Variant {
Variant::from(value)
}
The tricky bit is accessing partially shredded objects. Maybe we could add the shredded fields to VariantObject
, something like
struct VariantObject {
// ... existing fields ..
// fields that were shredded and live elsewhere
shredded_fields: IndexMap<&str, Variant>
}
Then we can update all the accessor methods, etc.
An alternate mechanism that might be more explicit would a new explicit enumeration in Variant
like
struct ShreddedObject {
object: VariantObject,
shredded_fields: IndexMap<&str, Variant>
}
enum Variant {
ShreddedObject(..)
}
We could Box
the shredded fields to keep the size of Variant down if needed
Hows does this handle the example?
Imagine a partially shredded variant column
v
:
Fields
a
,m
andx
live in thetyped_value
column as a perfectly shredded struct
Furthermore,
x
is itself a shredded struct with its own fieldsi
andj
- Furthermore,
i
is a variant column (also using the same top-level metadata)- Furthermore,
j
is a partially shredded structFields
b
,n
, andy
live in thevalue
column as a variant object
I think the view approach could handle this example. For example
v
is aVariant::Object(..)
withshredded_fields
v
's fields in thevalue_column
containb
,n
andy
v::shredded_fields
contains theVariant
s fora
,m
, andx
-
`x` is a `VariantObject()` ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, how do we plan on storing typed_value
s? Do we plan on encoding it as a Variant
and storing the binary representation of it? Or since it is supposed to be strongly typed, we want to map to specific DataType
s, which would require something like #7921
I've been playing around with the unshredding logic and the most naive version would look something like this. I don't mean this to be the actual implementation, but I am curious how we plan on representing typed_value
.
naive reconstruct logic
pub fn reconstruct_variant(
metadata: &[u8],
value: Option<&[u8]>,
typed_value: Option<(&[u8], &[u8])>, // this is itself a variant
) -> Result<(Vec<u8>, Vec<u8>), ArrowError> {
match typed_value {
Some((typed_metadata, typed_value)) => {
let variant = Variant::try_new(typed_metadata, typed_value)?;
match variant {
Variant::Null
| Variant::Int8(_)
| Variant::Int16(_)
| Variant::Int32(_)
| Variant::Int64(_)
| Variant::Date(_)
| Variant::TimestampMicros(_)
| Variant::TimestampNtzMicros(_)
| Variant::Decimal4(_)
| Variant::Decimal8(_)
| Variant::Decimal16(_)
| Variant::Float(_)
| Variant::Double(_)
| Variant::BooleanTrue
| Variant::BooleanFalse
| Variant::Binary(_)
| Variant::String(_)
| Variant::ShortString(_) => Ok((typed_metadata.to_vec(), typed_value.to_vec())),
Variant::Object(shredded_object) => {
if let Some(value) = value {
let variant = Variant::try_new(metadata, value)?;
let partially_shredded_object =
variant.as_object().ok_or(ArrowError::InvalidArgumentError(
"partially shredded value must be an object".to_string(),
))?;
let shredded_keys: HashSet<&str> =
HashSet::from_iter(shredded_object.iter().map(|(k, _)| k));
let partially_shredded_keys: HashSet<&str> =
HashSet::from_iter(partially_shredded_object.iter().map(|(k, _)| k));
if !shredded_keys.is_disjoint(&partially_shredded_keys) {
return Err(ArrowError::InvalidArgumentError(
"object keys must be disjoint".to_string(),
));
}
// union the two objects together
let mut variant_builder = VariantBuilder::new();
let mut object_builder = variant_builder.new_object();
for (f, v) in shredded_object.iter() {
object_builder.insert(f, v);
}
for (f, v) in partially_shredded_object.iter() {
object_builder.insert(f, v);
}
object_builder.finish()?;
return Ok(variant_builder.finish());
}
Ok((typed_metadata.to_vec(), typed_value.to_vec()))
}
Variant::List(_) => {
if value.is_some() {
return Err(ArrowError::InvalidArgumentError(
"shredded array must not conflict with variant value".to_string(),
));
}
return Ok((typed_metadata.to_vec(), typed_value.to_vec()));
}
}
}
None => match value {
Some(value) => Ok((metadata.to_vec(), value.to_vec())),
None => Err(ArrowError::InvalidArgumentError(
"No value or typed value provided".to_string(),
)),
},
}
}
If we strongly type typed_value
to whatever the variant type is, the primitive variants work nicely, since we'd be storing a PrimitiveArray<T>
and we can easily index into it.
For complex variants like objects/lists, I'm having a hard time figuring out how to strongly type these values. Maybe for these cases we should just store the binary encoded version of a Variant? (metadata, value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we plan on storing
typed_value
s? Do we plan on encoding it as aVariant
and storing the binary representation of it?
As I understand it, we don't need to store typed_value
in the first place.
This is low-level variant code here, and a given variant builder only knows about the unshredded bits (it's trying to build up a value
column). Whoever manages the builder will be in charge of the shredded bits, and is also in charge of providing an appropriate metadata "builder" for the metadata
column. I think @alamb is proposing in this specific case to convert the typed_value
into a proper variant, but then "merge" it into the existing variant using a view, instead of copying (some of my thoughts on that below).
Aside: I think there are actually three kinds of metadata builders
- Today's traditional
MetadataBuilder
(renamed here asDefaultMetadataBuilder
)- Used to build up a top-level variant. Every newly encountered path is an insertion into that metadata dictionary
- A metadata builder used only for unshredding a
typed_value
column (ReadOnlyMetadataBuilder
here) - A "borrowed" parent metadata builder (used during shredding operations, not present in this PR)
- See [Variant] Define shredding schema for
VariantArrayBuilder
#7921 (comment) - Used for building up variant columns that happen to live inside the
typed_value
column of some shredded parent variant ancestor column, since the spec requires that outer metadata dictionary to contain all paths reachable through that ancestor variant column.
- See [Variant] Define shredding schema for
What if we made a view into the variant so we didn't have to copy anything to read the Variant?
Interesting. Trying to wrap my head around this idea, so what follows is likely a bit rambly:
- What do we gain by using views at this low level?
- At the high level, the
VariantArray
just stores an actual (strongly-typed)typed_value
columns. It doesn't need to unshred at all, unless/until somebody executes avariant_get
kernel with a path that ends inside thetyped_value
column.- In that case, we have no choice but to unshred that path, which is a plain old ordinary variant building exercise because the spec forbids both
value
andtyped_value
to contain the same path -- except themetadata
column is already fully built and we can just reuse it asReadOnlyMetadata
.
- In that case, we have no choice but to unshred that path, which is a plain old ordinary variant building exercise because the spec forbids both
- If somebody does a
variant_get
on any path that ends inside thevalue
column, we just extract that variant as normal (the shredded bits provably don't matter). - If somebody actually asks to unshred the entire top-level column (e.g. invokes
unshred_variant
in preparation for a write), the result is a simple struct-of-binary columns. No (low level)Variant
anywhere in sight -- they're an internal detail of that kernel.- If we're anyway forced to write out normal variant bytes, where/how does the view help along the way?
- At the high level, the
- Nesting will be "fun"
- In the example from my comment on the other PR (linked above), one could have a (horrible) nested scenario like:
v.typed_value.a.b.c.w.typed_value.x.y.z.u: VARIANT
. If we tried to unshred that using views, I'm pretty sure it would quickly fall apart, becauseVariant
is ultimately already a view on top of the (one) underlying&[u8]
it was built from. So there's no way a field of aVariant::ShreddedObject
could itself contain otherVariant::ShreddedObject
.
- In the example from my comment on the other PR (linked above), one could have a (horrible) nested scenario like:
- Intuitive claim (unsubstantiated):
- If we could get away with "views" here on the unshredding path, why can't the native builder get away with "views" internally to avoid all the repeated copying of bytes as we unwind nested builders?
- But every time I've tried to get fancy with copy-free nested builders, it gets ugly fast.
- So (again, intuitively), I would conclude that any attempt at views here would also get ugly fast.
- Or, if it turns out we can elegantly solve it here, let's be sure to apply the solution there!
- Isolated reaction: I would favor
Variant::ShreddedObject
as its own enum variant, rather than makingVariant::Object
more complex. They're pretty fundamentally different cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and then the VariantArray
's variant builder accepts the resulting Variant::ShreddedObject
and actually turns it to bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... at which point the ReadOnlyMetadataBuilder
becomes useful, because that final operation shouldn't allocate any new field ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: I think the Variant::ShreddedObject
is mostly (only?) useful on the read (unshredding) path; a writer who wants to produce shredded variant almost certainly doesn't want to work with this API because it's ultimately just a collection of variant values that were probably strongly typed at first and now have to pay a round trip through a Variant
? Shredding would be able to consume these objects, e.g. if re-shredding something that was previously shredded, but it's not the preferred approach for turning e.g. a StructArray
directly into VariantArray
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second observation: How helpful is it, to have the Variant::ShreddedObject::value
member? Should we just extract its fields directly into the map instead? Then we just have to sort the map and write them out again. We might also want to have the map key off field ids instead of &str
, using lookups into the top-level VariantMetadata
object. That way, the final to-bytes transformation doesn't need a metadata builder at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: I think the Variant::ShreddedObject is mostly (only?) useful on the read (unshredding) path; a writer who wants to produce shredded variant almost certainly doesn't want to work with this API because it's ultimately just a collection of variant values that were probably strongly typed at first and now have to pay a round trip through a Variant?
Yes I agree this would be mostly for read (rather than on the write path)
Shredding would be able to consume these objects, e.g. if re-shredding something that was previously shredded, but it's not the preferred approach for turning e.g. a StructArray directly into VariantArray.
Right
Which issue does this PR close?
Pathfinding related to the unshredding aspects of
Other pathfinding relevant to unshredding, that this PR does not attempt:
Rationale for this change
The variant shredding spec requires that the top-level variant metadata already contains all field names, whether those were shredded or not. That means an unshredding operation must use the existing
VariantMetadata
for the column, and there's no point paying to build a new one.What changes are included in this PR?
Pathfinding change, where
MetadataBuilder
is renamed asDefaultMetadataBuilder
that implements a newMetadataBuilder
trait, whichVariantMetadata
can also implement.VariantBuilder
then takes a type parameter for the specificM: MetadataBuilder
that it uses.For convenience,
VariantBuilder
becomes a type alias forGenericVariantBuilder<DefaultMetadataBuilder>
, so most callers don't need to care.This change also requires several previously-infallible methods to become fallible, because attempting to insert a non-existent field name must fail.
Are these changes tested?
Existing tests should cover the refactoring.
A couple new tests lightly exercise the new read-only metadata.
Are there any user-facing changes?
A whole bunch:
impl Default
any more