-
Notifications
You must be signed in to change notification settings - Fork 976
[Variant] test: add variant object tests with different sizes #7896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
843eb67
to
34bb605
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this PR @odysa -- it is a great start. I left some suggestions, let me know what you think.
FYI @codephage2020
parquet-variant/src/builder.rs
Outdated
} | ||
|
||
#[test] | ||
#[ignore] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this one ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inserting 2²⁴ fields took 66 seconds on my laptop. I've marked it as ignored for now while I explore a faster approach.
parquet-variant/src/builder.rs
Outdated
Variant::Int32(count - 1) | ||
); | ||
|
||
let header_byte = first_byte_from_slice(&value).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of extracting the field value directly, you could also use the API in VariantHeader, like @codephage2020 did in #7876
Something like
let expected_offset_size = OffsetSizeBytes::Three;
assert_eq!(obj.header.field_offset_size, expected_offset_size)
You would also have to put the tests into object.rs
so the header field was accessable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Alamb. The person you may want to mention is him @klion26 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I apologize, I tagged the wrong collaborator. Please feel free to comment and review too @codephage2020
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ad2b013
to
544ff48
Compare
} | ||
|
||
#[test] | ||
fn test_variant_object_16777217_elements() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test takes about 45 seconds to run on my laptop. THe next longest test takes less than a second
Is there any way to make this faster?
PASS [ 45.136s] parquet-variant variant::object::tests::test_variant_object_16777217_elements
PASS [ 0.126s] parquet-variant variant::object::tests::test_variant_object_65535_bytes_child_data_2_byte_offsets
PASS [ 0.144s] parquet-variant variant::object::tests::test_variant_object_65537_elements
PASS [ 0.697s] parquet-variant variant::list::tests::test_large_variant_list_with_total_child_length_between_2_pow_24_and_2_pow_32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about something like batch write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my brief profiling suggests that a large amount of the time was inserting the field names in (basically doing the hash table lookup). I am not sure how we would fix that -- you could try pre-populating the field names in the builder I suppose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested pre-populating.
let field_names: Vec<String> = (0..count).map(|val| val.to_string()).collect();
let s = field_names.iter().map(|s| s.as_str());
let start = std::time::Instant::now();
let mut builder = VariantBuilder::new().with_field_names(s);
println!("Pre-populating field names: {:?}", start.elapsed());
Pre-populating field names: 20.150840416s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly it still seems to take 45 seconds on my machine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @odysa -- this looks really nice -- now the only thing left is to figure out how to avoid spending 45 seconds on a test
cc @friendlymatthew in case you have some other thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I took a look and was able to get the test case from ~45s to ~33s on my machine. (To be clear, half of this time is spent formatting 2**24 field names.)
The main wins were:
- reserve capacity upfront #7922
- avoid full validation when reconstructing the variant
|
||
obj.finish().unwrap(); | ||
let (metadata, value) = builder.finish(); | ||
let variant = Variant::try_new(&metadata, &value).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the builder already does validation, so we can save a lot of time by doing unchecked validation when reconstructing the variant.
Something like:
let variant = Variant::new(&metadata, &value);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. change applied.
Variant::Int32(count - 1) | ||
); | ||
|
||
let header_byte = first_byte_from_slice(&value).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Do we need to compute the header here? Seems we can get the header from
obj.header
- Do we need to verify the
field_offset_size
inVariantObjectHeader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- done
- It's tested in
test_variant_object_with_large_data
@@ -400,6 +400,8 @@ impl<'m, 'v> VariantObject<'m, 'v> { | |||
|
|||
#[cfg(test)] | |||
mod tests { | |||
use crate::VariantBuilder; | |||
|
|||
use super::*; | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment in line 411 seems incorrect -- So header byte = 00_0_1_0001 = 0x10
. Maybe we can improve it also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to 0x11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @odysa
In order to make progress, I would like to merge this PR in.
I took the liberty of commenting out the test that takes 45 seconds and I'll merge this one in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @odysa
In order to make progress, I would like to merge this PR in.
I took the liberty of commenting out the test that takes 45 seconds and I'll merge this one in.
Thanks gaain @odysa |
…7922) # Which issue does this PR close? - Part of #7896 # Rationale for this change In #7896, we saw that inserting a large amount of field names takes a long time -- in this case ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into. `with_field_names` is an optimization to declare the field names upfront which avoids having to reallocate and rehash the entire `IndexSet` during field name insertion. Using this method requires at least 2 string allocations for each field name -- 1 to declare field names upfront and 1 to insert the actual field name during object building. This PR adds a new method `with_field_name_capacity` which allows you to reserve space to the metadata builder, without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building Before: <img width="1512" height="829" alt="Screenshot 2025-07-13 at 12 08 43 PM" src="https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c" /> After: <img width="1512" height="805" alt="Screenshot 2025-07-13 at 12 08 55 PM" src="https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3" />
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
VariantObjects
s #7821 .Rationale for this change
VariantObject with between 2^8 and 2^16 elements ( field_id_size_minus_1 = 1, 2 byte field ids)
VariantObject with between 2^16 and 2^24 elements ( field_id_size_minus_1 = 2, 3 byte field ids)
VariantObject with between 2^24 and 2^32 elements ( field_id_size_minus_1 = 3, 4 byte field ids)
VariantObject with total child data length between 2^8 and 2^16 elements ( field_offset_size_minus_1 = 1, 2 byte field offsets)
VariantObject with total child data length between 2^16 and 2^24 elements ( field_offset_size_minus_1 = 2, 3 byte field offsets)
VariantObject with total child data length between 2^24 and 2^32 elements ( field_offset_size_minus_1 = 3, 4 byte field offsets)