[Variant] Reserve capacity beforehand during large object building #7922
Conversation
Thank you!
Thank you @friendlymatthew
I think @odysa's suggestion at https://github.com/apache/arrow-rs/pull/7922/files#r2203535208 is worth considering too
cc @scovich
parquet-variant/src/builder.rs
Outdated
let iter = iter.into_iter();
let (min, max) = iter.size_hint();

self.field_names.reserve(max.unwrap_or(min));
I think this looks reasonable to me and is consistent with the docs
https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.size_hint
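For context, here is a minimal sketch of what reserving from `size_hint` looks like in isolation. It is not code from this PR and uses a plain `Vec<String>` instead of the builder's `IndexSet`, purely for illustration:

```rust
// Minimal sketch, not code from this PR: reserve ahead of an extend-style
// insert using Iterator::size_hint. size_hint returns (lower, Option<upper>);
// taking the upper bound when present and otherwise the lower bound only
// reserves as much as the iterator itself claims it may yield.
fn extend_names<I: IntoIterator<Item = String>>(names: &mut Vec<String>, iter: I) {
    let iter = iter.into_iter();
    let (min, max) = iter.size_hint();
    names.reserve(max.unwrap_or(min));
    for name in iter {
        names.push(name);
    }
}
```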
Thanks @friendlymatthew -- I think the benchmark has some problems but otherwise this PR is good to go
@@ -495,6 +495,18 @@ fn bench_iteration_performance(c: &mut Criterion) {
    group.finish();
}

fn bench_extend_metadata_builder(c: &mut Criterion) {
    let list = (0..u32::MAX).map(|i| format!("id_{i}")).collect::<Vec<_>>();
FWIW this benchmark doesn't finish on my machine in a reasonable amount of time
time cargo bench --bench variant_builder -- bench_validate_large_nested_list
Probably b/c it creates 4B strings
I think you could get the same effect using 1000 strings or something -- I don't think there is any need for 4 billion 🤔
I only waited 2 minutes before cancelling 😆
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ time cargo bench --bench variant_builder -- bench_validate_large_nested_list
Compiling parquet-variant v0.1.0 (/Users/andrewlamb/Software/arrow-rs/parquet-variant)
Finished `bench` profile [optimized] target(s) in 1.44s
Running benches/variant_builder.rs (target/release/deps/variant_builder-5d3061d750cb04a4)
^C
real 2m13.341s
user 0m6.141s
sys 0m0.293s
Haha -- that's my bad!!
I'll make this more sensible
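For illustration, a scaled-down benchmark along the lines suggested above could look roughly like this. It is a sketch only: the 1000-name count follows the reviewer's suggestion, and the exact `with_field_names` signature is an assumption, not the code that landed:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use parquet_variant::VariantBuilder;

// Sketch of a smaller bench_extend_metadata_builder: 1000 field names is
// enough to exercise the reserve-ahead path without allocating billions of
// strings up front.
fn bench_extend_metadata_builder(c: &mut Criterion) {
    let field_names: Vec<String> = (0..1000).map(|i| format!("id_{i}")).collect();
    c.bench_function("bench_extend_metadata_builder", |b| {
        b.iter(|| {
            // with_field_names pre-registers the names in the metadata
            // builder, which is where the reservation happens.
            let builder =
                VariantBuilder::new().with_field_names(field_names.iter().map(String::as_str));
            std::hint::black_box(builder)
        })
    });
}

criterion_group!(benches, bench_extend_metadata_builder);
criterion_main!(benches);
```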
Thanks again @friendlymatthew and @scovich and @odysa
I also verified the benchmark runs well:
$ cargo bench --bench variant_builder -- bench_extend_metadata_builder
Finished `bench` profile [optimized] target(s) in 0.09s
Running benches/variant_builder.rs (target/release/deps/variant_builder-5d3061d750cb04a4)
bench_extend_metadata_builder
time: [14.798 ms 14.873 ms 14.952 ms]
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
4 (4.00%) high mild
2 (2.00%) high severe
Which issue does this PR close?
Rationale for this change
In #7896, we saw that inserting a large number of field names takes a long time -- in this case ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into.

`with_field_names` is an optimization to declare the field names upfront, which avoids having to reallocate and rehash the entire `IndexSet` during field name insertion. Using this method requires at least 2 string allocations for each field name: 1 to declare the field name upfront and 1 to insert the actual field name during object building.

This PR adds a new method, `with_field_name_capacity`, which allows you to reserve space in the metadata builder without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building.
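To make the difference concrete, here is a rough sketch of the two approaches (builder construction only; the method names come from this description, everything else is illustrative):

```rust
use parquet_variant::VariantBuilder;

fn main() {
    // Existing approach: declare the field names upfront. This avoids
    // rehashing the IndexSet while objects are built, but each name is
    // allocated here and then again when the field is actually inserted.
    let _with_names =
        VariantBuilder::new().with_field_names(["id", "name", "score"].into_iter());

    // This PR: reserve capacity only. No upfront string allocations, while
    // still avoiding repeated reallocation of the IndexSet as field names
    // are inserted during object building.
    let _with_capacity = VariantBuilder::new().with_field_name_capacity(3);
}
```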

After:
