Skip to content

[Variant] Reserve capacity beforehand during large object building #7922

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 16, 2025

Conversation

friendlymatthew
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

In #7896, we saw that inserting a large amount of field names takes a long time -- in this case ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the IndexSet that we're inserting into.

with_field_names is an optimization to declare the field names upfront which avoids having to reallocate and rehash the entire IndexSet during field name insertion. Using this method requires at least 2 string allocations for each field name -- 1 to declare field names upfront and 1 to insert the actual field name during object building.

This PR adds a new method with_field_name_capacity which allows you to reserve space to the metadata builder, without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building

Before:
Screenshot 2025-07-13 at 12 08 43 PM

After:
Screenshot 2025-07-13 at 12 08 55 PM

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 13, 2025
@friendlymatthew
Copy link
Contributor Author

cc @alamb @scovich

Copy link
Contributor

@odysa odysa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew

I think @odysa 's suggestion at https://github.com/apache/arrow-rs/pull/7922/files#r2203535208 is worth considering too

cc @scovich

let iter = iter.into_iter();
let (min, max) = iter.size_hint();

self.field_names.reserve(max.unwrap_or(min));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks reasonable to me and is consistent with the docs
https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.size_hint

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @friendlymatthew -- I think the benchmark has some problems but otherwise this PR is good to go

@@ -495,6 +495,18 @@ fn bench_iteration_performance(c: &mut Criterion) {
group.finish();
}

fn bench_extend_metadata_builder(c: &mut Criterion) {
let list = (0..u32::MAX).map(|i| format!("id_{i}")).collect::<Vec<_>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this benchmark doesn't finish on my machine in a reasonable amount of time

time cargo bench --bench variant_builder -- bench_validate_large_nested_list

Probably b/c it creates 4B strings

I think you could get the same effect using 1000 strings or something -- I don't think there is any need for 4 billion 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only waited 2 minutes before cancelling 😆

andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ time cargo bench --bench variant_builder -- bench_validate_large_nested_list
   Compiling parquet-variant v0.1.0 (/Users/andrewlamb/Software/arrow-rs/parquet-variant)
    Finished `bench` profile [optimized] target(s) in 1.44s
     Running benches/variant_builder.rs (target/release/deps/variant_builder-5d3061d750cb04a4)
^C

real	2m13.341s
user	0m6.141s
sys	0m0.293s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha -- that's my bad!!

I'll make this more sensible

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/investigate-list-perf branch from 5d2136b to c43f2f3 Compare July 16, 2025 14:48
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/investigate-list-perf branch from c43f2f3 to c997854 Compare July 16, 2025 14:53
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @friendlymatthew and @scovich and @odysa

I also verified the benchmark runs well:

$ cargo bench --bench variant_builder -- bench_extend_metadata_builder
    Finished `bench` profile [optimized] target(s) in 0.09s
     Running benches/variant_builder.rs (target/release/deps/variant_builder-5d3061d750cb04a4)
bench_extend_metadata_builder
                        time:   [14.798 ms 14.873 ms 14.952 ms]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

@alamb alamb merged commit 0055f57 into apache:main Jul 16, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants