Commit 0055f57

[Variant] Reserve capacity beforehand during large object building (#7922)
# Which issue does this PR close?

- Part of #7896

# Rationale for this change

In #7896, we saw that inserting a large number of field names takes a long time: in this case, ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into.

`with_field_names` is an optimization to declare the field names upfront, which avoids having to reallocate and rehash the entire `IndexSet` during field name insertion. Using this method requires at least 2 string allocations for each field name: 1 to declare the field names upfront and 1 to insert the actual field name during object building.

This PR adds a new method `with_field_name_capacity`, which allows you to reserve space in the metadata builder without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building.

Before:

<img width="1512" height="829" alt="Screenshot 2025-07-13 at 12 08 43 PM" src="https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c" />

After:

<img width="1512" height="805" alt="Screenshot 2025-07-13 at 12 08 55 PM" src="https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3" />
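To illustrate the reallocate-and-rehash cost described above, here is a minimal sketch that counts how often a hash set's capacity changes while names are inserted. It uses `std::collections::HashSet` as a stand-in for the `IndexSet` used by the metadata builder (an assumption made so the example needs only the standard library), and `count_capacity_changes` is a hypothetical helper, not part of the crate.

```rust
use std::collections::HashSet;

/// Hypothetical helper: inserts `n` generated names into `set` and returns
/// how many times the set's capacity changed along the way. Each change is
/// a proxy for one reallocate-and-rehash of the whole table.
fn count_capacity_changes(mut set: HashSet<String>, n: usize) -> usize {
    let mut changes = 0;
    let mut last = set.capacity();
    for i in 0..n {
        set.insert(format!("id_{i}"));
        if set.capacity() != last {
            changes += 1;
            last = set.capacity();
        }
    }
    changes
}

/// Builds a set with room reserved for `n` names before any insertion.
/// No strings are allocated here; only table space is set aside.
fn reserved_set(n: usize) -> HashSet<String> {
    let mut set = HashSet::new();
    set.reserve(n);
    set
}
```

Calling `count_capacity_changes(HashSet::new(), n)` should report several growth steps, while `count_capacity_changes(reserved_set(n), n)` should report none, since the space was set aside before the first insert.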
1 parent 7b7aad2 commit 0055f57

2 files changed: 26 additions, 1 deletion

parquet-variant/benches/variant_builder.rs

Lines changed: 14 additions & 1 deletion
@@ -495,6 +495,18 @@ fn bench_iteration_performance(c: &mut Criterion) {
     group.finish();
 }

+fn bench_extend_metadata_builder(c: &mut Criterion) {
+    let list = (0..400_000).map(|i| format!("id_{i}")).collect::<Vec<_>>();
+
+    c.bench_function("bench_extend_metadata_builder", |b| {
+        b.iter(|| {
+            std::hint::black_box(
+                VariantBuilder::new().with_field_names(list.iter().map(|s| s.as_str())),
+            );
+        })
+    });
+}
+
 criterion_group!(
     benches,
     bench_object_field_names_reverse_order,
@@ -505,7 +517,8 @@ criterion_group!(
     bench_object_partially_same_schema,
     bench_object_list_partially_same_schema,
     bench_validation_validated_vs_unvalidated,
-    bench_iteration_performance
+    bench_iteration_performance,
+    bench_extend_metadata_builder
 );

 criterion_main!(benches);

parquet-variant/src/builder.rs

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -402,6 +402,11 @@ impl<S: AsRef<str>> FromIterator<S> for MetadataBuilder {

 impl<S: AsRef<str>> Extend<S> for MetadataBuilder {
     fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) {
+        let iter = iter.into_iter();
+        let (min, _) = iter.size_hint();
+
+        self.field_names.reserve(min);
+
         for field_name in iter {
             self.upsert_field_name(field_name.as_ref());
         }
@@ -760,6 +765,13 @@ impl VariantBuilder {
         self
     }

+    /// This method reserves capacity for field names in the Variant metadata,
+    /// which can improve performance when you know the approximate number of unique field
+    /// names that will be used across all objects in the [`Variant`].
+    pub fn reserve(&mut self, capacity: usize) {
+        self.metadata_builder.field_names.reserve(capacity);
+    }
+
     /// Adds a single field name to the field name directory in the Variant metadata.
     ///
     /// This method does the same thing as [`VariantBuilder::with_field_names`] but adds one field name at a time.
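The two changes to `builder.rs` follow a common pattern: an explicit `reserve` method for callers who know the expected count, plus an `Extend` implementation that reserves using the iterator's `size_hint` lower bound. The sketch below reproduces that pattern on a hypothetical `FieldNames` type backed by `std::collections::HashSet`. The type, its helper methods, and the choice of `HashSet` are assumptions for illustration and are not the crate's `MetadataBuilder`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a field-name dictionary (not the crate's type).
#[derive(Default)]
struct FieldNames {
    names: HashSet<String>,
}

impl FieldNames {
    /// Reserves room for at least `additional` more names without
    /// allocating any of the names themselves.
    fn reserve(&mut self, additional: usize) {
        self.names.reserve(additional);
    }

    fn capacity(&self) -> usize {
        self.names.capacity()
    }

    fn len(&self) -> usize {
        self.names.len()
    }
}

impl<S: AsRef<str>> Extend<S> for FieldNames {
    fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) {
        let iter = iter.into_iter();
        // Use only the lower bound: the upper bound may be `None` or an
        // overestimate, and adapters such as `filter` report a lower bound of 0.
        let (min, _) = iter.size_hint();
        self.names.reserve(min);

        for name in iter {
            let name: &str = name.as_ref();
            // Allocate a `String` only for names not already present.
            if !self.names.contains(name) {
                self.names.insert(name.to_owned());
            }
        }
    }
}
```

With an iterator whose length is known exactly, such as `list.iter().map(|s| s.as_str())` over a `Vec<String>`, the lower bound equals the element count, so a single reservation covers the whole batch. Reserving by the lower bound may over-reserve when the input contains many duplicates, which is a trade-off of this approach.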
