Skip to content

[Variant] test: add variant object tests with different sizes #7896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 15, 2025

Conversation

odysa
Copy link
Contributor

@odysa odysa commented Jul 11, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

  • VariantObject with between 2^8 and 2^16 elements ( field_id_size_minus_1 = 1, 2 byte field ids)

  • VariantObject with between 2^16 and 2^24 elements ( field_id_size_minus_1 = 2, 3 byte field ids)

  • VariantObject with between 2^24 and 2^32 elements ( field_id_size_minus_1 = 3, 4 byte field ids)

    • Inserting 2^24 + 1 elements takes too long.
  • VariantObject with total child data length between 2^8 and 2^16 elements ( field_offset_size_minus_1 = 1, 2 byte field offsets)

  • VariantObject with total child data length between 2^16 and 2^24 elements ( field_offset_size_minus_1 = 2, 3 byte field offsets)

  • VariantObject with total child data length between 2^24 and 2^32 elements ( field_offset_size_minus_1 = 3, 4 byte field offsets)

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2025
@odysa odysa force-pushed the test/variantObjects branch from 843eb67 to 34bb605 Compare July 11, 2025 03:04
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR @odysa -- it is a great start. I left some suggestions, let me know what you think.

FYI @codephage2020

}

#[test]
#[ignore]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this one ignored?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inserting 2²⁴ fields took 66 seconds on my laptop. I've marked it as ignored for now while I explore a faster approach.

Variant::Int32(count - 1)
);

let header_byte = first_byte_from_slice(&value).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of extracting the field value directly, you could also use the API in VariantHeader, like @codephage2020 did in #7876

Something like

let expected_offset_size = OffsetSizeBytes::Three;
assert_eq!(obj.header.field_offset_size, expected_offset_size)

You would also have to put the tests into object.rs so the header field was accessable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Alamb. The person you may want to mention is him @klion26 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I apologize, I tagged the wrong collaborator. Please feel free to comment and review too @codephage2020

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@odysa odysa force-pushed the test/variantObjects branch from ad2b013 to 544ff48 Compare July 12, 2025 14:25
}

#[test]
fn test_variant_object_16777217_elements() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test takes about 45 seconds to run on my laptop. THe next longest test takes less than a second

Is there any way to make this faster?

        PASS [  45.136s] parquet-variant variant::object::tests::test_variant_object_16777217_elements
   PASS [   0.126s] parquet-variant variant::object::tests::test_variant_object_65535_bytes_child_data_2_byte_offsets
    PASS [   0.144s] parquet-variant variant::object::tests::test_variant_object_65537_elements
    PASS [   0.697s] parquet-variant variant::list::tests::test_large_variant_list_with_total_child_length_between_2_pow_24_and_2_pow_32

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something like batch write.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my brief profiling suggests that a large amount of the time was inserting the field names in (basically doing the hash table lookup). I am not sure how we would fix that -- you could try pre-populating the field names in the builder I suppose

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested pre-populating.

let field_names: Vec<String> = (0..count).map(|val| val.to_string()).collect();
let s = field_names.iter().map(|s| s.as_str());
let start = std::time::Instant::now();
let mut builder = VariantBuilder::new().with_field_names(s);
println!("Pre-populating field names: {:?}", start.elapsed());

Pre-populating field names: 20.150840416s

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly it still seems to take 45 seconds on my machine

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @odysa -- this looks really nice -- now the only thing left is to figure out how to avoid spending 45 seconds on a test

cc @friendlymatthew in case you have some other thoughts

@odysa
Copy link
Contributor Author

odysa commented Jul 13, 2025

I profiled it. It shows frequent hash table resizing during insertion operations.
The goal of this issue is to explore methods for fast populating a hashmap with 2^24 entries.
image

Copy link
Contributor

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I took a look and was able to get the test case from ~45s to ~33s on my machine. (To be clear, half of this time is spent formatting 2**24 field names.)

The main wins were:

  • reserve capacity upfront #7922
  • avoid full validation when reconstructing the variant


obj.finish().unwrap();
let (metadata, value) = builder.finish();
let variant = Variant::try_new(&metadata, &value).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the builder already does validation, so we can save a lot of time by doing unchecked validation when reconstructing the variant.

Something like:

let variant = Variant::new(&metadata, &value);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. change applied.

Variant::Int32(count - 1)
);

let header_byte = first_byte_from_slice(&value).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Do we need to compute the header here? Seems we can get the header from obj.header
  2. Do we need to verify the field_offset_size in VariantObjectHeader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. done
  2. It's tested in test_variant_object_with_large_data

@@ -400,6 +400,8 @@ impl<'m, 'v> VariantObject<'m, 'v> {

#[cfg(test)]
mod tests {
use crate::VariantBuilder;

use super::*;

#[test]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment in line 411 seems incorrect -- So header byte = 00_0_1_0001 = 0x10. Maybe we can improve it also

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to 0x11

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @odysa

In order to make progress, I would like to merge this PR in.

I took the liberty of commenting out the test that takes 45 seconds and I'll merge this one in.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @odysa

In order to make progress, I would like to merge this PR in.

I took the liberty of commenting out the test that takes 45 seconds and I'll merge this one in.

@alamb alamb merged commit c40830e into apache:main Jul 15, 2025
12 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 15, 2025

Thanks gaain @odysa

alamb pushed a commit that referenced this pull request Jul 16, 2025
…7922)

# Which issue does this PR close?

- Part of #7896

# Rationale for this change

In #7896, we saw that inserting a
large amount of field names takes a long time -- in this case ~45s to
insert 2**24 field names. The bulk of this time is spent just allocating
the strings, but we also see quite a bit of time spent reallocating the
`IndexSet` that we're inserting into.

`with_field_names` is an optimization to declare the field names upfront
which avoids having to reallocate and rehash the entire `IndexSet`
during field name insertion. Using this method requires at least 2
string allocations for each field name -- 1 to declare field names
upfront and 1 to insert the actual field name during object building.

This PR adds a new method `with_field_name_capacity` which allows you to
reserve space to the metadata builder, without needing to allocate the
field names themselves upfront. In this case, we see a modest
performance improvement when inserting the field names during object
building

Before: 
<img width="1512" height="829" alt="Screenshot 2025-07-13 at 12 08
43 PM"
src="https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c"
/>


After:
<img width="1512" height="805" alt="Screenshot 2025-07-13 at 12 08
55 PM"
src="https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3"
/>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Tests for creating "large" VariantObjectss
5 participants