Skip to content

[Variant] Add variant_kernels benchmark #7944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 18, 2025
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 16, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

I wrote benchmark some changes to the json decoder in #7911 but they are non trivial. To keep #7911 easier to review I have pulled the benchmarks out to their own PR

What changes are included in this PR?

  1. Add new json benchmark
  2. Include the variant_get benchmark added in [Variant] Add variant_get compute kernel #7919 by @Samyak2

Are these changes tested?

I tested them manually and clippy CI coverage ensures they compile

Are there any user-facing changes?

No these are only benchmarks

@alamb
Copy link
Contributor Author

alamb commented Jul 16, 2025

@harshmotw-db and @Samyak2 I wonder if you have time to review this PR?

variant_builder.build()
}

pub fn variant_get_bench(c: &mut Criterion) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variant_get benchmark is moved without modification here

@@ -1,59 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was actually renamed to variant_kernels but github is rendering it as a deleted file

< self.null_probability + self.string_probability + self.number_probability
{
// Generate a random number
let random_number: f64 = self.rng.random_range(-1000.0..1000.0);
Copy link
Contributor

@harshmotw-db harshmotw-db Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we wanna maybe further split precise numbers (decimals/ints) and floating points since they have different representations in Variant? I could also have integers as a separate class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (randomly pick between integer and float 50% of the time)

Copy link
Contributor

@harshmotw-db harshmotw-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just left a few minor comments


/// This function generates a vector of JSON strings which have many fields
/// and a random structure (including field names)
fn small_random_json_structure(count: usize) -> impl Iterator<Item = String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this "small"? The function body looks general enough to me. Is it because of the default max_depth value of 5? Even so, theoretically, the variant could be up to about (10^5)*500 bytes however unlikely the higher end of the sizes may be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right -- I actually did a test and it turns out the generated documents are actually 2635 bytes each, which I agree is not small. I renamed this to random_json

Total input JSON size: 21082241 bytes (2635.280125 bytes per string)

New name:

Benchmarking batch_json_string_to_variant random_json(2635 bytes per document): Warming up for 3.0000 s

write!(&mut self.output_buffer, "}}").unwrap();
} else {
// Generate a random array
let length = self.rng.random_range(1..=10);
Copy link
Contributor

@harshmotw-db harshmotw-db Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having these as hard-coded constants, is there merit to having maxArraySize and maxObjectSize be parameters of RandomJsonGenerator? Similar parameters could also be added for string lengths if we're okay with crowding the struct.

Edit: Having a maxNestedWidth which controls both maxArraySize and maxObjectSize sounds like a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (as separate fields)

Copy link
Contributor

@Samyak2 Samyak2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

write!(&mut self.output_buffer, "null").unwrap();
} else if random_value < self.null_probability + self.string_probability {
// Generate a random string between 1 and 500 characters
let length = self.rng.random_range(1..=500);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a separate probability for short strings (<64 bytes) given that they have a different representation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That probably makes the most sense. I will update.

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Samyak2 and @harshmotw-db -- I made the changes you suggested and I think the benchmarks are looking better as a result


/// This function generates a vector of JSON strings which have many fields
/// and a random structure (including field names)
fn small_random_json_structure(count: usize) -> impl Iterator<Item = String> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right -- I actually did a test and it turns out the generated documents are actually 2635 bytes each, which I agree is not small. I renamed this to random_json

Total input JSON size: 21082241 bytes (2635.280125 bytes per string)

New name:

Benchmarking batch_json_string_to_variant random_json(2635 bytes per document): Warming up for 3.0000 s

write!(&mut self.output_buffer, "null").unwrap();
} else if random_value < self.null_probability + self.string_probability {
// Generate a random string between 1 and 500 characters
let length = self.rng.random_range(1..=500);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That probably makes the most sense. I will update.

< self.null_probability + self.string_probability + self.number_probability
{
// Generate a random number
let random_number: f64 = self.rng.random_range(-1000.0..1000.0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (randomly pick between integer and float 50% of the time)

write!(&mut self.output_buffer, "}}").unwrap();
} else {
// Generate a random array
let length = self.rng.random_range(1..=10);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (as separate fields)

@alamb
Copy link
Contributor Author

alamb commented Jul 17, 2025

Please let me know if you would like additional changes

@alamb alamb merged commit 71dd48e into apache:main Jul 18, 2025
10 checks passed
@alamb
Copy link
Contributor Author

alamb commented Jul 18, 2025

Thanks again everyone

@alamb alamb deleted the alamb/json_benches branch July 18, 2025 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants