-
Notifications
You must be signed in to change notification settings - Fork 982
[Variant] Add variant_kernels
benchmark
#7944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@harshmotw-db and @Samyak2 I wonder if you have time to review this PR? |
variant_builder.build() | ||
} | ||
|
||
pub fn variant_get_bench(c: &mut Criterion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variant_get benchmark is moved without modification here
@@ -1,59 +0,0 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file was actually renamed to variant_kernels
but github is rendering it as a deleted file
< self.null_probability + self.string_probability + self.number_probability | ||
{ | ||
// Generate a random number | ||
let random_number: f64 = self.rng.random_range(-1000.0..1000.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we wanna maybe further split precise numbers (decimals/ints) and floating points since they have different representations in Variant? I could also have integers as a separate class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (randomly pick between integer and float 50% of the time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just left a few minor comments
|
||
/// This function generates a vector of JSON strings which have many fields | ||
/// and a random structure (including field names) | ||
fn small_random_json_structure(count: usize) -> impl Iterator<Item = String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this "small"? The function body looks general enough to me. Is it because of the default max_depth value of 5? Even so, theoretically, the variant could be up to about (10^5)*500 bytes however unlikely the higher end of the sizes may be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right -- I actually did a test and it turns out the generated documents are actually 2635 bytes each, which I agree is not small. I renamed this to random_json
Total input JSON size: 21082241 bytes (2635.280125 bytes per string)
New name:
Benchmarking batch_json_string_to_variant random_json(2635 bytes per document): Warming up for 3.0000 s
write!(&mut self.output_buffer, "}}").unwrap(); | ||
} else { | ||
// Generate a random array | ||
let length = self.rng.random_range(1..=10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having these as hard-coded constants, is there merit to having maxArraySize
and maxObjectSize
be parameters of RandomJsonGenerator? Similar parameters could also be added for string lengths if we're okay with crowding the struct.
Edit: Having a maxNestedWidth
which controls both maxArraySize
and maxObjectSize
sounds like a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (as separate fields)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
write!(&mut self.output_buffer, "null").unwrap(); | ||
} else if random_value < self.null_probability + self.string_probability { | ||
// Generate a random string between 1 and 500 characters | ||
let length = self.rng.random_range(1..=500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a separate probability for short strings (<64 bytes) given that they have a different representation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That probably makes the most sense. I will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Samyak2 and @harshmotw-db -- I made the changes you suggested and I think the benchmarks are looking better as a result
|
||
/// This function generates a vector of JSON strings which have many fields | ||
/// and a random structure (including field names) | ||
fn small_random_json_structure(count: usize) -> impl Iterator<Item = String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right -- I actually did a test and it turns out the generated documents are actually 2635 bytes each, which I agree is not small. I renamed this to random_json
Total input JSON size: 21082241 bytes (2635.280125 bytes per string)
New name:
Benchmarking batch_json_string_to_variant random_json(2635 bytes per document): Warming up for 3.0000 s
write!(&mut self.output_buffer, "null").unwrap(); | ||
} else if random_value < self.null_probability + self.string_probability { | ||
// Generate a random string between 1 and 500 characters | ||
let length = self.rng.random_range(1..=500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That probably makes the most sense. I will update.
< self.null_probability + self.string_probability + self.number_probability | ||
{ | ||
// Generate a random number | ||
let random_number: f64 = self.rng.random_range(-1000.0..1000.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (randomly pick between integer and float 50% of the time)
write!(&mut self.output_buffer, "}}").unwrap(); | ||
} else { | ||
// Generate a random array | ||
let length = self.rng.random_range(1..=10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (as separate fields)
Please let me know if you would like additional changes |
Thanks again everyone |
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
VariantArray
andVariantArrayBuilder
for constructing Arrow Arrays of Variants #7905Rationale for this change
I wrote benchmark some changes to the json decoder in #7911 but they are non trivial. To keep #7911 easier to review I have pulled the benchmarks out to their own PR
What changes are included in this PR?
variant_get
benchmark added in [Variant] Addvariant_get
compute kernel #7919 by @Samyak2Are these changes tested?
I tested them manually and clippy CI coverage ensures they compile
Are there any user-facing changes?
No these are only benchmarks