-
-
Notifications
You must be signed in to change notification settings - Fork 780
Reduce sorting in TopDocs #2646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4526fc0
to
267b41b
Compare
|
267b41b
to
f0f92aa
Compare
f0f92aa
to
36e1713
Compare
black_box(searcher.search(&AllQuery, &collector).unwrap()); | ||
} | ||
fn top_docs_small_deep(index: &Index) { | ||
execute_top_docs::<u64>(index, "score", Order::Asc, 10000, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the benches all test Ascending order only from what I can see, so this function could either be removed or should add some runs in descending order as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean that you would prefer that the order: Order
argument be removed? I'm fine with that, but it can also be useful to expose assumptions like that in the caller.
@@ -402,7 +402,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> { | |||
.collect::<Vec<_>>(); | |||
{ | |||
let mut rng = StdRng::from_seed([1u8; 32]); | |||
let mut index_writer = index.writer_with_num_threads(1, 200_000_000)?; | |||
let mut index_writer = index.writer_with_num_threads(8, 200_000_000)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the test to have 8 segments instead of 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's intentional. I believe that it should change to something other than 1, as it's definitely not realistic to have 1 segment in production.
It's very likely though that you would want the segments to be produced in some more deterministic order though, so if you'd rather I revert this here, that's totally fine.
fn top_docs_small_deep(index: &Index) { | ||
execute_top_docs::<u64>(index, "score", Order::Asc, 10000, 10); | ||
} | ||
fn top_docs_small_shallow(index: &Index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn top_docs_small_shallow(index: &Index) { | |
fn top_docs_top_10(index: &Index) { |
fn top_docs_small_shallow(index: &Index) { | ||
execute_top_docs::<u64>(index, "score", Order::Asc, 0, 10); | ||
} | ||
fn top_docs_large_deep(index: &Index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn top_docs_large_deep(index: &Index) { | |
fn top_docs_top_1_000_skip_10_000(index: &Index) { |
OptionalSparse = 3, | ||
} | ||
|
||
fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most fields are unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a function of having cloned this from agg_bench
, which @fulmicoton suggested.
I'll be honest: I think that we would be better off renaming agg_bench
to collector_bench
, and then putting the TopDocs
benchmark functions in there as well. Because aggregations are also "just" collectors, and both sets of benchmarks want to consume similar datasets.
If we want to keep them in two benchmark files, then how would you feel about adding an un-published benchmarks-support
crate containing this function and other support code?
text_field_few_terms => "cool", | ||
text_field_few_terms => "cool", | ||
score_field => 1u64, | ||
score_field => 1u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value distribution is quite different for the multivalue case for the score field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above: this is copypasta from agg_bench
.
Ping on this one. It seems like the primary thing to figure out is where the benchmarks should live, and if they should be separated, then whether support code should be broken out into another crate. @PSeitz, @fulmicoton: What do you prefer? |
Reduce sorting in
TopDocs
by not sorting individual segments -- they will be merged and top-n'd anyway.