Skip to content

Reduce sorting in TopDocs #2646

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

stuhood
Copy link
Contributor

@stuhood stuhood commented Jun 7, 2025

Reduce sorting in TopDocs by not sorting individual segments -- they will be merged and top-n'd anyway.

full
top_docs_small_shallow    Memory: 97.3 KB (-0.58%)     Avg: 4.5728ms (-2.23%)      Median: 4.5707ms (-2.05%)      [4.5540ms .. 4.6014ms]
top_docs_small_deep       Memory: 6.2 MB (-0.01%)      Avg: 14.2786ms (-33.59%)    Median: 14.2741ms (-33.70%)    [14.1030ms .. 14.5956ms]
top_docs_large_shallow    Memory: 698.7 KB (-0.05%)    Avg: 6.8426ms (-8.48%)      Median: 6.8434ms (-8.26%)      [6.7612ms .. 6.9138ms]
top_docs_large_deep       Memory: 6.8 MB (-0.01%)      Avg: 14.2691ms (-37.26%)    Median: 14.2846ms (-37.28%)    [14.0335ms .. 14.3867ms]
dense
top_docs_small_shallow    Memory: 92.9 KB (-0.35%)    Avg: 4.6792ms (-2.13%)      Median: 4.6785ms (-2.01%)      [4.6559ms .. 4.7226ms]
top_docs_small_deep       Memory: 5.9 MB              Avg: 15.0375ms (-29.50%)    Median: 15.0420ms (-29.46%)    [14.8413ms .. 15.1579ms]
top_docs_large_shallow    Memory: 662.9 KB            Avg: 7.0038ms (-6.82%)      Median: 6.9954ms (-6.79%)      [6.9232ms .. 7.0746ms]
top_docs_large_deep       Memory: 6.4 MB              Avg: 15.2383ms (-32.39%)    Median: 15.2466ms (-32.32%)    [15.1046ms .. 15.3693ms]
sparse
top_docs_small_shallow    Memory: 40.7 KB (+0.59%)    Avg: 17.1796ms (+1.15%)    Median: 17.1670ms (+1.01%)    [17.1109ms .. 17.3223ms]
top_docs_small_deep       Memory: 2.9 MB              Avg: 22.8191ms (-6.98%)    Median: 22.8086ms (-6.93%)    [22.6527ms .. 23.3086ms]
top_docs_large_shallow    Memory: 324.8 KB            Avg: 18.6777ms (-0.57%)    Median: 18.6723ms (-0.64%)    [18.5654ms .. 18.8087ms]
top_docs_large_deep       Memory: 3.2 MB              Avg: 22.7570ms (-7.25%)    Median: 22.6764ms (-7.45%)    [22.5315ms .. 23.3794ms]
multivalue
top_docs_small_shallow    Memory: 93.7 KB (-4.49%)     Avg: 5.0297ms (-1.87%)      Median: 5.0248ms (-1.96%)      [4.9936ms .. 5.1738ms]
top_docs_small_deep       Memory: 5.9 MB (-5.25%)      Avg: 15.2124ms (-30.98%)    Median: 15.2013ms (-31.04%)    [15.0788ms .. 15.3836ms]
top_docs_large_shallow    Memory: 662.9 KB (-5.13%)    Avg: 7.2877ms (-8.40%)      Median: 7.2846ms (-8.60%)      [7.1833ms .. 7.3893ms]
top_docs_large_deep       Memory: 6.4 MB (-5.25%)      Avg: 15.3104ms (-34.34%)    Median: 15.2782ms (-34.49%)    [15.0508ms .. 15.6572ms]

@stuhood stuhood force-pushed the stuhood.reduce-top-n-sorting branch from 4526fc0 to 267b41b Compare June 15, 2025 00:26
@stuhood
Copy link
Contributor Author

stuhood commented Jun 15, 2025

full
top_docs_small_shallow    Memory: 97.3 KB (-0.58%)     Avg: 4.5728ms (-2.23%)      Median: 4.5707ms (-2.05%)      [4.5540ms .. 4.6014ms]
top_docs_small_deep       Memory: 6.2 MB (-0.01%)      Avg: 14.2786ms (-33.59%)    Median: 14.2741ms (-33.70%)    [14.1030ms .. 14.5956ms]
top_docs_large_shallow    Memory: 698.7 KB (-0.05%)    Avg: 6.8426ms (-8.48%)      Median: 6.8434ms (-8.26%)      [6.7612ms .. 6.9138ms]
top_docs_large_deep       Memory: 6.8 MB (-0.01%)      Avg: 14.2691ms (-37.26%)    Median: 14.2846ms (-37.28%)    [14.0335ms .. 14.3867ms]
dense
top_docs_small_shallow    Memory: 92.9 KB (-0.35%)    Avg: 4.6792ms (-2.13%)      Median: 4.6785ms (-2.01%)      [4.6559ms .. 4.7226ms]
top_docs_small_deep       Memory: 5.9 MB              Avg: 15.0375ms (-29.50%)    Median: 15.0420ms (-29.46%)    [14.8413ms .. 15.1579ms]
top_docs_large_shallow    Memory: 662.9 KB            Avg: 7.0038ms (-6.82%)      Median: 6.9954ms (-6.79%)      [6.9232ms .. 7.0746ms]
top_docs_large_deep       Memory: 6.4 MB              Avg: 15.2383ms (-32.39%)    Median: 15.2466ms (-32.32%)    [15.1046ms .. 15.3693ms]
sparse
top_docs_small_shallow    Memory: 40.7 KB (+0.59%)    Avg: 17.1796ms (+1.15%)    Median: 17.1670ms (+1.01%)    [17.1109ms .. 17.3223ms]
top_docs_small_deep       Memory: 2.9 MB              Avg: 22.8191ms (-6.98%)    Median: 22.8086ms (-6.93%)    [22.6527ms .. 23.3086ms]
top_docs_large_shallow    Memory: 324.8 KB            Avg: 18.6777ms (-0.57%)    Median: 18.6723ms (-0.64%)    [18.5654ms .. 18.8087ms]
top_docs_large_deep       Memory: 3.2 MB              Avg: 22.7570ms (-7.25%)    Median: 22.6764ms (-7.45%)    [22.5315ms .. 23.3794ms]
multivalue
top_docs_small_shallow    Memory: 93.7 KB (-4.49%)     Avg: 5.0297ms (-1.87%)      Median: 5.0248ms (-1.96%)      [4.9936ms .. 5.1738ms]
top_docs_small_deep       Memory: 5.9 MB (-5.25%)      Avg: 15.2124ms (-30.98%)    Median: 15.2013ms (-31.04%)    [15.0788ms .. 15.3836ms]
top_docs_large_shallow    Memory: 662.9 KB (-5.13%)    Avg: 7.2877ms (-8.40%)      Median: 7.2846ms (-8.60%)      [7.1833ms .. 7.3893ms]
top_docs_large_deep       Memory: 6.4 MB (-5.25%)      Avg: 15.3104ms (-34.34%)    Median: 15.2782ms (-34.49%)    [15.0508ms .. 15.6572ms]

@stuhood stuhood force-pushed the stuhood.reduce-top-n-sorting branch from f0f92aa to 36e1713 Compare July 13, 2025 23:34
black_box(searcher.search(&AllQuery, &collector).unwrap());
}
fn top_docs_small_deep(index: &Index) {
execute_top_docs::<u64>(index, "score", Order::Asc, 10000, 10);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the benches all test Ascending order only from what I can see, so this function could either be removed or should add some runs in descending order as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that you would prefer that the order: Order argument be removed? I'm fine with that, but it can also be useful to expose assumptions like that in the caller.

@@ -402,7 +402,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
.collect::<Vec<_>>();
{
let mut rng = StdRng::from_seed([1u8; 32]);
let mut index_writer = index.writer_with_num_threads(1, 200_000_000)?;
let mut index_writer = index.writer_with_num_threads(8, 200_000_000)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the test to have 8 segments instead of 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's intentional. I believe that it should change to something other than 1, as it's definitely not realistic to have 1 segment in production.

It's very likely though that you would want the segments to be produced in some more deterministic order though, so if you'd rather I revert this here, that's totally fine.

fn top_docs_small_deep(index: &Index) {
execute_top_docs::<u64>(index, "score", Order::Asc, 10000, 10);
}
fn top_docs_small_shallow(index: &Index) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn top_docs_small_shallow(index: &Index) {
fn top_docs_top_10(index: &Index) {

fn top_docs_small_shallow(index: &Index) {
execute_top_docs::<u64>(index, "score", Order::Asc, 0, 10);
}
fn top_docs_large_deep(index: &Index) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn top_docs_large_deep(index: &Index) {
fn top_docs_top_1_000_skip_10_000(index: &Index) {

OptionalSparse = 3,
}

fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most fields are unused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a function of having cloned this from agg_bench, which @fulmicoton suggested.

I'll be honest: I think that we would be better off renaming agg_bench to collector_bench, and then putting the TopDocs benchmark functions in there as well. Because aggregations are also "just" collectors, and both sets of benchmarks want to consume similar datasets.

If we want to keep them in two benchmark files, then how would you feel about adding an un-published benchmarks-support crate containing this function and other support code?

text_field_few_terms => "cool",
text_field_few_terms => "cool",
score_field => 1u64,
score_field => 1u64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value distribution is quite different for the multivalue case for the score field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above: this is copypasta from agg_bench.

@stuhood
Copy link
Contributor Author

stuhood commented Jul 22, 2025

Ping on this one. It seems like the primary thing to figure out is where the benchmarks should live, and if they should be separated, then whether support code should be broken out into another crate.

@PSeitz, @fulmicoton: What do you prefer?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants