Fixes for `TopDocs::order_by_string_fast_field` and `TopNComputer` #2651

stuhood · 2025-06-17T04:21:58Z

#2642 added TopDocs::order_by_string_fast_field, but the test coverage in that PR did not catch two bugs:

The TopNComputer was refactored in feat(aggregators/metric): Add a top_hits aggregator #2198 to support both ascending and descending order. But that refactor did not adjust the TopNComputer::threshold to include the ComparableDoc, which is what provides the relevant Ord implementation.
- The effect of this was that the threshold comparison for TopNComputer was still hardcoded for descending order. Unit tests which tested with fewer than limit * 2 items OR which didn't have an item less than the threshold arrive after the first limit * 2 items (when the threshold is computed for the first time) would not expose the bug.
sorted_ords_to_term_cb only supported unique ords, likely because its previous consumers in aggregates would always de-dupe ords before lookup.
- It is much easier for consumers who would like to look up non-unique term ordinals to allow them to be duplicated (as it introduces only two comparisons in a path that is always IO bound).

src/collector/top_score_collector.rs

The `TopNComputer` was refactored in 0e04ec3 to support both ascending and descending order. But that refactor did not adjust the `TopNComputer::threshold` to include the `ComparableDoc`, which is what provides the relevant `Ord` implementation. Add a simple failing test, make it pass, and then expand the tests into proptests (which are already in use elsewhere in the codebase). Upstream at: quickwit-oss#2651

fulmicoton-dd · 2025-06-19T07:00:55Z

@stuhood I don't understand what this is about. Can you improve the description? In particular, the title suggests this is a bugfix... Is this really a bugfix? It then points to commits that are not in tantivy.

stuhood · 2025-06-19T21:44:57Z

@stuhood I don't understand what this is about. Can you improve the description? In particular, the title suggests this is a bugfix... Is this really a bugfix? It then points to commits that are not in tantivy.

Yes, it is a bugfix. Have improved the description.

I don't quite understand why that commit isn't rendering as part of Tantivy, but: it is #2198 ... which claims to have been merged as dd7e745 ... which the github UI claims is not in Tantivy? Strange.

fulmicoton · 2025-06-20T08:33:00Z

@PSeitz can you review?

src/collector/top_score_collector.rs

PSeitz-dd · 2025-06-24T09:29:10Z

Can you add a high level test, that would have failed?

PSeitz · 2025-06-24T10:56:11Z

I fixed the failing CI

…omparison.

stuhood · 2025-06-28T22:41:30Z

Have rebased and pushed two more commits:

Allowing sorted_ords_to_term_cb to handle duplicate (but still sorted) ords. We were using this patch downstream to allow for string sorting, but it hadn't made its way upstream, and caused the TopDocs::order_by_string_fast_field method added in Add string fast field support to TopDocs. #2642 to be broken for duplicate matches.
Adding a proptest for order_by_string_fast_field, which caught the above issue, as well as reproducing the issue with TopNComputer.

PSeitz · 2025-06-29T18:27:51Z

sstable/src/dictionary.rs

+            //
+            // this allows the input TermOrdinal iterator to contain duplicates, so long as it's
+            // still sorted
+            if Some(ord) != prev_ord {


This changes the logic and may cause some unexpected surprises

I think it would be better for the caller to filter duplicates in the incoming iterator, e.g. via dedup

This changes the logic and may cause some unexpected surprises

Duplicates were not supported before (you'd get an assertion failure), so it should only make a case work that didn't before? I can add additional tests if that would help.

I think it would be better for the caller to filter duplicates in the incoming iterator, e.g. via dedup

That makes it more difficult for consumers to then zip the results back to a collection: they need a mapping table or HashMap to look up offsets, rather than being able to literally zip them, as we do here:

tantivy/src/collector/top_score_collector.rs

Lines 230 to 232 in 988c2b3

terms

.into_iter()

.zip(top_ordinals)

Which assertion? So this one is not strict enough? assert!(ord >= current_ordinal);

cc @trinity-1686a

assert!(ord >= current_ordinal); would be the assertion that failed before. At the end of the main loop, current_ordinal is set to ord + 1, not ord. This is an artifact of how the sub loop works with current_ordinal..=ord, a range which isn't empty when current_ordinal==ord, and/or that current_ordinal is initialized to 0 without calling advance (so the current position is more like minus 1).

I think current_ordinal is a miss-nommer, it isn't the ordinal of the entry we have in bytes, it's the ordinal of the entry we will have after calling current_sstable_delta_reader.advance().

this change does fix the issue, but in a non-obvious way: if we pull for 4 (chosen by a fair dice roll) two times, the first iteration will bring current_ordinal to 5, the second will skip the assertion due to the newly added if, and do a for over 5..=4 which is an empty range, so everything works out. The fix has nothing to do with not calling get_block_with_ord() though, that function is stateless, it just skip the assertion in the specific case where it's too restrictive.

I think a better fix is to rename current_ordinal to next_ordinal, because that's what it is, and change the assertion to assert!(ord+1 >= next_ordinal);. There is probably an alternative fix where current_ordinal can keep its name, and current_sstable_delta_reader.advance() is called once just after initializing the delta reader, but i have a hard time wrapping my head around whether it's sound (not in the rust unsafety sense) when loading a new block

PSeitz · 2025-06-29T18:32:24Z

src/collector/top_score_collector.rs

@@ -913,7 +920,7 @@ pub struct TopNComputer<Score, D, const REVERSE_ORDER: bool = true> {
    /// The buffer reverses sort order to get top-semantics instead of bottom-semantics
    buffer: Vec<ComparableDoc<Score, D, REVERSE_ORDER>>,
    top_n: usize,
-    pub(crate) threshold: Option<Score>,
+    pub(crate) threshold: Option<ComparableDoc<Score, D, REVERSE_ORDER>>,


TopNComputer is quite performance sensitive. I think this changes the memory layout of the struct. Did you test the performance of the change?

It may cause a performance regression in TopDocs.

I've added some benchmarks in #2646: are you able to help me get that one landed?

We usually bench using https://github.com/quickwit-oss/search-benchmark-game. These benchmarks are highlevel enough to give a good idea of the performance.

I ran the benchmark and this causes a performance regression

fulmicoton-dd · 2025-07-15T01:34:03Z

@PSeitz @stuhood
If this is a bug in main we need to fix it right away.
If the PR is not ready for merge, let's revert the PR introducing the bug

PSeitz · 2025-07-15T02:05:01Z

@PSeitz @stuhood If this is a bug in main we need to fix it right away. If the PR is not ready for merge, let's revert the PR introducing the bug

As I understand it, this is a bug that exists since tantivy 0.22 and ordering by a fast field in ASC order is affected

stuhood · 2025-07-16T20:44:25Z

@PSeitz : Thanks for opening #2672. Have broken out #2673 for the rest of the fix here.

This was referenced Jun 17, 2025

Add string fast field support to TopDocs. #2642

Merged

Fix TopNComputer threshold comparison paradedb/tantivy#51

Merged

stuhood commented Jun 17, 2025

View reviewed changes

src/collector/top_score_collector.rs Show resolved Hide resolved

fulmicoton requested a review from PSeitz June 20, 2025 08:32

PSeitz-dd reviewed Jun 24, 2025

View reviewed changes

src/collector/top_score_collector.rs Outdated Show resolved Hide resolved

PSeitz-dd reviewed Jun 24, 2025

View reviewed changes

src/collector/top_score_collector.rs Show resolved Hide resolved

PSeitz-dd reviewed Jun 24, 2025

View reviewed changes

src/collector/top_score_collector.rs Show resolved Hide resolved

stuhood requested a review from PSeitz-dd June 27, 2025 04:23

stuhood and others added 6 commits June 28, 2025 13:31

Add a failing test.

64631f4

Hold onto the threshold as a ComparableDoc, and fix the threshold c…

e193bd6

…omparison.

Convert to a proptest.

e521047

Review feedback.

cda3557

adjust Dictionary::sorted_ords_to_term_cb() to allow duplicates (#8)

5613026

String ordering proptest.

e77e103

stuhood force-pushed the stuhood.topn-computer-asc-upstream branch from db31584 to e77e103 Compare June 28, 2025 22:30

stuhood changed the title ~~Fix TopNComputer threshold comparison~~ Fixes for TopDocs::order_by_string_fast_field and TopNComputer Jun 28, 2025

PSeitz reviewed Jun 29, 2025

View reviewed changes

PSeitz-dd mentioned this pull request Jul 16, 2025

Fix TopNComputer for reverse order #2672

Merged

stuhood closed this Jul 16, 2025

trinity-1686a mentioned this pull request Jul 17, 2025

Fix TopDocs::order_by_string_fast_field for duplicates #2673

Open

Uh oh!

Fixes for TopDocs::order_by_string_fast_field and TopNComputer #2651

Fixes for TopDocs::order_by_string_fast_field and TopNComputer #2651

Uh oh!

Conversation

stuhood commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

fulmicoton-dd commented Jun 19, 2025

Uh oh!

stuhood commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fulmicoton commented Jun 20, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

PSeitz-dd commented Jun 24, 2025

Uh oh!

PSeitz commented Jun 24, 2025

Uh oh!

stuhood commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PSeitz Jun 29, 2025

Choose a reason for hiding this comment

Uh oh!

stuhood Jul 1, 2025

Choose a reason for hiding this comment

Uh oh!

PSeitz Jul 1, 2025

Choose a reason for hiding this comment

Uh oh!

trinity-1686a Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PSeitz Jun 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stuhood Jul 1, 2025

Choose a reason for hiding this comment

Uh oh!

PSeitz Jul 1, 2025

Choose a reason for hiding this comment

Uh oh!

PSeitz-dd Jul 11, 2025

Choose a reason for hiding this comment

Uh oh!

fulmicoton-dd commented Jul 15, 2025

Uh oh!

PSeitz commented Jul 15, 2025

Uh oh!

stuhood commented Jul 16, 2025

Uh oh!

Uh oh!

Fixes for `TopDocs::order_by_string_fast_field` and `TopNComputer` #2651

Fixes for `TopDocs::order_by_string_fast_field` and `TopNComputer` #2651

stuhood commented Jun 17, 2025 •

edited

Loading

stuhood commented Jun 19, 2025 •

edited

Loading

stuhood commented Jun 28, 2025 •

edited

Loading

trinity-1686a Jul 1, 2025 •

edited

Loading

PSeitz Jun 29, 2025 •

edited

Loading