Skip to content

Fixes for TopDocs::order_by_string_fast_field and TopNComputer #2651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

stuhood
Copy link
Contributor

@stuhood stuhood commented Jun 17, 2025

#2642 added TopDocs::order_by_string_fast_field, but the test coverage in that PR did not catch two bugs:

  1. The TopNComputer was refactored in feat(aggregators/metric): Add a top_hits aggregator #2198 to support both ascending and descending order. But that refactor did not adjust the TopNComputer::threshold to include the ComparableDoc, which is what provides the relevant Ord implementation.
    • The effect of this was that the threshold comparison for TopNComputer was still hardcoded for descending order. Unit tests which tested with fewer than limit * 2 items OR which didn't have an item less than the threshold arrive after the first limit * 2 items (when the threshold is computed for the first time) would not expose the bug.
  2. sorted_ords_to_term_cb only supported unique ords, likely because its previous consumers in aggregates would always de-dupe ords before lookup.
    • It is much easier for consumers who would like to look up non-unique term ordinals to allow them to be duplicated (as it introduces only two comparisons in a path that is always IO bound).

stuhood added a commit to paradedb/tantivy that referenced this pull request Jun 17, 2025
The `TopNComputer` was refactored in 0e04ec3 to support both ascending and descending order. But that refactor did not adjust the `TopNComputer::threshold` to include the `ComparableDoc`, which is what provides the relevant `Ord` implementation.

Add a simple failing test, make it pass, and then expand the tests into proptests (which are already in use elsewhere in the codebase).

Upstream at: quickwit-oss#2651
@fulmicoton-dd
Copy link
Collaborator

@stuhood I don't understand what this is about. Can you improve the description? In particular, the title suggests this is a bugfix... Is this really a bugfix? It then points to commits that are not in tantivy.

@stuhood
Copy link
Contributor Author

stuhood commented Jun 19, 2025

@stuhood I don't understand what this is about. Can you improve the description? In particular, the title suggests this is a bugfix... Is this really a bugfix? It then points to commits that are not in tantivy.

Yes, it is a bugfix. Have improved the description.

I don't quite understand why that commit isn't rendering as part of Tantivy, but: it is #2198 ... which claims to have been merged as dd7e745 ... which the github UI claims is not in Tantivy? Strange.

@fulmicoton fulmicoton requested a review from PSeitz June 20, 2025 08:32
@fulmicoton
Copy link
Collaborator

@PSeitz can you review?

@PSeitz-dd
Copy link
Contributor

Can you add a high level test, that would have failed?

@PSeitz
Copy link
Collaborator

PSeitz commented Jun 24, 2025

I fixed the failing CI

@stuhood stuhood requested a review from PSeitz-dd June 27, 2025 04:23
@stuhood stuhood force-pushed the stuhood.topn-computer-asc-upstream branch from db31584 to e77e103 Compare June 28, 2025 22:30
@stuhood stuhood changed the title Fix TopNComputer threshold comparison Fixes for TopDocs::order_by_string_fast_field and TopNComputer Jun 28, 2025
@stuhood
Copy link
Contributor Author

stuhood commented Jun 28, 2025

Have rebased and pushed two more commits:

  • Allowing sorted_ords_to_term_cb to handle duplicate (but still sorted) ords. We were using this patch downstream to allow for string sorting, but it hadn't made its way upstream, and caused the TopDocs::order_by_string_fast_field method added in Add string fast field support to TopDocs. #2642 to be broken for duplicate matches.
  • Adding a proptest for order_by_string_fast_field, which caught the above issue, as well as reproducing the issue with TopNComputer.

//
// this allows the input TermOrdinal iterator to contain duplicates, so long as it's
// still sorted
if Some(ord) != prev_ord {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the logic and may cause some unexpected surprises

I think it would be better for the caller to filter duplicates in the incoming iterator, e.g. via dedup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the logic and may cause some unexpected surprises

Duplicates were not supported before (you'd get an assertion failure), so it should only make a case work that didn't before? I can add additional tests if that would help.

I think it would be better for the caller to filter duplicates in the incoming iterator, e.g. via dedup

That makes it more difficult for consumers to then zip the results back to a collection: they need a mapping table or HashMap to look up offsets, rather than being able to literally zip them, as we do here:

terms
.into_iter()
.zip(top_ordinals)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which assertion? So this one is not strict enough? assert!(ord >= current_ordinal);

cc @trinity-1686a

Copy link
Contributor

@trinity-1686a trinity-1686a Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert!(ord >= current_ordinal); would be the assertion that failed before. At the end of the main loop, current_ordinal is set to ord + 1, not ord. This is an artifact of how the sub loop works with current_ordinal..=ord, a range which isn't empty when current_ordinal==ord, and/or that current_ordinal is initialized to 0 without calling advance (so the current position is more like minus 1).

I think current_ordinal is a miss-nommer, it isn't the ordinal of the entry we have in bytes, it's the ordinal of the entry we will have after calling current_sstable_delta_reader.advance().

this change does fix the issue, but in a non-obvious way: if we pull for 4 (chosen by a fair dice roll) two times, the first iteration will bring current_ordinal to 5, the second will skip the assertion due to the newly added if, and do a for over 5..=4 which is an empty range, so everything works out. The fix has nothing to do with not calling get_block_with_ord() though, that function is stateless, it just skip the assertion in the specific case where it's too restrictive.

I think a better fix is to rename current_ordinal to next_ordinal, because that's what it is, and change the assertion to assert!(ord+1 >= next_ordinal);. There is probably an alternative fix where current_ordinal can keep its name, and current_sstable_delta_reader.advance() is called once just after initializing the delta reader, but i have a hard time wrapping my head around whether it's sound (not in the rust unsafety sense) when loading a new block

@@ -913,7 +920,7 @@ pub struct TopNComputer<Score, D, const REVERSE_ORDER: bool = true> {
/// The buffer reverses sort order to get top-semantics instead of bottom-semantics
buffer: Vec<ComparableDoc<Score, D, REVERSE_ORDER>>,
top_n: usize,
pub(crate) threshold: Option<Score>,
pub(crate) threshold: Option<ComparableDoc<Score, D, REVERSE_ORDER>>,
Copy link
Collaborator

@PSeitz PSeitz Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TopNComputer is quite performance sensitive. I think this changes the memory layout of the struct. Did you test the performance of the change?

It may cause a performance regression in TopDocs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some benchmarks in #2646: are you able to help me get that one landed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually bench using https://github.com/quickwit-oss/search-benchmark-game. These benchmarks are highlevel enough to give a good idea of the performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the benchmark and this causes a performance regression

Screenshot 2025-07-11 at 15 30 09

@fulmicoton-dd
Copy link
Collaborator

@PSeitz @stuhood
If this is a bug in main we need to fix it right away.
If the PR is not ready for merge, let's revert the PR introducing the bug

@PSeitz
Copy link
Collaborator

PSeitz commented Jul 15, 2025

@PSeitz @stuhood If this is a bug in main we need to fix it right away. If the PR is not ready for merge, let's revert the PR introducing the bug

As I understand it, this is a bug that exists since tantivy 0.22 and ordering by a fast field in ASC order is affected

@stuhood
Copy link
Contributor Author

stuhood commented Jul 16, 2025

@PSeitz : Thanks for opening #2672. Have broken out #2673 for the rest of the fix here.

@stuhood stuhood closed this Jul 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants