-
-
Notifications
You must be signed in to change notification settings - Fork 781
Fixes for TopDocs::order_by_string_fast_field
and TopNComputer
#2651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for TopDocs::order_by_string_fast_field
and TopNComputer
#2651
Conversation
The `TopNComputer` was refactored in 0e04ec3 to support both ascending and descending order. But that refactor did not adjust the `TopNComputer::threshold` to include the `ComparableDoc`, which is what provides the relevant `Ord` implementation. Add a simple failing test, make it pass, and then expand the tests into proptests (which are already in use elsewhere in the codebase). Upstream at: quickwit-oss#2651
@stuhood I don't understand what this is about. Can you improve the description? In particular, the title suggests this is a bugfix... Is this really a bugfix? It then points to commits that are not in tantivy. |
Yes, it is a bugfix. Have improved the description. I don't quite understand why that commit isn't rendering as part of Tantivy, but: it is #2198 ... which claims to have been merged as dd7e745 ... which the github UI claims is not in Tantivy? Strange. |
@PSeitz can you review? |
Can you add a high level test, that would have failed? |
I fixed the failing CI |
db31584
to
e77e103
Compare
TopNComputer
threshold comparisonTopDocs::order_by_string_fast_field
and TopNComputer
Have rebased and pushed two more commits:
|
// | ||
// this allows the input TermOrdinal iterator to contain duplicates, so long as it's | ||
// still sorted | ||
if Some(ord) != prev_ord { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the logic and may cause some unexpected surprises
I think it would be better for the caller to filter duplicates in the incoming iterator, e.g. via dedup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the logic and may cause some unexpected surprises
Duplicates were not supported before (you'd get an assertion failure), so it should only make a case work that didn't before? I can add additional tests if that would help.
I think it would be better for the caller to filter duplicates in the incoming iterator, e.g. via dedup
That makes it more difficult for consumers to then zip the results back to a collection: they need a mapping table or HashMap
to look up offsets, rather than being able to literally zip
them, as we do here:
tantivy/src/collector/top_score_collector.rs
Lines 230 to 232 in 988c2b3
terms | |
.into_iter() | |
.zip(top_ordinals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which assertion? So this one is not strict enough? assert!(ord >= current_ordinal);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert!(ord >= current_ordinal);
would be the assertion that failed before. At the end of the main loop, current_ordinal is set to ord + 1
, not ord. This is an artifact of how the sub loop works with current_ordinal..=ord
, a range which isn't empty when current_ordinal==ord
, and/or that current_ordinal is initialized to 0 without calling advance (so the current position is more like minus 1).
I think current_ordinal
is a miss-nommer, it isn't the ordinal of the entry we have in bytes
, it's the ordinal of the entry we will have after calling current_sstable_delta_reader.advance()
.
this change does fix the issue, but in a non-obvious way: if we pull for 4 (chosen by a fair dice roll) two times, the first iteration will bring current_ordinal
to 5, the second will skip the assertion due to the newly added if, and do a for over 5..=4
which is an empty range, so everything works out. The fix has nothing to do with not calling get_block_with_ord()
though, that function is stateless, it just skip the assertion in the specific case where it's too restrictive.
I think a better fix is to rename current_ordinal
to next_ordinal
, because that's what it is, and change the assertion to assert!(ord+1 >= next_ordinal);
. There is probably an alternative fix where current_ordinal can keep its name, and current_sstable_delta_reader.advance()
is called once just after initializing the delta reader, but i have a hard time wrapping my head around whether it's sound (not in the rust unsafety sense) when loading a new block
@@ -913,7 +920,7 @@ pub struct TopNComputer<Score, D, const REVERSE_ORDER: bool = true> { | |||
/// The buffer reverses sort order to get top-semantics instead of bottom-semantics | |||
buffer: Vec<ComparableDoc<Score, D, REVERSE_ORDER>>, | |||
top_n: usize, | |||
pub(crate) threshold: Option<Score>, | |||
pub(crate) threshold: Option<ComparableDoc<Score, D, REVERSE_ORDER>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TopNComputer
is quite performance sensitive. I think this changes the memory layout of the struct. Did you test the performance of the change?
It may cause a performance regression in TopDocs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some benchmarks in #2646: are you able to help me get that one landed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually bench using https://github.com/quickwit-oss/search-benchmark-game. These benchmarks are highlevel enough to give a good idea of the performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#2642 added
TopDocs::order_by_string_fast_field
, but the test coverage in that PR did not catch two bugs:TopNComputer
was refactored in feat(aggregators/metric): Add a top_hits aggregator #2198 to support both ascending and descending order. But that refactor did not adjust theTopNComputer::threshold
to include theComparableDoc
, which is what provides the relevantOrd
implementation.TopNComputer
was still hardcoded for descending order. Unit tests which tested with fewer thanlimit * 2
items OR which didn't have an item less than the threshold arrive after the firstlimit * 2
items (when the threshold is computed for the first time) would not expose the bug.sorted_ords_to_term_cb
only supported uniqueords
, likely because its previous consumers in aggregates would always de-dupe ords before lookup.