-
Notifications
You must be signed in to change notification settings - Fork 19
Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] #692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ions, allowing inheriting classes to control whether the merge should use sets.
…ent scores in indexes and validate results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the query search test functions and tiered index implementation to properly de-duplicate query results by leveraging template flags and new helper functions while also updating the external API for label counts. Key changes include:
- Removal of the expected result count parameter from runTopKSearchTest and introduction of validateTopKSearchTest.
- Addition of templated test helper functions (runTopKTieredIndexSearchTest and runRangeTieredIndexSearchTest) for tiered index queries.
- Updates to the tiered index (topKQueryImp/rangeQueryImp) to use a template flag (withSet) for optimized merge logic and improved locking in indexLabelCount.
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
tests/unit/unit_test_utils.h | Updated function declarations; removed overloaded runTopKSearchTest, and added templated helpers. |
tests/unit/unit_test_utils.cpp | Refactored search test functions, moved expected result count logic to validateTopKSearchTest, and added templated implementations. |
tests/unit/test_hnsw_tiered.cpp | Adjusted test calls to match the new runTopKSearchTest signature. |
tests/unit/test_common.cpp | Added a new test (SearchDifferentScores) to verify de-duplication and proper score ordering. |
tests/benchmark/types_ranges.h | Reformatted macros for range parameters. |
src/VecSim/vec_sim_tiered_index.h | Introduced templated query functions (topKQueryImp and rangeQueryImp) with improved comments. |
src/VecSim/vec_sim.h | Added documentation for VecSimIndex_IndexLabelCount. |
src/VecSim/vec_sim.cpp | Added extern "C" wrapper for VecSimIndex_IndexLabelCount. |
src/VecSim/utils/query_result_utils.h | Enhanced comments in merge_result_lists to explain the withSet flag usage. |
src/VecSim/algorithms/hnsw/hnsw_tiered_tests_friends.h | Added a test friend declaration for the new search test. |
src/VecSim/algorithms/hnsw/hnsw_tiered.h | Updated indexLabelCount to use shared locks and improved locking for thread safety. |
Comments suppressed due to low confidence (2)
tests/unit/unit_test_utils.cpp:100
- Review the change from using VecSimIndex_IndexSize to VecSimIndex_IndexLabelCount for determining expected result count; ensure this accurately reflects the intended de-duplication behavior for indices with duplicate labels.
validateTopKSearchTest(index, res, k, ResCB);
src/VecSim/algorithms/hnsw/hnsw_tiered.h:698
- Ensure that the shared lock acquisition order (flatIndexGuard, mainIndexGuard, then shared index data guard) is consistent with other parts of the codebase to avoid potential deadlocks.
this->flatIndexGuard.lock_shared();
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #692 +/- ##
==========================================
+ Coverage 96.29% 96.44% +0.14%
==========================================
Files 111 121 +10
Lines 6288 6719 +431
==========================================
+ Hits 6055 6480 +425
- Misses 233 239 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Hi!
|
This is not possible, since we mark vectors as deleted in the backend index once we are overwriting them. You cannot get the old |
Describe the changes in the pull request
(replacing #689 so we can merge this directly to main)
This PR changes the tiered index KNN and range search implementation to properly de-duplicate query results when the two underlying indexes have the same label with different scores. This is possible in multi, butt will also be possible for indexes with compression where full size vectors are stored in flat bugger and compressed vectors are in the backend index.
To test this change, we create a tiered HNSW index (the only tiered index that exists today) and inject labels with different vectors into its two tiers.
This also includes:
runTopKTieredIndexSearchTest
call that calls the internal search function that can searchwithSet
and filter duplicate labels results with different scores.Mark if applicable