Skip to content

Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] #692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 9, 2025

Conversation

alonre24
Copy link
Collaborator

@alonre24 alonre24 commented Jun 8, 2025

Describe the changes in the pull request

(replacing #689 so we can merge this directly to main)
This PR changes the tiered index KNN and range search implementation to properly de-duplicate query results when the two underlying indexes have the same label with different scores. This is possible in multi, butt will also be possible for indexes with compression where full size vectors are stored in flat bugger and compressed vectors are in the backend index.

To test this change, we create a tiered HNSW index (the only tiered index that exists today) and inject labels with different vectors into its two tiers.

This also includes:

  • Exposing the API for getting the number of unique labels in an index, and relaxing the locking in the tiered HNSW implementation.
  • Removal of the expected result count parameter from runTopKSearchTes and the equivalent for range search, adding a new runTopKTieredIndexSearchTest call that calls the internal search function that can search withSet and filter duplicate labels results with different scores.
  • Install the latest clang-18 in CI to the local format will suits the remote one.

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

@alonre24 alonre24 changed the title expose_topk_withset Enabler for proper de-duplication of query results in tiered index with compression Jun 8, 2025
@alonre24 alonre24 requested a review from Copilot June 8, 2025 21:01
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the query search test functions and tiered index implementation to properly de-duplicate query results by leveraging template flags and new helper functions while also updating the external API for label counts. Key changes include:

  • Removal of the expected result count parameter from runTopKSearchTest and introduction of validateTopKSearchTest.
  • Addition of templated test helper functions (runTopKTieredIndexSearchTest and runRangeTieredIndexSearchTest) for tiered index queries.
  • Updates to the tiered index (topKQueryImp/rangeQueryImp) to use a template flag (withSet) for optimized merge logic and improved locking in indexLabelCount.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/unit/unit_test_utils.h Updated function declarations; removed overloaded runTopKSearchTest, and added templated helpers.
tests/unit/unit_test_utils.cpp Refactored search test functions, moved expected result count logic to validateTopKSearchTest, and added templated implementations.
tests/unit/test_hnsw_tiered.cpp Adjusted test calls to match the new runTopKSearchTest signature.
tests/unit/test_common.cpp Added a new test (SearchDifferentScores) to verify de-duplication and proper score ordering.
tests/benchmark/types_ranges.h Reformatted macros for range parameters.
src/VecSim/vec_sim_tiered_index.h Introduced templated query functions (topKQueryImp and rangeQueryImp) with improved comments.
src/VecSim/vec_sim.h Added documentation for VecSimIndex_IndexLabelCount.
src/VecSim/vec_sim.cpp Added extern "C" wrapper for VecSimIndex_IndexLabelCount.
src/VecSim/utils/query_result_utils.h Enhanced comments in merge_result_lists to explain the withSet flag usage.
src/VecSim/algorithms/hnsw/hnsw_tiered_tests_friends.h Added a test friend declaration for the new search test.
src/VecSim/algorithms/hnsw/hnsw_tiered.h Updated indexLabelCount to use shared locks and improved locking for thread safety.
Comments suppressed due to low confidence (2)

tests/unit/unit_test_utils.cpp:100

  • Review the change from using VecSimIndex_IndexSize to VecSimIndex_IndexLabelCount for determining expected result count; ensure this accurately reflects the intended de-duplication behavior for indices with duplicate labels.
validateTopKSearchTest(index, res, k, ResCB);

src/VecSim/algorithms/hnsw/hnsw_tiered.h:698

  • Ensure that the shared lock acquisition order (flatIndexGuard, mainIndexGuard, then shared index data guard) is consistent with other parts of the codebase to avoid potential deadlocks.
this->flatIndexGuard.lock_shared();

@alonre24 alonre24 requested review from rfsaliev and GuyAv46 June 9, 2025 09:21
Copy link

codecov bot commented Jun 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.44%. Comparing base (5ba2efa) to head (ece55e3).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #692      +/-   ##
==========================================
+ Coverage   96.29%   96.44%   +0.14%     
==========================================
  Files         111      121      +10     
  Lines        6288     6719     +431     
==========================================
+ Hits         6055     6480     +425     
- Misses        233      239       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rfsaliev
Copy link
Collaborator

rfsaliev commented Jun 9, 2025

Hi!
I have the question, how the 'Singe-Vector' mode 'topKQuery{k=2}' should work in the following case:

  • let q - query vector
  • let v1; v2; v3; v4; v5 are vectors which distance to q in appropriate order: v1 is closer than v2, etc.
  • let assume, that during some index updates, we got the case when: frontend contains: { id1:v4; id2:v5 }, backend: { id1:v1; id2:v2; id3:v3 }
  • top2 frontend: {id1, id2}, top2 backend: {id1, id2}.
  • id1 and id2 in backend are outdated, so we should select vectors from frontend: {v4, v5}, but backend contains valid vector {v3} which is closer than v4 and v5, but not received from backend top2 query.

@alonre24
Copy link
Collaborator Author

alonre24 commented Jun 9, 2025

Hi! I have the question, how the 'Singe-Vector' mode 'topKQuery{k=2}' should work in the following case:

  • let q - query vector
  • let v1; v2; v3; v4; v5 are vectors which distance to q in appropriate order: v1 is closer than v2, etc.
  • let assume, that during some index updates, we got the case when: frontend contains: { id1:v4; id2:v5 }, backend: { id1:v1; id2:v2; id3:v3 }
  • top2 frontend: {id1, id2}, top2 backend: {id1, id2}.
  • id1 and id2 in backend are outdated, so we should select vectors from frontend: {v4, v5}, but backend contains valid vector {v3} which is closer than v4 and v5, but not received from backend top2 query.

This is not possible, since we mark vectors as deleted in the backend index once we are overwriting them. You cannot get the old v1 in the backend under id1

@alonre24 alonre24 requested a review from GuyAv46 June 9, 2025 16:04
@alonre24 alonre24 added this pull request to the merge queue Jun 9, 2025
Merged via the queue into main with commit ca1137a Jun 9, 2025
17 checks passed
@alonre24 alonre24 deleted the expose_topk_withset branch June 9, 2025 19:04
@alonre24 alonre24 changed the title Enabler for proper de-duplication of query results in tiered index with compression Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] Jun 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants