Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] #692

alonre24 · 2025-06-08T20:31:46Z

Describe the changes in the pull request

(replacing #689 so we can merge this directly to main)
This PR changes the tiered index KNN and range search implementation to properly de-duplicate query results when the two underlying indexes have the same label with different scores. This is possible in multi, butt will also be possible for indexes with compression where full size vectors are stored in flat bugger and compressed vectors are in the backend index.

To test this change, we create a tiered HNSW index (the only tiered index that exists today) and inject labels with different vectors into its two tiers.

This also includes:

Exposing the API for getting the number of unique labels in an index, and relaxing the locking in the tiered HNSW implementation.
Removal of the expected result count parameter from runTopKSearchTes and the equivalent for range search, adding a new runTopKTieredIndexSearchTest call that calls the internal search function that can search withSet and filter duplicate labels results with different scores.
Install the latest clang-18 in CI to the local format will suits the remote one.

Mark if applicable

This PR introduces API changes
This PR introduces serialization changes

…ions, allowing inheriting classes to control whether the merge should use sets.

…ent scores in indexes and validate results

Copilot

Pull Request Overview

This PR refactors the query search test functions and tiered index implementation to properly de-duplicate query results by leveraging template flags and new helper functions while also updating the external API for label counts. Key changes include:

Removal of the expected result count parameter from runTopKSearchTest and introduction of validateTopKSearchTest.
Addition of templated test helper functions (runTopKTieredIndexSearchTest and runRangeTieredIndexSearchTest) for tiered index queries.
Updates to the tiered index (topKQueryImp/rangeQueryImp) to use a template flag (withSet) for optimized merge logic and improved locking in indexLabelCount.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/unit/unit_test_utils.h	Updated function declarations; removed overloaded runTopKSearchTest, and added templated helpers.
tests/unit/unit_test_utils.cpp	Refactored search test functions, moved expected result count logic to validateTopKSearchTest, and added templated implementations.
tests/unit/test_hnsw_tiered.cpp	Adjusted test calls to match the new runTopKSearchTest signature.
tests/unit/test_common.cpp	Added a new test (SearchDifferentScores) to verify de-duplication and proper score ordering.
tests/benchmark/types_ranges.h	Reformatted macros for range parameters.
src/VecSim/vec_sim_tiered_index.h	Introduced templated query functions (topKQueryImp and rangeQueryImp) with improved comments.
src/VecSim/vec_sim.h	Added documentation for VecSimIndex_IndexLabelCount.
src/VecSim/vec_sim.cpp	Added extern "C" wrapper for VecSimIndex_IndexLabelCount.
src/VecSim/utils/query_result_utils.h	Enhanced comments in merge_result_lists to explain the withSet flag usage.
src/VecSim/algorithms/hnsw/hnsw_tiered_tests_friends.h	Added a test friend declaration for the new search test.
src/VecSim/algorithms/hnsw/hnsw_tiered.h	Updated indexLabelCount to use shared locks and improved locking for thread safety.

Comments suppressed due to low confidence (2)

tests/unit/unit_test_utils.cpp:100

Review the change from using VecSimIndex_IndexSize to VecSimIndex_IndexLabelCount for determining expected result count; ensure this accurately reflects the intended de-duplication behavior for indices with duplicate labels.

validateTopKSearchTest(index, res, k, ResCB);

src/VecSim/algorithms/hnsw/hnsw_tiered.h:698

Ensure that the shared lock acquisition order (flatIndexGuard, mainIndexGuard, then shared index data guard) is consistent with other parts of the codebase to avoid potential deadlocks.

this->flatIndexGuard.lock_shared();

codecov · 2025-06-09T10:08:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.44%. Comparing base (5ba2efa) to head (ece55e3).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #692      +/-   ##
==========================================
+ Coverage   96.29%   96.44%   +0.14%     
==========================================
  Files         111      121      +10     
  Lines        6288     6719     +431     
==========================================
+ Hits         6055     6480     +425     
- Misses        233      239       +6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

rfsaliev · 2025-06-09T13:20:26Z

Hi!
I have the question, how the 'Singe-Vector' mode 'topKQuery{k=2}' should work in the following case:

let q - query vector
let v1; v2; v3; v4; v5 are vectors which distance to q in appropriate order: v1 is closer than v2, etc.
let assume, that during some index updates, we got the case when: frontend contains: { id1:v4; id2:v5 }, backend: { id1:v1; id2:v2; id3:v3 }
top2 frontend: {id1, id2}, top2 backend: {id1, id2}.
id1 and id2 in backend are outdated, so we should select vectors from frontend: {v4, v5}, but backend contains valid vector {v3} which is closer than v4 and v5, but not received from backend top2 query.

alonre24 · 2025-06-09T15:13:14Z

Hi! I have the question, how the 'Singe-Vector' mode 'topKQuery{k=2}' should work in the following case:

let q - query vector

let v1; v2; v3; v4; v5 are vectors which distance to q in appropriate order: v1 is closer than v2, etc.

let assume, that during some index updates, we got the case when: frontend contains: { id1:v4; id2:v5 }, backend: { id1:v1; id2:v2; id3:v3 }

top2 frontend: {id1, id2}, top2 backend: {id1, id2}.

id1 and id2 in backend are outdated, so we should select vectors from frontend: {v4, v5}, but backend contains valid vector {v3} which is closer than v4 and v5, but not received from backend top2 query.

This is not possible, since we mark vectors as deleted in the backend index once we are overwriting them. You cannot get the old v1 in the backend under id1

src/VecSim/vec_sim_tiered_index.h

tests/unit/test_common.cpp

meiravgri and others added 5 commits June 8, 2025 14:12

topKImp

c4c74a8

In TieredIndex: move topKQuery and rangeQuery to templated *Imp funct…

a4dc95d

…ions, allowing inheriting classes to control whether the merge should use sets.

Add test with engineered tiered HNSW that has same labels with differ…

049e665

…ent scores in indexes and validate results

Simplify test and format properly

779449d

revert clang18

56cbd83

alonre24 changed the title ~~expose_topk_withset~~ Enabler for proper de-duplication of query results in tiered index with compression Jun 8, 2025

alonre24 requested a review from Copilot June 8, 2025 21:01

Copilot AI reviewed Jun 8, 2025

View reviewed changes

alonre24 added 3 commits June 9, 2025 11:57

reformat

206c218

Install latest clang-18 version in ci

1eb51be

remove leftover

2f15ba1

alonre24 requested review from rfsaliev and GuyAv46 June 9, 2025 09:21

GuyAv46 reviewed Jun 9, 2025

View reviewed changes

src/VecSim/vec_sim_tiered_index.h Show resolved Hide resolved

tests/unit/test_common.cpp Outdated Show resolved Hide resolved

address CR

ece55e3

alonre24 requested a review from GuyAv46 June 9, 2025 16:04

GuyAv46 approved these changes Jun 9, 2025

View reviewed changes

alonre24 added this pull request to the merge queue Jun 9, 2025

Merged via the queue into main with commit ca1137a Jun 9, 2025
17 checks passed

alonre24 deleted the expose_topk_withset branch June 9, 2025 19:04

alonre24 mentioned this pull request Jun 10, 2025

Enabler for proper de-duplication of query results in tiered index with compression #689

Closed

2 tasks

alonre24 changed the title ~~Enabler for proper de-duplication of query results in tiered index with compression~~ Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] Jun 10, 2025

alonre24 mentioned this pull request Jun 10, 2025

[SVS] Add Tiered SVS index implementation #631

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] #692

Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] #692

Uh oh!

alonre24 commented Jun 8, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

codecov bot commented Jun 9, 2025 •

edited

Loading

Uh oh!

rfsaliev commented Jun 9, 2025

Uh oh!

alonre24 commented Jun 9, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] #692

Enabler for proper de-duplication of query results in tiered index with compression [MOD-10062] #692

Uh oh!

Conversation

alonre24 commented Jun 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

codecov bot commented Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rfsaliev commented Jun 9, 2025

Uh oh!

alonre24 commented Jun 9, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alonre24 commented Jun 8, 2025 •

edited

Loading

codecov bot commented Jun 9, 2025 •

edited

Loading