addressing non-vector search issues #400
Signed-off-by: Yair Gottdenker <yairg@google.com>
```cpp
  return true;
};

EvaluatePrefilteredKeys(parameters, entries_fetchers,
```
Just to clarify the ordering:
- We first perform the search using `EvaluateFilterAsPrimary`, which creates the entries fetchers. We know this has an issue with AND (dropping lhs or rhs) and OR (duplicate keys).
- Then we call `EvaluatePrefilteredKeys`, which uses an InlineSearch (now called `PrefilterEvaluator`). This uses a `flat_hash_set` to prevent duplicate keys.
- Later on, during response collection, we call `VerifyFilter`, which uses the `PredicateEvaluator`.
One question I have is how the AND predicate issue in `EvaluateFilterAsPrimary` is addressed. Is it possible that we don't include one of the lhs/rhs?
```cpp
if (predicate_type == PredicateType::kComposedAnd) {
  if (lhs < rhs) {
    AppendQueue(entries_fetchers, lhs_entries_fetchers);
    return lhs;
  }
  AppendQueue(entries_fetchers, rhs_entries_fetchers);
  return rhs;
}
```
We know this has an issue with AND (dropping lhs or rhs) and OR (duplicates keys).
Just to be sure we're on the same page: this is how it is designed to work. The design keeps the decision of whether to perform inline-filtering or pre-filtering lightweight by avoiding the overhead of fully executing the pre-filter; instead it produces a lightweight estimate, an upper bound, of the entries matching the filter. If it were to fully apply the pre-filter, the overhead of the decision making would become significantly higher.
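To make the trade-off concrete, here is a minimal sketch (hypothetical function names, not the actual implementation) of why such an estimate is cheap: for AND, any matching entry must satisfy both sides, so the smaller side's cardinality is an upper bound; for OR, the sum of both sides is an upper bound, even though it may double-count keys matching both sides.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: a lightweight upper bound on the number of entries
// matching a composed predicate, computed from per-side counts alone,
// without executing the pre-filter.
std::size_t EstimateUpperBoundAnd(std::size_t lhs_count, std::size_t rhs_count) {
  // Any AND match satisfies both sides, so the smaller side bounds the result.
  return std::min(lhs_count, rhs_count);
}

std::size_t EstimateUpperBoundOr(std::size_t lhs_count, std::size_t rhs_count) {
  // May double-count keys matching both sides; still a valid upper bound.
  return lhs_count + rhs_count;
}
```

Because both estimates use only the per-index counts (which, per the clarifications below, each index returns cheaply), the inline-vs-pre-filter decision never pays the cost of materializing the matching key set.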
Then we call `EvaluatePrefilteredKeys`, which uses an InlineSearch (now called `PrefilterEvaluator`). This uses a `flat_hash_set` to prevent duplicate keys.
Clarifications:
- In addition to preventing duplications, it also evaluates every entry returned by `EvaluateFilterAsPrimary` against the filter. As elaborated below, `VerifyFilter` addresses a very specific issue and should not be used to do the filtering, but rather the bare minimum amount of work possible. This is because `VerifyFilter` is called by the main thread, which is much less effective, rather than by the worker threads.
- For the non-vector queries, it's important to keep in mind that the search subsystem is designed to support this use-case very efficiently:
  - Each non-vector index type is designed to return the number of qualifying entries efficiently, without the need to iterate over subsets of entries. For instance, the Tags index is implemented as a balanced statistic tree where each node maintains its child count.
  - Each non-vector index type exposes a search interface which returns the number of qualifying entries and a fetcher iterator. A fetcher iterator is very lightweight: it doesn't maintain the set of qualifying entries, but rather just a way to fetch them through iteration.
  - `EvaluateFilterAsPrimary` returns the set of fetcher iterators matching the sub-filter with the minimum cardinality. For instance, for `(a | b) & (c | d)`, the returned set references `a`, `b` if `|a| + |b| < |c| + |d|`, and vice versa.
  - Evaluating the minimal number of entries, by iterating over the returned fetcher iterators, ensures the least amount of evaluation overhead.
- `Search` calls `MaybeAddIndexedContent` now. See my third bullet point in the PR description.
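The minimum-cardinality selection described above can be sketched as follows (hypothetical types and names, assuming per-branch counts are available from the index search interface): pick the OR branch of `(a | b) & (c | d)` with the smaller combined cardinality and iterate only its fetchers, evaluating each fetched entry against the full filter.

```cpp
#include <cstddef>

// Hypothetical sketch: each OR branch reports the combined count of its
// fetcher iterators; we iterate the cheaper branch and verify the other
// side of the AND per entry during iteration.
struct OrBranch {
  std::size_t lhs_count = 0;
  std::size_t rhs_count = 0;
  std::size_t Cardinality() const { return lhs_count + rhs_count; }
};

// Returns 0 to iterate the left OR branch, 1 to iterate the right one.
int PickBranchToIterate(const OrBranch& left, const OrBranch& right) {
  return left.Cardinality() < right.Cardinality() ? 0 : 1;
}
```

The key point from the comment above is that this decision needs only the counts, which each non-vector index returns without iterating its entries.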
Later on during response collection, we call VerifyFilter which uses the PredicateEvaluator.
The only reason `VerifyFilter` is called is to ensure the validity of the returned entries, accounting for keyspace mutations during query execution. It is important to note that it is only called if the query's return clause contains attributes which are not indexed. An issue arises when a response contains entries whose attribute values were modified during query execution; in that case the mutations have been recorded by the keyspace but not yet by the indexes. This may result in returning attribute values, fresh from the keyspace, which the filter doesn't match.
As an example: consider a hash entry containing two attributes, `f1` and `f2`, where only `f1` is indexed, say as a numeric field. A query uses a pre-filter with `f1` equal to the value `v1_1` and specifies `f2` as the returned attribute. Just before the query is executed, the value of `f1` was `v1_1` while the value of `f2` was `v2_1`; this state is captured by both the keyspace and the indexes at this time. During query execution, these values are changed to `v1_2` and `v2_2`. If we were to just return the entry with the value `v2_2`, we would have returned an invalid entry, because there was no point in time at which `f1` had the value `v1_1` while `f2` had the value `v2_2`.
To address this, the process involves fetching all the attributes specified by either the filter or the return clause from the keyspace. The keyspace may hold a newer version of the fetched attributes which doesn't match the filter, so we re-evaluate the entries against the filter, by calling `VerifyFilter`, with the most up-to-date attribute values before responding.
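A minimal sketch of this verification step, using the `f1`/`f2` example above (the keyspace type and function names here are hypothetical stand-ins, not the actual API): re-fetch the filtered attribute from the keyspace and re-check it against the filter, so an entry mutated mid-query is dropped rather than returned with a value the filter never matched.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the keyspace: key -> attribute -> value.
using Keyspace =
    std::unordered_map<std::string,
                       std::unordered_map<std::string, std::string>>;

// Re-evaluate the filter `f1 == expected_f1` against the freshest keyspace
// values. Returns false if the key was deleted or f1 was mutated mid-query.
bool VerifyEntry(const Keyspace& keyspace, const std::string& key,
                 const std::string& expected_f1) {
  auto it = keyspace.find(key);
  if (it == keyspace.end()) return false;  // key deleted during the query
  auto attr = it->second.find("f1");
  return attr != it->second.end() && attr->second == expected_f1;
}
```

In the example above, the keyspace now holds `f1 = v1_2`, so `VerifyEntry` rejects the entry and the stale `(v1_1, v2_2)` combination is never exposed to the client.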
```cpp
vector_index->AddPrefilteredKey(parameters.query, parameters.k, *key,
                                results, top_keys);
const auto &key = **iterator;
// TODO: add a bloom filter to ensure distinct keys are evaluated
```
Isn't the comment wrong now?
No, it's still valid. `result_keys` contains the keys which are currently tracked by the appender result set. But it might be that a key matched the filter, was added to the appender result set, and was later removed (with vector search). The bloom filter would make sure that keys which were removed from the result set are skipped (avoiding the need to evaluate the key and later skip appending it).
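Since the TODO only proposes the idea, here is one possible shape of such a filter (a hypothetical sketch, not the planned implementation): keys evicted from the top-k result set are recorded, so a later candidate that was already evaluated and evicted can be skipped without re-evaluation. Note that a Bloom filter can report false positives, so a production version would need to tolerate or guard against occasionally skipping a key that was never evicted.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical sketch of the proposed bloom filter over evicted keys.
// Two derived hash positions per key; no deletions, no resizing.
class EvictedKeyFilter {
 public:
  void Add(const std::string& key) {
    bits_.set(Hash1(key));
    bits_.set(Hash2(key));
  }
  // May return true for a key never added (false positive); never returns
  // false for a key that was added.
  bool MaybeContains(const std::string& key) const {
    return bits_.test(Hash1(key)) && bits_.test(Hash2(key));
  }

 private:
  static constexpr std::size_t kBits = 1024;
  static std::size_t Hash1(const std::string& k) {
    return std::hash<std::string>{}(k) % kBits;
  }
  static std::size_t Hash2(const std::string& k) {
    return (std::hash<std::string>{}(k) / kBits) % kBits;
  }
  std::bitset<kBits> bits_;
};
```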
Signed-off-by: Nivesh <110855046+Nivesh-01@users.noreply.github.com>
Co-authored-by: Yair Gottdenker <yairg@google.com> Signed-off-by: Nivesh <110855046+Nivesh-01@users.noreply.github.com>
PR #416 replaces this PR, same code, but updated to fix merge conflicts.
The PR fixes multiple issues in the non-vector search code flow:
- `MaybeAddIndexedContent` function. This ensures the fetching of attribute values (for returned fields) is done by the worker threads using the values stored in the indexes. This boosts query performance by avoiding the need to fetch the attribute values on the main thread when possible.

In addition, the PR contains the following cosmetic changes:
- `EvaluatePrefilteredKeys` now supports both non-vector search and vector pre-filter flows.
- `VectorSearchParameters` has been renamed to `SearchParameters`.
- `InlineVectorEvaluator` has been renamed to `PrefilterEvaluator`.