addressing non-vector search issues #400
Signed-off-by: Yair Gottdenker <yairg@google.com>
```cpp
  return true;
};

EvaluatePrefilteredKeys(parameters, entries_fetchers,
```
Just to clarify the ordering:
- We first perform the search using `EvaluateFilterAsPrimary`, which creates the entries fetchers. We know this has an issue with AND (dropping lhs or rhs) and OR (duplicate keys).
- Then we call `EvaluatePrefilteredKeys`, which uses an InlineSearch (now called `PrefilterEvaluator`). This uses a `flat_hash_set` to prevent duplicate keys.
- Later on, during response collection, we call `VerifyFilter`, which uses the `PredicateEvaluator`.
One question I have is how the AND predicate issue in `EvaluateFilterAsPrimary` is addressed. Is it possible that we don't include one of the lhs/rhs?
```cpp
if (predicate_type == PredicateType::kComposedAnd) {
  if (lhs < rhs) {
    AppendQueue(entries_fetchers, lhs_entries_fetchers);
    return lhs;
  }
  AppendQueue(entries_fetchers, rhs_entries_fetchers);
  return rhs;
}
```
We know this has an issue with AND (dropping lhs or rhs) and OR (duplicates keys).
Just to be sure we're on the same page: this is how it is designed to work. The design keeps the decision of whether to perform inline-filtering or pre-filtering lightweight by avoiding the overhead of fully executing the pre-filter; instead it produces a lightweight estimate, an upper bound, of the entries matching the filter. If it were to fully apply the pre-filter, the overhead of the decision making would become significantly higher.
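To make the trade-off concrete, here is a minimal sketch (hypothetical function names, not the actual implementation) of why such an estimate is cheap: for AND, any matching entry must satisfy both sides, so the smaller side's cardinality is an upper bound; for OR, the sum of both sides is an upper bound, even though it may double-count keys matching both sides.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: a lightweight upper bound on the number of entries
// matching a composed predicate, computed from per-side counts alone,
// without executing the pre-filter.
std::size_t EstimateUpperBoundAnd(std::size_t lhs_count, std::size_t rhs_count) {
  // Any AND match satisfies both sides, so the smaller side bounds the result.
  return std::min(lhs_count, rhs_count);
}

std::size_t EstimateUpperBoundOr(std::size_t lhs_count, std::size_t rhs_count) {
  // May double-count keys matching both sides; still a valid upper bound.
  return lhs_count + rhs_count;
}
```

Because both estimates use only the per-index counts (which, per the clarifications below, each index returns cheaply), the inline-vs-pre-filter decision never pays the cost of materializing the matching key set.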
Then we call `EvaluatePrefilteredKeys`, which uses an InlineSearch (now called `PrefilterEvaluator`). This uses a `flat_hash_set` to prevent duplicate keys.
Clarifications:
- In addition to preventing duplications, it also evaluates every entry returned by `EvaluateFilterAsPrimary` against the filter. As elaborated below, `VerifyFilter` addresses a very specific issue and should not be used to do the filtering, but rather the bare minimum amount of work possible. This is because `VerifyFilter` is called by the main thread, which is much less effective, rather than by the worker threads.
- For the non-vector queries, it's important to keep in mind that the search subsystem is designed to support this use-case very efficiently:
  - Each non-vector index type is designed to return the number of qualifying entries efficiently, without the need to iterate over subsets of entries. For instance, the Tags index is implemented as a balanced statistic tree where each node maintains its child count.
  - Each non-vector index type exposes a search interface which returns the number of qualifying entries and a fetcher iterator. A fetcher iterator is very lightweight: it doesn't maintain the set of qualifying entries, but rather just a way to fetch them through iteration.
  - `EvaluateFilterAsPrimary` returns the set of fetcher iterators matching the sub-filter with the minimum cardinality. For instance, for `(a | b) & (c | d)`, the returned set references `a`, `b` if `|a| + |b| < |c| + |d|`, and vice versa.
  - Evaluating the minimal number of entries, by iterating over the returned fetcher iterators, ensures the least amount of evaluation overhead.
- `Search` calls `MaybeAddIndexedContent` now. See my third bullet point in the PR description.
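The minimum-cardinality selection described above can be sketched as follows (hypothetical types and names, assuming per-branch counts are available from the index search interface): pick the OR branch of `(a | b) & (c | d)` with the smaller combined cardinality and iterate only its fetchers, evaluating each fetched entry against the full filter.

```cpp
#include <cstddef>

// Hypothetical sketch: each OR branch reports the combined count of its
// fetcher iterators; we iterate the cheaper branch and verify the other
// side of the AND per entry during iteration.
struct OrBranch {
  std::size_t lhs_count = 0;
  std::size_t rhs_count = 0;
  std::size_t Cardinality() const { return lhs_count + rhs_count; }
};

// Returns 0 to iterate the left OR branch, 1 to iterate the right one.
int PickBranchToIterate(const OrBranch& left, const OrBranch& right) {
  return left.Cardinality() < right.Cardinality() ? 0 : 1;
}
```

The key point from the comment above is that this decision needs only the counts, which each non-vector index returns without iterating its entries.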
Later on during response collection, we call VerifyFilter which uses the PredicateEvaluator.
The only reason `VerifyFilter` is called is to ensure the validity of the returned entries, accounting for keyspace mutations during query execution. It is important to note that it is only called if the query's return clause contains attributes which are not indexed. An issue arises when a response contains entries whose attribute values were modified during query execution; in that case the mutations have been recorded by the keyspace but not yet by the indexes. This may result in returning attribute values, fresh from the keyspace, which the filter doesn't match.
As an example: consider a hash entry containing two attributes, `f1` and `f2`, where only `f1` is indexed, say as a numeric field. A query uses a pre-filter with `f1` equal to the value `v1_1` and specifies `f2` as the returned attribute. Just before the query is executed, the value of `f1` was `v1_1` while the value of `f2` was `v2_1`; this state is captured by both the keyspace and the indexes at this time. During query execution, these values are changed to `v1_2` and `v2_2`. If we were to just return the entry with the value `v2_2`, we would have returned an invalid entry, because there was no point in time at which `f1` had the value `v1_1` while `f2` had the value `v2_2`.
To address this, the process involves fetching all the attributes specified by either the filter or the return clause from the keyspace. The keyspace may hold a newer version of the fetched attributes which doesn't match the filter, so we re-evaluate the entries against the filter, by calling `VerifyFilter`, with the most up-to-date attribute values before responding.
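A minimal sketch of this verification step, using the `f1`/`f2` example above (the keyspace type and function names here are hypothetical stand-ins, not the actual API): re-fetch the filtered attribute from the keyspace and re-check it against the filter, so an entry mutated mid-query is dropped rather than returned with a value the filter never matched.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the keyspace: key -> attribute -> value.
using Keyspace =
    std::unordered_map<std::string,
                       std::unordered_map<std::string, std::string>>;

// Re-evaluate the filter `f1 == expected_f1` against the freshest keyspace
// values. Returns false if the key was deleted or f1 was mutated mid-query.
bool VerifyEntry(const Keyspace& keyspace, const std::string& key,
                 const std::string& expected_f1) {
  auto it = keyspace.find(key);
  if (it == keyspace.end()) return false;  // key deleted during the query
  auto attr = it->second.find("f1");
  return attr != it->second.end() && attr->second == expected_f1;
}
```

In the example above, the keyspace now holds `f1 = v1_2`, so `VerifyEntry` rejects the entry and the stale `(v1_1, v2_2)` combination is never exposed to the client.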
```cpp
vector_index->AddPrefilteredKey(parameters.query, parameters.k, *key,
                                results, top_keys);
const auto &key = **iterator;
// TODO: add a bloom filter to ensure distinct keys are evaluated
```
Isn't the comment wrong now?
No, it's still valid. `result_keys` contains the keys which are currently tracked by the appender result set. But it might be that a key matched the filter, was added to the appender result set, and was later removed (with vector search). The bloom filter would make sure that keys which were removed from the result set are skipped (avoiding the need to evaluate the key and later skip appending it).
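Since the TODO only proposes the idea, here is one possible shape of such a filter (a hypothetical sketch, not the planned implementation): keys evicted from the top-k result set are recorded, so a later candidate that was already evaluated and evicted can be skipped without re-evaluation. Note that a Bloom filter can report false positives, so a production version would need to tolerate or guard against occasionally skipping a key that was never evicted.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical sketch of the proposed bloom filter over evicted keys.
// Two derived hash positions per key; no deletions, no resizing.
class EvictedKeyFilter {
 public:
  void Add(const std::string& key) {
    bits_.set(Hash1(key));
    bits_.set(Hash2(key));
  }
  // May return true for a key never added (false positive); never returns
  // false for a key that was added.
  bool MaybeContains(const std::string& key) const {
    return bits_.test(Hash1(key)) && bits_.test(Hash2(key));
  }

 private:
  static constexpr std::size_t kBits = 1024;
  static std::size_t Hash1(const std::string& k) {
    return std::hash<std::string>{}(k) % kBits;
  }
  static std::size_t Hash2(const std::string& k) {
    return (std::hash<std::string>{}(k) / kBits) % kBits;
  }
  std::bitset<kBits> bits_;
};
```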
Signed-off-by: Nivesh <110855046+Nivesh-01@users.noreply.github.com>
Co-authored-by: Yair Gottdenker <yairg@google.com> Signed-off-by: Nivesh <110855046+Nivesh-01@users.noreply.github.com>
PR #416 replaces this PR, same code, but updated to fix merge conflicts.
The PR fixes multiple issues in the non-vector search code flow:
- `MaybeAddIndexedContent` function. This ensures the fetching of attribute values (for returned fields) is done by the worker threads using the values stored in the indexes. This boosts query performance by avoiding the need to fetch the attribute values on the main thread when possible.

In addition, the PR contains the following cosmetic changes:
- `EvaluatePrefilteredKeys` now supports both non-vector search and vector pre-filter flows.
- `VectorSearchParameters` has been renamed to `SearchParameters`.
- `InlineVectorEvaluator` has been renamed to `PrefilterEvaluator`.