Search Pushdown (e.g. Vector Search) Into Table Providers #16358
westonpace
started this conversation in
Ideas
Replies: 2 comments
-
FWIW there is an existing implementation of the "full logical plan pushdown" in datafusion-contrib/datafusion-federation. Approach: we extend the TableProvider to keep track of which (remote) engine represents it. A federation optimiser finds the largest sub-plans provided by one engine. This sub-plan is passed to an engine-specific optimiser that can self-determine how much of this sub-plan it can federate. |
Beta Was this translation helpful? Give feedback.
0 replies
-
Cool |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
What is a search
A search operation is defined as:
This can describe a wide variety of scenarios. For example, pg_vector and single store introduce syntactic sugar for vector search:
The
column <--> '[3, 1, 2]'
syntax is merely syntactic sugar forapprox_l2_distance(column, '[3, 1, 2]')
and whether or not we adopt such sugar is not a concern for this ticket (we could introduce a separate ticket for the SQL parser if this is desired).Examples
Here are several different kinds of searches and examples of how they can be accelerated by a table provider.
Vector Search
In a vector search the
score_or_distance
is a distance function (e.g. l2, cosine, etc.) between a column and a query vector. Vector indices such as IVF_PQ or HNSW can speed up these searches.Full Text Search
In a full text search the
score_or_distance
function is often a similarity score like BM25 (or TF-IDF or ...) Full text indices such as a BM25 index can speed up these searches.Basic Search
Though not traditionally considered a search we should also look at basic queries that fit our pattern like
SELECT ... FROM ... WHERE x > 50 ORDER BY y LIMIT 10
. These too can be accelerated when there are indexes onx
andy
.Goal
The challenge for query planning (and the intent of this ticket) is to allow this search to be pushed down into the table provider. Table providers can then use indexes to speed up such searches. The goal for this ticket is to make it possible for these searches to be pushed down into the table provider.
To be clear, my personal goal, is to be able to express vector search via SQL in Lance. A secondary goal is to be able to express full text search and basic search. Since Lance already implements a DataFusion TableProvider this seems like a natural way to go about it.
Details
What follows are some details that will need to be considered during implementation.
Approximate results
Searches with indexes often yield approximate results. Ideally the user should "opt in" to this in some way. In the example above I suggested the distance function be named
approx_l2_distance
. This would allow for another distance function (e.g.l2_distance
) to exist which is not eligible for approximate vector indices. A more general approach might be to introduce anAPPROXIMATE ORDER BY
but this seems a bit much.False negatives
When a search is combined with a filter then it can be difficult to combine the filter with the index exactly. Often approximations are made. For example, post-filtering first applies the search to
K * some_constant
results and then filters down the result set. This can easily lead to false negatives (less than K results) for highly selective filters. There are many other (and better) techniques but most can still lead to false negatives in some scenarios. This is acceptable and part of the approximate nature of the search.Distributed engines
Fortunately, a distributed engine already pushes down the search in manner that is generally acceptable. The order by is pushed down into the individual workers and a final merge is placed on the final node. The workers can all push the search down further (into the table provider) and the top-k from each worker node can be merged, sorted, and limited again to provide good results. However, the details of how distributed query engines push down searches are beyond the scope of this ticket.
Refine factor
Some search engines may implement a "refine factor". This is typically expressed as a multiplier (e.g. 2). The search then uses the index to find the
refine_factor * k
nearest results, reads those from disk, and then sorts the vectors from disk applies a limit again. This is done because the index contains lossy-compressed versions of the vectors and so the comparisons made against the index are inexact and this refine step can give better results. I discuss this below in the approach section.Non-goals
Background (other approaches)
SingleStore-V
In this paper by SingleStore a vector search is recognized and converted into the physical plan:
Note that
m
isrefine_factor * k
as described above.Lance
Lance's physical plan is more low-level than the single store plan and Lance does not operate at the logical plan level for vector search today (that's part of my goal for this ticket). As a result, we have to deal with the concept that the vector index may not cover all partitions and so the plan generally looks something like...
Proposal
I'm a little uncertain what the best approach would be so here are several ideas.
Enhance Scan
The table provider's scan function already allow for some projection (column selection not dynamic column evaluation) and filters to be pushed down. We could add both dynamic column evaluation (in order to express the distance function) and sorts to be pushed down.
One advantage to this approach is that there may be other situations where order by pushdown could be handled by table providers (e.g. when the data is naturally sorted by some column) although there are other approaches (e.g. ordering info) for this already so it might not be a big deal.
The main con I see is that this feels like a slippery slope and maybe we should just jump to...
Full logical plan pushdown
In this approach the table provider is simply given the logical plan and asked to return both a refined logical plan which a logical "pushdown node" which is a wrapper around the lower level physical plan. This gives the table provider a lot of leeway but it would be more complicated perhaps to implement such a method. Also, this happens at the "plan level" and not the "scan level" and so you have questions like "what to do if there are multiple table providers"
Wrap as filter
We could convert the search into a function, similar to the SignalStore approach, but at the logical level. I like this idea but I don't think we should bake the concept of "refine factor" into it and we should make it generic to any search. We don't have to actually represent this conversion in the plan at any point. It could be an
Expr
that only ever reaches the scan pushdown method (and if the provider can't push it down then leave the plan unchanged).New logical node
Another possibility could be to introduce the concept of a "search" logical node. I haven't fully thought this one through and it might be complicated by the fact that the search already comes from several different points (e.g. it combines filter and sort and limit). Still, it could be an interesting relational algebra exercise. This could then be added to the table provider's
scan
or we could add a newsearch
method to theTableProvider
.Beta Was this translation helpful? Give feedback.
All reactions