Search Pushdown (e.g. Vector Search) Into Table Providers #16358

westonpace · 2025-06-10T17:18:39Z

westonpace
Jun 10, 2025

What is a search

A search operation is defined as:

SELECT ... FROM ... WHERE filter ORDER BY score_or_distance LIMIT k OFFSET offset

This can describe a wide variety of scenarios. For example, pg_vector and single store introduce syntactic sugar for vector search:

-- Example from single store paper
SELECT * FROM t ORDER BY t.v <-> @vector LIMIT k;
-- Example from pgvector
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

The column <--> '[3, 1, 2]' syntax is merely syntactic sugar for approx_l2_distance(column, '[3, 1, 2]') and whether or not we adopt such sugar is not a concern for this ticket (we could introduce a separate ticket for the SQL parser if this is desired).

Examples

Here are several different kinds of searches and examples of how they can be accelerated by a table provider.

Vector Search

In a vector search the score_or_distance is a distance function (e.g. l2, cosine, etc.) between a column and a query vector. Vector indices such as IVF_PQ or HNSW can speed up these searches.

Full Text Search

In a full text search the score_or_distance function is often a similarity score like BM25 (or TF-IDF or ...) Full text indices such as a BM25 index can speed up these searches.

Basic Search

Though not traditionally considered a search we should also look at basic queries that fit our pattern like SELECT ... FROM ... WHERE x > 50 ORDER BY y LIMIT 10. These too can be accelerated when there are indexes on x and y.

Goal

The challenge for query planning (and the intent of this ticket) is to allow this search to be pushed down into the table provider. Table providers can then use indexes to speed up such searches. The goal for this ticket is to make it possible for these searches to be pushed down into the table provider.

To be clear, my personal goal, is to be able to express vector search via SQL in Lance. A secondary goal is to be able to express full text search and basic search. Since Lance already implements a DataFusion TableProvider this seems like a natural way to go about it.

Details

What follows are some details that will need to be considered during implementation.

Approximate results

Searches with indexes often yield approximate results. Ideally the user should "opt in" to this in some way. In the example above I suggested the distance function be named approx_l2_distance. This would allow for another distance function (e.g. l2_distance) to exist which is not eligible for approximate vector indices. A more general approach might be to introduce an APPROXIMATE ORDER BY but this seems a bit much.

False negatives

When a search is combined with a filter then it can be difficult to combine the filter with the index exactly. Often approximations are made. For example, post-filtering first applies the search to K * some_constant results and then filters down the result set. This can easily lead to false negatives (less than K results) for highly selective filters. There are many other (and better) techniques but most can still lead to false negatives in some scenarios. This is acceptable and part of the approximate nature of the search.

Distributed engines

Fortunately, a distributed engine already pushes down the search in manner that is generally acceptable. The order by is pushed down into the individual workers and a final merge is placed on the final node. The workers can all push the search down further (into the table provider) and the top-k from each worker node can be merged, sorted, and limited again to provide good results. However, the details of how distributed query engines push down searches are beyond the scope of this ticket.

Refine factor

Some search engines may implement a "refine factor". This is typically expressed as a multiplier (e.g. 2). The search then uses the index to find the refine_factor * k nearest results, reads those from disk, and then sorts the vectors from disk applies a limit again. This is done because the index contains lossy-compressed versions of the vectors and so the comparisons made against the index are inexact and this refine step can give better results. I discuss this below in the approach section.

Non-goals

This ticket is not designed to add vector indexes to Datafusion, but rather to allow the search to be pushed down to the table provider. Obviously I am a bit biased here as I work on a table provider that would be able to satisfy this pushdown 😄. There is no problem with adding such indexes (or adopting another indexing library to add them) but that is beyond the scope and we should still allow for table provider pushdown.
This ticket is not designed to provide implementations for the various distance and/or score functions. That would be nice to have, so that a fallback implementation (e.g. exhaustive search) will exist for table providers that don't have speedup. However this is beyond the scope of this ticket.
As mentioned above, this ticket is not designed to add new syntactic sugar to the SQL parser. I'm not personally a huge fan but it does make it easier to ensure your query is "index eligible" and pgvector users may be coming to expect it at this point so 🤷. Still, that work can be a different ticket.

Background (other approaches)

SingleStore-V

In this paper by SingleStore a vector search is recognized and converted into the physical plan:

Project [t.v <-> @vector]
TopSort limit:k [t.v <-> @vector]
ColumnStoreFilter [
  Top(m, t.v <-> @vector, <filters>) index
]
ColumnStoreScan t

Note that m is refine_factor * k as described above.

Lance

Lance's physical plan is more low-level than the single store plan and Lance does not operate at the logical plan level for vector search today (that's part of my goal for this ticket). As a result, we have to deal with the concept that the vector index may not cover all partitions and so the plan generally looks something like...

TopK
Union
  VectorIndexSearch
    LoadPrefilter
  TopK
    NearestNeighbors
      Scan

Proposal

I'm a little uncertain what the best approach would be so here are several ideas.

Enhance Scan

The table provider's scan function already allow for some projection (column selection not dynamic column evaluation) and filters to be pushed down. We could add both dynamic column evaluation (in order to express the distance function) and sorts to be pushed down.

One advantage to this approach is that there may be other situations where order by pushdown could be handled by table providers (e.g. when the data is naturally sorted by some column) although there are other approaches (e.g. ordering info) for this already so it might not be a big deal.

The main con I see is that this feels like a slippery slope and maybe we should just jump to...

Full logical plan pushdown

In this approach the table provider is simply given the logical plan and asked to return both a refined logical plan which a logical "pushdown node" which is a wrapper around the lower level physical plan. This gives the table provider a lot of leeway but it would be more complicated perhaps to implement such a method. Also, this happens at the "plan level" and not the "scan level" and so you have questions like "what to do if there are multiple table providers"

Wrap as filter

We could convert the search into a function, similar to the SignalStore approach, but at the logical level. I like this idea but I don't think we should bake the concept of "refine factor" into it and we should make it generic to any search. We don't have to actually represent this conversion in the plan at any point. It could be an Expr that only ever reaches the scan pushdown method (and if the provider can't push it down then leave the plan unchanged).

New logical node

Another possibility could be to introduce the concept of a "search" logical node. I haven't fully thought this one through and it might be complicated by the fact that the search already comes from several different points (e.g. it combines filter and sort and limit). Still, it could be an interesting relational algebra exercise. This could then be added to the table provider's scan or we could add a new search method to the TableProvider.

backkem · 2025-06-10T20:59:32Z

backkem
Jun 10, 2025

FWIW there is an existing implementation of the "full logical plan pushdown" in datafusion-contrib/datafusion-federation. Approach: we extend the TableProvider to keep track of which (remote) engine represents it. A federation optimiser finds the largest sub-plans provided by one engine. This sub-plan is passed to an engine-specific optimiser that can self-determine how much of this sub-plan it can federate.

0 replies

niebayes · 2025-06-13T14:38:19Z

niebayes
Jun 13, 2025

Cool

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Search Pushdown (e.g. Vector Search) Into Table Providers #16358

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Search Pushdown (e.g. Vector Search) Into Table Providers #16358

Uh oh!

westonpace Jun 10, 2025

What is a search

Examples

Vector Search

Full Text Search

Basic Search

Goal

Details

Approximate results

False negatives

Distributed engines

Refine factor

Non-goals

Background (other approaches)

SingleStore-V

Lance

Proposal

Enhance Scan

Full logical plan pushdown

Wrap as filter

New logical node

Replies: 2 comments

Uh oh!

backkem Jun 10, 2025

Uh oh!

niebayes Jun 13, 2025

westonpace
Jun 10, 2025

backkem
Jun 10, 2025

niebayes
Jun 13, 2025