As discussed in #16791, the long-term plan in my mind (which I would like to discuss with the community) is to replace SchemaAdapter with `PhysicalExprAdapter`.
There are multiple reasons for this:
- We can better optimize scenarios like missing columns or casts. For example, it's cheaper to cast a literal and evaluate it against the data as read from the file than it is to read the data from the file and cast that to the type of the literal. It is also cheaper to evaluate the expression `1 > col1` as `1 > null` when `col1` is missing than it is to create an array of nulls. Since we can also simplify `PhysicalExpr` we can even simplify `1 > null` into just `null`.
- It's easier to manipulate `PhysicalExpr`s than it is to manipulate arrays. We already have machinery (`TreeNode` APIs, etc.) to do so.
- This is necessary to be able to push down projections into file scans, which we need for upcoming Variant work and which will also allow us to read single fields in a struct without reading the entire struct into memory. Basically, if we want to be able to customize how expressions are evaluated for a specific format, in particular how `variant_get(column, 'field')` or `get_field(column, 'field')` are executed in the context of a specific format (e.g. in parquet we can read single struct columns or use shredded variant), we need to have access to the expression in ParquetOpener in order to check if the file schema has the shredded variant field and generate the right ProjectionMask.
- Paves the path for any other advanced optimizations, e.g. we could do crazy stuff like only read the dictionary page from a parquet column for a filter `col = 'a'` and, if `'a'` is not in the dictionary, not even bother reading the keys.
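To make the first point concrete, here is a minimal, self-contained sketch of the missing-column rewrite. It deliberately uses a toy `Expr` enum and hypothetical `adapt` / `simplify` helpers rather than DataFusion's actual `PhysicalExpr` or `PhysicalExprAdapter` APIs, so treat it as an illustration of the idea, not of the real implementation:

```rust
use std::collections::HashSet;

/// Toy expression tree. This is a stand-in for illustration only,
/// not DataFusion's `PhysicalExpr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Int(i64),
    Null,
    Gt(Box<Expr>, Box<Expr>),
}

/// Hypothetical adapter step: replace references to columns that are
/// absent from the file with a null literal, instead of materializing
/// an array of nulls.
fn adapt(expr: Expr, file_columns: &HashSet<&str>) -> Expr {
    match expr {
        Expr::Column(name) if !file_columns.contains(name.as_str()) => Expr::Null,
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(adapt(*l, file_columns)),
            Box::new(adapt(*r, file_columns)),
        ),
        other => other,
    }
}

/// Hypothetical simplifier step: a comparison with a null operand
/// evaluates to null, so fold it away before touching any data.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Gt(l, r) => {
            let (l, r) = (simplify(*l), simplify(*r));
            if l == Expr::Null || r == Expr::Null {
                Expr::Null
            } else {
                Expr::Gt(Box::new(l), Box::new(r))
            }
        }
        other => other,
    }
}
```

With a file that only contains `col2`, adapting `1 > col1` yields `1 > null`, and simplifying that yields `null`, so no per-row work is needed for this predicate.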
We've already implemented a replacement system for predicate pushdown via `PhysicalExprAdapter` and have examples showing how to do some of the things a custom SchemaAdapter can do.
Once we implement #14993 we'll be able to deprecate SchemaAdapter for the most part.