Plan to replace SchemaAdapter with PhysicalExprAdapter #16800

@adriangb

As discussed in #16791, my long-term plan (which I would like to discuss with the community) is to replace SchemaAdapter with PhysicalExprAdapter.

There are multiple reasons for this:

  • We can better optimize scenarios like missing columns or casts. For example, it's cheaper to cast a literal once and evaluate it against the data as read from the file than it is to read the data from the file and cast every value to the type of the literal. It is also cheaper to evaluate the expression 1 > col1 as 1 > null when col1 is missing than it is to create an array of nulls. And since PhysicalExprs can be simplified, we can even reduce 1 > null to just null.
  • It's easier to manipulate PhysicalExprs than it is to manipulate arrays. We already have machinery (TreeNode APIs, etc.) to do so.
  • This is necessary to push projections down into file scans, which we need for the upcoming Variant work and which will also let us read single fields of a struct without reading the entire struct into memory. Basically, if we want to customize how expressions are evaluated for a specific format, in particular how variant_get(column, 'field') or get_field(column, 'field') are executed (e.g. in Parquet we can read single struct columns or use shredded variants), we need access to the expression in ParquetOpener so that we can check whether the file schema has the shredded variant field and generate the right ProjectionMask.
  • It paves the way for other advanced optimizations, e.g. we could do crazy stuff like only read the dictionary page of a Parquet column for a filter col = 'a' and, if 'a' is not in the dictionary, not even bother reading the keys.
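
To make the first point concrete, here is a minimal sketch of the missing-column rewrite. It uses a toy expression tree, not DataFusion's actual `PhysicalExpr` or `PhysicalExprAdapter` types; the names `Expr`, `Value`, `adapt` and `simplify` are all hypothetical and exist only for illustration. The idea is that a reference to a column absent from the file is replaced with a null literal, after which the comparison can be folded away without ever materializing an array of nulls.

```rust
use std::collections::HashSet;

/// Toy scalar value (hypothetical). `Null` stands in for a typed null.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Bool(bool),
    Null,
}

/// Toy expression tree (hypothetical), far simpler than a real physical expression.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(Value),
    Gt(Box<Expr>, Box<Expr>),
}

/// Rewrite step: replace references to columns that are missing from the
/// file with a null literal. Everything else is left unchanged.
fn adapt(expr: Expr, file_columns: &HashSet<&str>) -> Expr {
    match expr {
        Expr::Column(name) if !file_columns.contains(name.as_str()) => {
            Expr::Literal(Value::Null)
        }
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(adapt(*l, file_columns)),
            Box::new(adapt(*r, file_columns)),
        ),
        other => other,
    }
}

/// Simplification step: a comparison with a null operand is null, and a
/// comparison of two integer literals is folded to a boolean literal.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Gt(l, r) => {
            let (l, r) = (simplify(*l), simplify(*r));
            let is_null = |e: &Expr| matches!(e, Expr::Literal(Value::Null));
            if is_null(&l) || is_null(&r) {
                return Expr::Literal(Value::Null);
            }
            if let (Expr::Literal(Value::Int(a)), Expr::Literal(Value::Int(b))) = (&l, &r) {
                return Expr::Literal(Value::Bool(a > b));
            }
            Expr::Gt(Box::new(l), Box::new(r))
        }
        other => other,
    }
}
```

With a file that only contains col2, calling `simplify(adapt(expr, &file_columns))` on the toy equivalent of 1 > col1 yields `Expr::Literal(Value::Null)`, while 1 > col2 is returned unchanged and would be evaluated against the data as read from the file.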

We've already implemented a replacement system for predicate pushdown via PhysicalExprAdapter, and we have examples showing how to do some of the things a custom SchemaAdapter can do.
Once we implement #14993, we'll be able to deprecate SchemaAdapter for the most part.
