-
Notifications
You must be signed in to change notification settings - Fork 33
Open
Description
What happens?
Using LIMIT can result in very strange Substrait plans which also produce incorrect results.
To Reproduce
This simpler example works fine:
# Setup
>>> import duckdb
>>> con.sql("DROP TABLE crossfit;")
>>> con.sql("CREATE TABLE crossfit (exercise text, difficulty_level int);")
>>> con.sql("INSERT INTO crossfit VALUES ('Push Ups', 3), ('Pull Ups', 5) , ('Push Jerk'\
, 7), ('Bar Muscle Up', 10);")
>>> con.sql("select * from crossfit;")
┌───────────────┬──────────────────┐
│ exercise │ difficulty_level │
│ varchar │ int32 │
├───────────────┼──────────────────┤
│ Push Ups │ 3 │
│ Pull Ups │ 5 │
│ Push Jerk │ 7 │
│ Bar Muscle Up │ 10 │
└───────────────┴──────────────────┘
# This is what will expect to get
>>> con.sql("select * from crossfit limit 1;")
┌──────────┬──────────────────┐
│ exercise │ difficulty_level │
│ varchar │ int32 │
├──────────┼──────────────────┤
│ Push Ups │ 3 │
└──────────┴──────────────────┘
# And we do get what we expect:
>>> print(con.sql(f"CALL from_substrait_json('{con.sql("CALL get_substrait_json('select \
* from crossfit limit 1')").fetchone()[0]}');"))
┌──────────┬──────────────────┐
│ exercise │ difficulty_level │
│ varchar │ int32 │
├──────────┼──────────────────┤
│ Push Ups │ 3 │
└──────────┴──────────────────┘
However another example doesn't work:
(data can be downloaded from https://files.brycemecum.com/penguins.parquet)
>>> import duckdb
>>> con = duckdb.connect()
>>> con.sql("LOAD substrait")
>>> con.sql("CREATE TABLE 'penguins' AS SELECT * FROM 'data/penguins.parquet';")
>>> plan_json = con.sql("CALL get_substrait_json('select * from penguins limit 1;');").fetchone()[0]
>>> print(con.sql(f"CALL from_substrait_json('{plan_json}');"))
┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
│ species │ island │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │ sex │ year │
│ varchar │ varchar │ double │ double │ int32 │ int32 │ varchar │ int32 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
│ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │
│ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │
│ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │
│ Adelie │ Torgersen │ NULL │ NULL │ NULL │ NULL │ NULL │ 2007 │
│ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │
│ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │
│ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │
│ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │
│ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │
│ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ Adelie │ Dream │ 32.1 │ 15.5 │ 188 │ 3050 │ female │ 2009 │
│ Adelie │ Dream │ 40.7 │ 17.0 │ 190 │ 3725 │ male │ 2009 │
│ Adelie │ Dream │ 37.3 │ 16.8 │ 192 │ 3000 │ female │ 2009 │
│ Adelie │ Dream │ 39.0 │ 18.7 │ 185 │ 3650 │ male │ 2009 │
│ Adelie │ Dream │ 39.2 │ 18.6 │ 190 │ 4250 │ male │ 2009 │
│ Adelie │ Dream │ 36.6 │ 18.4 │ 184 │ 3475 │ female │ 2009 │
│ Adelie │ Dream │ 36.0 │ 17.8 │ 195 │ 3450 │ female │ 2009 │
│ Adelie │ Dream │ 37.8 │ 18.1 │ 193 │ 3750 │ male │ 2009 │
│ Adelie │ Dream │ 36.0 │ 17.1 │ 187 │ 3700 │ female │ 2009 │
│ Adelie │ Dream │ 41.5 │ 18.5 │ 201 │ 4000 │ male │ 2009 │
├─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┤
│ 152 rows (20 shown) 8 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
It's clearly done some filtering but the limit isn't respected and the plan it produces has a very strange structure given the query:
Root
|-- Project
|-- Project
|-- Sort
|-- Project
|-- Join
|-- Left
| |-- Read
|-- Right
|-- Fetch(offset=0, limit=10)
|-- Read
whereas the plan only needs to be this complicated:
Root
|-- Fetch(offset=0, limit=10)
|-- Read
I couldn't figure out how to reproduce with development versions but I did check the box below.
OS:
macOS 15
Substrait-Extension Version:
latest version from install substrait from community
DuckDB Version:
1.2.0
DuckDB Client:
1.2.0
Have you tried this on the latest master
branch?
- I have
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I have
Metadata
Metadata
Assignees
Labels
No labels