Skip to content

LIMIT clause not always working correctly #150

@amoeba

Description

@amoeba

What happens?

Using LIMIT can result in very strange Substrait plans which also produce incorrect results.

To Reproduce

This simpler example works fine:

# Setup
>>> import duckdb
>>> con.sql("DROP TABLE crossfit;")
>>> con.sql("CREATE TABLE crossfit (exercise text, difficulty_level int);")
>>> con.sql("INSERT INTO crossfit VALUES ('Push Ups', 3), ('Pull Ups', 5) , ('Push Jerk'\
, 7), ('Bar Muscle Up', 10);")
>>> con.sql("select * from crossfit;")
┌───────────────┬──────────────────┐
│   exercisedifficulty_level │
│    varcharint32       │
├───────────────┼──────────────────┤
│ Push Ups3 │
│ Pull Ups5 │
│ Push Jerk7 │
│ Bar Muscle Up10 │
└───────────────┴──────────────────┘

# This is what will expect to get

>>> con.sql("select * from crossfit limit 1;")
┌──────────┬──────────────────┐
│ exercisedifficulty_level │
│ varcharint32       │
├──────────┼──────────────────┤
│ Push Ups3 │
└──────────┴──────────────────┘

# And we do get what we expect:

>>> print(con.sql(f"CALL from_substrait_json('{con.sql("CALL get_substrait_json('select \
* from crossfit limit 1')").fetchone()[0]}');"))
┌──────────┬──────────────────┐
│ exercisedifficulty_level │
│ varcharint32       │
├──────────┼──────────────────┤
│ Push Ups3 │
└──────────┴──────────────────┘

However another example doesn't work:

(data can be downloaded from https://files.brycemecum.com/penguins.parquet)

>>> import duckdb
>>> con = duckdb.connect()
>>> con.sql("LOAD substrait")
>>> con.sql("CREATE TABLE 'penguins' AS SELECT * FROM 'data/penguins.parquet';")
>>> plan_json = con.sql("CALL get_substrait_json('select * from penguins limit 1;');").fetchone()[0]
>>> print(con.sql(f"CALL from_substrait_json('{plan_json}');"))
┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
│ speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear  │
│ varcharvarchardoubledoubleint32int32varcharint32 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
│ AdelieTorgersen39.118.71813750male2007 │
│ AdelieTorgersen39.517.41863800female2007 │
│ AdelieTorgersen40.318.01953250female2007 │
│ AdelieTorgersenNULLNULLNULLNULLNULL2007 │
│ AdelieTorgersen36.719.31933450female2007 │
│ AdelieTorgersen39.320.61903650male2007 │
│ AdelieTorgersen38.917.81813625female2007 │
│ AdelieTorgersen39.219.61954675male2007 │
│ AdelieTorgersen34.118.11933475NULL2007 │
│ AdelieTorgersen42.020.21904250NULL2007 │
│   ·     │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│   ·     │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│   ·     │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│ AdelieDream32.115.51883050female2009 │
│ AdelieDream40.717.01903725male2009 │
│ AdelieDream37.316.81923000female2009 │
│ AdelieDream39.018.71853650male2009 │
│ AdelieDream39.218.61904250male2009 │
│ AdelieDream36.618.41843475female2009 │
│ AdelieDream36.017.81953450female2009 │
│ AdelieDream37.818.11933750male2009 │
│ AdelieDream36.017.11873700female2009 │
│ AdelieDream41.518.52014000male2009 │
├─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┤
│ 152 rows (20 shown)                                                                            8 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

It's clearly done some filtering but the limit isn't respected and the plan it produces has a very strange structure given the query:

Root
|-- Project
    |-- Project
        |-- Sort
            |-- Project
                |-- Join
                    |-- Left
                    |   |-- Read
                    |-- Right
                        |-- Fetch(offset=0, limit=10)
                            |-- Read

whereas the plan only needs to be this complicated:

Root
|-- Fetch(offset=0, limit=10)
    |-- Read

I couldn't figure out how to reproduce with development versions but I did check the box below.

OS:

macOS 15

Substrait-Extension Version:

latest version from install substrait from community

DuckDB Version:

1.2.0

DuckDB Client:

1.2.0

Have you tried this on the latest master branch?

  • I have

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I have

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions