feat: use spawned tasks to reduce call stack depth and avoid busy waiting #16319

Draft
wants to merge 1 commit into
base: main

Conversation

pepijnve
Contributor

@pepijnve pepijnve commented Jun 7, 2025

Which issue does this PR close?

Rationale for this change

Yielding to the runtime in Tokio involves unwinding the call stack. When a query contains many nested pipeline blockers, a yield is likely to happen from quite deep in that stack. PRs #16196 and/or #16301 increase the frequency of this.

Luckily Tokio provides just the tool to solve this: spawned tasks. By moving the blocking portion of operators to a spawned task, the call stack depth is significantly reduced. Additionally, the caller no longer needs to poll the blocking task in a busy loop: it only gets woken when the spawned task completes, which is the signal that the stream is ready to start emitting data.

What this change effectively does is chop a chain of n dependent pipeline blockers into n subtasks. Only one of these subtasks will actually be scheduled by the runtime at any given time. The others will simply wait until their direct dependency is ready to emit data.

Possible alternative

It could be interesting to generalize the pattern of a pipeline blocking operator having a blocking prepare phase, followed by a streaming emit phase. I think this would probably have to take the shape of support types for operator implementations, since this is rather tightly coupled to the internal implementation of an operator. I did not attempt to design this kind of support code yet. I don't think having general support code is a strict prerequisite for this PR, though. That's something that can be done in a later refactor, since the current change is 100% implementation details.

What changes are included in this PR?

Wrap the blocking portion of sort and join (build phase) in a spawned task.
The spawned tasks are kicked off the first time the stream is polled, not when execute is called. The guideline on this is not yet clear to me, but this is related to #16312.

The aggregation operators have not been modified in this PR yet, but those could benefit from the same change. If this direction is deemed promising I will update those as well.

Are these changes tested?

No new tests added, covered by existing tests.

❌ The WASM tests fail due to those not yet running in a tokio context. Looking for feedback on whether that's a showstopper or not and how to fix that.

Are there any user-facing changes?

No, the modified operators yield more efficiently but nothing else changes.

@github-actions github-actions bot added common Related to common crate physical-plan Changes to the physical-plan crate labels Jun 7, 2025
@pepijnve pepijnve force-pushed the issue_16318 branch 4 times, most recently from 8769ce4 to 2a965dc Compare June 8, 2025 15:03
break;
// Spawn a task the first time the stream is polled for the sort phase.
// This ensures the consumer of the sort does not poll unnecessarily
// while the sort is ongoing
Contributor

I may not fully understand this change, but I believe it will start generating input as soon as execute is called, rather than waiting for the first poll...

Another potential problem is that it will disconnect the production of batches from the consumption of them -- in other words by design I think it will potentially produce batches on a different thread than consumes them

Contributor Author

I intentionally tried to avoid that and I'm fairly sure the change does avoid it, but I'll try to come up with a test that demonstrates it since it's a very important detail.

These kinds of constructs have an Inceptiony quality to them, don't they? If I got my head wrapped around Futures and async correctly, they're essentially interchangeable and inert until first polled.

What should be happening here is that stream::once creates a stream consisting of a single element, which is produced by a future. The future in question is only polled the first time the stream is polled.

That future is an async block which in its first poll spawns the task. In other words, the spawn is deferred until first poll. Then we await the spawned task, which is just polling the JoinHandle.

Anyway, I don't want you to take my word for it. I'll get to work on a test case.
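For readers following the laziness argument above, here is a minimal sketch using only the standard library. It is not the test case from this PR: the `AtomicBool` store stands in for the `spawn` call, and `NoopWake` and `started_before_and_after_first_poll` are hypothetical names introduced for illustration. It only shows that the body of an async block does not run until the future is polled for the first time.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for polling a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns whether the async block had started (before, after) the first poll.
fn started_before_and_after_first_poll() -> (bool, bool) {
    let started = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&started);

    // Assumption: this store plays the role of the spawn call in the PR.
    // It sits inside the async block, so it only runs once the block is polled.
    let fut = async move {
        flag.store(true, Ordering::SeqCst);
    };

    // Constructing the future has not executed any of its body.
    let before = started.load(Ordering::SeqCst);

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let first_poll = fut.as_mut().poll(&mut cx);
    assert!(matches!(first_poll, Poll::Ready(())));

    let after = started.load(Ordering::SeqCst);
    (before, after)
}

fn main() {
    let (before, after) = started_before_and_after_first_poll();
    println!("started before first poll: {before}, after first poll: {after}");
}
```

The same reasoning would apply to a future wrapped in a single-element stream, since that future is only polled when the stream is.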

Contributor Author

@pepijnve pepijnve Jun 9, 2025

I might have gotten this completely wrong, but that should not be the case. You're correct that preparing the stream is potentially on a different thread, but what gets returned by the task is the stream itself, not the individual record batches. The produced stream should still be getting drained by the original task. What you're describing is what RecordBatchReceiverStream does which is quite different and requires channels for the inter-thread communication.

Contributor Author

@pepijnve pepijnve Jun 9, 2025

I've added a test case that attempts to demonstrate that processing is deferred. If this looks ok to you I can add the same thing for the other touched code as well.

I'm not sure how I can demonstrate the absence of multi-threading in a test case.

Wrt comprehensibility, I have to admit I am still very much in the learning-as-I-go phase of using the futures crate. There might be a more elegant or straightforward way to express this construct.

Contributor

@alamb alamb left a comment

Thanks @pepijnve -- this is an interesting idea, but I have a concern. I'll also fire off some benchmarks to see if it has a measurable impact

@alamb
Contributor

alamb commented Jun 8, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_16318 (2a965dc) to 1daa5ed diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Jun 9, 2025

🤖: Benchmark completed

Details

Comparing HEAD and issue_16318
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_16318 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1764.58 ms │  1887.17 ms │ 1.07x slower │
│ QQuery 1     │   697.84 ms │   737.88 ms │ 1.06x slower │
│ QQuery 2     │  1371.85 ms │  1456.38 ms │ 1.06x slower │
│ QQuery 3     │   679.75 ms │   706.70 ms │    no change │
│ QQuery 4     │  1455.11 ms │  1441.33 ms │    no change │
│ QQuery 5     │ 15616.91 ms │ 15838.54 ms │    no change │
│ QQuery 6     │  2021.53 ms │  2092.32 ms │    no change │
│ QQuery 7     │  2070.45 ms │  2255.10 ms │ 1.09x slower │
│ QQuery 8     │   850.20 ms │   820.90 ms │    no change │
└──────────────┴─────────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 26528.21ms │
│ Total Time (issue_16318)   │ 27236.34ms │
│ Average Time (HEAD)        │  2947.58ms │
│ Average Time (issue_16318) │  3026.26ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          4 │
│ Queries with No Change     │          5 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_16318 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    15.98 ms │    14.96 ms │ +1.07x faster │
│ QQuery 1     │    32.71 ms │    32.55 ms │     no change │
│ QQuery 2     │    81.72 ms │    80.60 ms │     no change │
│ QQuery 3     │    98.69 ms │    94.35 ms │     no change │
│ QQuery 4     │   579.71 ms │   595.93 ms │     no change │
│ QQuery 5     │   845.60 ms │   861.52 ms │     no change │
│ QQuery 6     │    23.05 ms │    23.03 ms │     no change │
│ QQuery 7     │    37.18 ms │    35.64 ms │     no change │
│ QQuery 8     │   896.54 ms │   888.95 ms │     no change │
│ QQuery 9     │  1154.45 ms │  1185.07 ms │     no change │
│ QQuery 10    │   265.14 ms │   266.51 ms │     no change │
│ QQuery 11    │   296.19 ms │   303.53 ms │     no change │
│ QQuery 12    │   896.36 ms │   910.80 ms │     no change │
│ QQuery 13    │  1212.33 ms │  1372.46 ms │  1.13x slower │
│ QQuery 14    │   833.77 ms │   849.28 ms │     no change │
│ QQuery 15    │   810.40 ms │   825.57 ms │     no change │
│ QQuery 16    │  1702.67 ms │  1754.50 ms │     no change │
│ QQuery 17    │  1593.37 ms │  1622.95 ms │     no change │
│ QQuery 18    │  3061.89 ms │  3090.11 ms │     no change │
│ QQuery 19    │    80.71 ms │    84.37 ms │     no change │
│ QQuery 20    │  1131.96 ms │  1171.56 ms │     no change │
│ QQuery 21    │  1299.18 ms │  1355.14 ms │     no change │
│ QQuery 22    │  2190.27 ms │  2289.42 ms │     no change │
│ QQuery 23    │  8012.02 ms │  8214.48 ms │     no change │
│ QQuery 24    │   467.12 ms │   479.98 ms │     no change │
│ QQuery 25    │   384.56 ms │   405.53 ms │  1.05x slower │
│ QQuery 26    │   527.13 ms │   538.47 ms │     no change │
│ QQuery 27    │  1590.25 ms │  1648.26 ms │     no change │
│ QQuery 28    │ 13842.27 ms │ 12564.81 ms │ +1.10x faster │
│ QQuery 29    │   533.53 ms │   522.59 ms │     no change │
│ QQuery 30    │   803.36 ms │   816.08 ms │     no change │
│ QQuery 31    │   863.54 ms │   862.31 ms │     no change │
│ QQuery 32    │  2627.68 ms │  2692.79 ms │     no change │
│ QQuery 33    │  3276.79 ms │  3356.72 ms │     no change │
│ QQuery 34    │  3323.95 ms │  3383.36 ms │     no change │
│ QQuery 35    │  1273.32 ms │  1281.73 ms │     no change │
│ QQuery 36    │   122.83 ms │   122.15 ms │     no change │
│ QQuery 37    │    55.54 ms │    57.40 ms │     no change │
│ QQuery 38    │   124.07 ms │   123.23 ms │     no change │
│ QQuery 39    │   197.18 ms │   195.97 ms │     no change │
│ QQuery 40    │    48.39 ms │    49.25 ms │     no change │
│ QQuery 41    │    43.34 ms │    44.45 ms │     no change │
│ QQuery 42    │    38.56 ms │    38.53 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 57295.30ms │
│ Total Time (issue_16318)   │ 57106.84ms │
│ Average Time (HEAD)        │  1332.45ms │
│ Average Time (issue_16318) │  1328.07ms │
│ Queries Faster             │          2 │
│ Queries Slower             │          2 │
│ Queries with No Change     │         39 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ issue_16318 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 119.84 ms │   119.54 ms │    no change │
│ QQuery 2     │  22.34 ms │    21.44 ms │    no change │
│ QQuery 3     │  33.02 ms │    32.72 ms │    no change │
│ QQuery 4     │  20.20 ms │    19.94 ms │    no change │
│ QQuery 5     │  53.21 ms │    51.88 ms │    no change │
│ QQuery 6     │  12.31 ms │    12.07 ms │    no change │
│ QQuery 7     │  96.52 ms │    93.30 ms │    no change │
│ QQuery 8     │  25.53 ms │    25.72 ms │    no change │
│ QQuery 9     │  57.51 ms │    58.47 ms │    no change │
│ QQuery 10    │  48.69 ms │    49.67 ms │    no change │
│ QQuery 11    │  11.31 ms │    11.33 ms │    no change │
│ QQuery 12    │  40.74 ms │    41.53 ms │    no change │
│ QQuery 13    │  27.30 ms │    27.26 ms │    no change │
│ QQuery 14    │   9.71 ms │     9.89 ms │    no change │
│ QQuery 15    │  22.36 ms │    21.91 ms │    no change │
│ QQuery 16    │  20.83 ms │    21.17 ms │    no change │
│ QQuery 17    │  94.35 ms │    90.99 ms │    no change │
│ QQuery 18    │ 204.01 ms │   204.54 ms │    no change │
│ QQuery 19    │  25.76 ms │    25.56 ms │    no change │
│ QQuery 20    │  34.57 ms │    35.02 ms │    no change │
│ QQuery 21    │ 153.21 ms │   160.51 ms │    no change │
│ QQuery 22    │  16.12 ms │    17.13 ms │ 1.06x slower │
└──────────────┴───────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 1149.45ms │
│ Total Time (issue_16318)   │ 1151.58ms │
│ Average Time (HEAD)        │   52.25ms │
│ Average Time (issue_16318) │   52.34ms │
│ Queries Faster             │         0 │
│ Queries Slower             │         1 │
│ Queries with No Change     │        21 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

@pepijnve
Contributor Author

pepijnve commented Jun 9, 2025

@alamb I've been trying to make sense of what to do with the benchmark results. They always seem to give me very mixed results when I run them locally (that's part of why I did the min/max/stddev thing, to try to get more insight). Some tests are slower, but the total and average time increase is marginal. Should I take a closer look at the 1.13x slower one?

clickbench_extended seems to have a consistent penalty. I'll try to understand why that is.

@pepijnve
Contributor Author

pepijnve commented Jun 9, 2025

I had a look at clickbench_extended. I cannot explain the slowdown. Those queries do not even use sorting or joins. The plan for the first one for instance is

AggregateExec: mode=Final
  CoalescePartitionsExec
    AggregateExec: mode=Partial
      DataSourceExec:

@pepijnve
Contributor Author

pepijnve commented Jun 9, 2025

Looking at the clickbench_partitioned outliers. Wrt the code changes in this PR they seem pretty similar yet one has basically the opposite result of the other. What's interesting is that total time is actually lower for the run. I collected target_partitions = 1 plans. Perhaps the results are influenced by repartition/coalesce?

Query 13: 1.13x slower

SELECT "SearchPhrase", COUNT(DISTINCT "UserID") AS u FROM 'hits.parquet' WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY u DESC LIMIT 10;
SortExec: TopK(fetch=10), expr=[u@1 DESC], preserve_partitioning=[false]
  ProjectionExec: expr=[SearchPhrase@0 as SearchPhrase, count(alias1)@1 as u]
    AggregateExec: mode=Single, gby=[SearchPhrase@0 as SearchPhrase], aggr=[count(alias1)]
      AggregateExec: mode=Single, gby=[SearchPhrase@1 as SearchPhrase, UserID@0 as alias1], aggr=[]
        CoalesceBatchesExec: target_batch_size=8192
          FilterExec: SearchPhrase@1 !=
            DataSourceExec: file_groups={1 group: [[Users/pepijn/RustroverProjects/datafusion/benchmarks/data/hits.parquet:0..14779976446]]}, projection=[UserID, SearchPhrase], file_type=parquet, predicate=SearchPhrase@39 != , pruning_predicate=SearchPhrase_null_count@2 != row_count@3 AND (SearchPhrase_min@0 !=  OR  != SearchPhrase_max@1), required_guarantees=[SearchPhrase not in ()]

Query 28: 1.10x faster

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM 'hits.parquet' WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
SortExec: TopK(fetch=25), expr=[l@1 DESC], preserve_partitioning=[false]
  ProjectionExec: expr=[regexp_replace(hits.parquet.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))@0 as k, avg(character_length(hits.parquet.Referer))@1 as l, count(Int64(1))@2 as c, min(hits.parquet.Referer)@3 as min(hits.parquet.Referer)]
    CoalesceBatchesExec: target_batch_size=8192
      FilterExec: count(Int64(1))@2 > 100000
        AggregateExec: mode=Single, gby=[regexp_replace(Referer@0, ^https?://(?:www\.)?([^/]+)/.*$, \1) as regexp_replace(hits.parquet.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))], aggr=[avg(character_length(hits.parquet.Referer)), count(Int64(1)), min(hits.parquet.Referer)]
          CoalesceBatchesExec: target_batch_size=8192
            FilterExec: Referer@0 !=
              DataSourceExec: file_groups={1 group: [[Users/pepijn/RustroverProjects/datafusion/benchmarks/data/hits.parquet:0..14779976446]]}, projection=[Referer], file_type=parquet, predicate=Referer@14 != , pruning_predicate=Referer_null_count@2 != row_count@3 AND (Referer_min@0 !=  OR  != Referer_max@1), required_guarantees=[Referer not in ()]

@pepijnve pepijnve force-pushed the issue_16318 branch 2 times, most recently from 7828d0e to 0e4fbdb Compare June 10, 2025 08:44
@Dandandan
Contributor

In what situations would these changes lead to better performance?
I.e. why is query 28 ~1.10x faster?

@Dandandan
Contributor

(Or is it just benchmark noise?)

@pepijnve
Contributor Author

In what situations would these changes lead to better performance? I.e. why is query 28 ~1.10x faster?

The jury is still out on whether it makes sense or not. I can explain my theoretical reasoning. Apologies up front if I'm writing too pedantically. Just trying to explain things as clearly as I can. Not a database guy, so this may sound hopelessly naive.

The first observation is that yielding all the way to the runtime in Tokio requires stack unwinding. That's the nature of stackless tasks. The deeper your call stack is, the more call frames you need to unwind and the more calls you'll need to redo to get back to the yield point. I've been trying to find information on whether the Rust compiler does some magic to avoid this, but as far as I can tell that's not the case. I did find hints that it optimizes nested async function calls, but it will not do so for nested dyn Stream poll_next calls. Makes sense; an ahead-of-time compiler will typically not be able to optimize across virtual function calls.
The consequence is that yielding to the runtime can have a non-trivial cost. The other PR you're reviewing is an extreme example of that.

Second observation is that DataFusion's volcano model naturally leads to fairly deep call stacks. The tree of execution plans results in a tree of streams and a parent stream's poll_next will often directly call poll_next on a child. If you get one of these deep call stacks, yielding from the deepest point potentially means unwinding the whole thing and coming back. This is mitigated a bit already when volcano breaking operators like repartition are present in the plan. The deepest call stacks are seen when running with target_partitions = 1.
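To make the cost argument concrete, here is a rough sketch using only the standard library, not DataFusion code. `Leaf`, `Layer` and `total_poll_calls` are hypothetical names, and the depth and yield counts are arbitrary. A leaf future yields a fixed number of times, and each wrapper forwards `poll` to its child through a `dyn Future`, similar in spirit to a parent stream calling `poll_next` on a child. Every yield returns through all layers and the next poll descends through all of them again.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; the loop below simply polls again.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Deepest operator: returns `Pending` `remaining` times, then completes.
struct Leaf {
    remaining: usize,
    polls: Rc<Cell<usize>>,
}

impl Future for Leaf {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls.set(self.polls.get() + 1);
        if self.remaining == 0 {
            return Poll::Ready(());
        }
        self.remaining -= 1;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Parent operator: forwards every poll to its child via a virtual call.
struct Layer {
    child: Pin<Box<dyn Future<Output = ()>>>,
    polls: Rc<Cell<usize>>,
}

impl Future for Layer {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls.set(self.polls.get() + 1);
        self.child.as_mut().poll(cx)
    }
}

/// Counts every `poll` invocation across all layers until completion.
fn total_poll_calls(depth: usize, yields: usize) -> usize {
    let polls = Rc::new(Cell::new(0));
    let mut fut: Pin<Box<dyn Future<Output = ()>>> = Box::pin(Leaf {
        remaining: yields,
        polls: Rc::clone(&polls),
    });
    for _ in 0..depth {
        fut = Box::pin(Layer { child: fut, polls: Rc::clone(&polls) });
    }
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while fut.as_mut().poll(&mut cx).is_pending() {}
    polls.get()
}

fn main() {
    // 11 wrappers around the leaf, 100 yields: (11 + 1) * (100 + 1) calls.
    println!("nested:  {}", total_poll_calls(11, 100));
    println!("shallow: {}", total_poll_calls(0, 100));
}
```

The count grows as (depth + 1) × (yields + 1). Whether that overhead is measurable next to the real work of an operator is exactly what the benchmarks in this thread are trying to establish.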

Third, pipeline breaking operators are intrinsically two-phase. First they collect, then they emit. There's a gray area of course, but I'm talking about the classic ones like single aggregation. While a pipeline breaking stream is in its collect phase, it can be 100% sure that it will not have any new data for the poll_next caller until that phase completes. There's really not much point in telling the caller Poll::Pending over and over again, because that leads to busy waiting. But you do still want to yield to the runtime periodically to not squat the Tokio executor threads.
So there are situations where there are potentially long phases where any yield to the caller is redundant (there's no new info), but you still need to yield for cooperative scheduling.
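The "wake once when the collect phase is done" idea can be sketched with the standard library alone. This is an illustration under loud assumptions, not the PR's implementation: a plain `std::thread` stands in for a Tokio spawned task, `block_on` is a tiny hand-written executor, and `Slot`, `Prepared` and `prepare_and_count_polls` are hypothetical names. The consumer is polled once to start the work and once more after the single wake-up.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Shared slot: the worker stores its result here and wakes the consumer.
#[derive(Default)]
struct Slot {
    result: Option<u64>,
    waker: Option<Waker>,
}

/// Future whose blocking phase runs elsewhere (a thread, as a stand-in).
struct Prepared {
    slot: Arc<Mutex<Slot>>,
    started: bool,
    polls: Arc<AtomicUsize>,
}

impl Future for Prepared {
    type Output = u64;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        self.polls.fetch_add(1, Ordering::SeqCst);
        let shared = Arc::clone(&self.slot);
        let mut slot = shared.lock().unwrap();
        if let Some(value) = slot.result.take() {
            return Poll::Ready(value);
        }
        // Register the waker before the worker can possibly finish.
        slot.waker = Some(cx.waker().clone());
        drop(slot);
        if !self.started {
            self.started = true;
            // Deferred start: the worker is only launched on first poll.
            thread::spawn(move || {
                let value: u64 = (1..=1_000u64).sum(); // stand-in for sort/build
                let waker = {
                    let mut slot = shared.lock().unwrap();
                    slot.result = Some(value);
                    slot.waker.take()
                };
                if let Some(waker) = waker {
                    waker.wake();
                }
            });
        }
        Poll::Pending
    }
}

/// Wakes the executor thread and records that the wake-up was a real one.
struct ThreadWake {
    thread: Thread,
    woken: AtomicBool,
}

impl Wake for ThreadWake {
    fn wake(self: Arc<Self>) {
        self.woken.store(true, Ordering::SeqCst);
        self.thread.unpark();
    }
}

/// Minimal executor: polls only after an actual wake, never in a busy loop.
fn block_on<F: Future>(fut: F) -> F::Output {
    let wake = Arc::new(ThreadWake {
        thread: thread::current(),
        woken: AtomicBool::new(false),
    });
    let waker = Waker::from(Arc::clone(&wake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        while !wake.woken.swap(false, Ordering::SeqCst) {
            thread::park();
        }
    }
}

/// Returns the computed value and how often the consumer was polled.
fn prepare_and_count_polls() -> (u64, usize) {
    let polls = Arc::new(AtomicUsize::new(0));
    let fut = Prepared {
        slot: Arc::default(),
        started: false,
        polls: Arc::clone(&polls),
    };
    let value = block_on(fut);
    (value, polls.load(Ordering::SeqCst))
}

fn main() {
    let (value, polls) = prepare_and_count_polls();
    println!("value = {value}, consumer polled {polls} times");
}
```

In this sketch the worker never needs to yield. In a cooperative runtime the spawned task would still yield periodically, but those yields would stay inside the task instead of propagating to the waiting consumer.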

Combining all that, I think you're looking for deep query plans with nested pipeline breakers. In a different PR someone pointed me to this query


The nested sorts in the physical plan are something of a worst case scenario. At the deepest sort you have a 12 level deep call stack that gets reactivated for every yield. If instead we chop this into 6 chained spawned tasks, you get 6 much shallower call stacks. Of those tasks only one will be active; the other ones will be inert until there's actually something to do.

A second factor can be the data source. Filesystem streams tend to be always ready, others may not. The more the source returns pending the more you'll see the overhead described above show up I think.

All of this assumes of course that going up and back down the call stack has a non-trivial cost. Perhaps it's not significant enough to be measurable. I'm still figuring out how to best profile this stuff, so I'm afraid I don't have anything more fact-based to give you yet.

Besides performance there's a small aesthetic aspect to this. I find that a stream that responds with
pending <wake> ready ready ready none
is more elegant than
pending pending pending pending ... pending ready ready ready none
The first one abstracts what's going on underneath better than the latter. But I understand that raw performance trumps aesthetics here.

@pepijnve
Contributor Author

I think I'll put this one in draft for now. The benchmark results say "needs more work and performance analysis" to me.

Comparing baseline and branch
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃    baseline ┃      branch ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2827.60 ms │  2861.15 ms │ no change │
│ QQuery 1     │  1275.24 ms │  1271.33 ms │ no change │
│ QQuery 2     │  2529.29 ms │  2542.19 ms │ no change │
│ QQuery 3     │  1082.87 ms │  1105.09 ms │ no change │
│ QQuery 4     │  2774.37 ms │  2834.65 ms │ no change │
│ QQuery 5     │ 32714.77 ms │ 32948.06 ms │ no change │
│ QQuery 6     │  3662.78 ms │  3610.34 ms │ no change │
│ QQuery 7     │  4411.89 ms │  4412.35 ms │ no change │
│ QQuery 8     │  1622.98 ms │  1641.82 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 52901.80ms │
│ Total Time (branch)     │ 53226.99ms │
│ Average Time (baseline) │  5877.98ms │
│ Average Time (branch)   │  5914.11ms │
│ Queries Faster          │          0 │
│ Queries Slower          │          0 │
│ Queries with No Change  │          9 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃    baseline ┃      branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    29.01 ms │    28.68 ms │     no change │
│ QQuery 1     │    74.78 ms │    69.50 ms │ +1.08x faster │
│ QQuery 2     │   159.91 ms │   167.57 ms │     no change │
│ QQuery 3     │   183.96 ms │   188.96 ms │     no change │
│ QQuery 4     │  1132.22 ms │  1143.89 ms │     no change │
│ QQuery 5     │  1439.94 ms │  1548.88 ms │  1.08x slower │
│ QQuery 6     │    46.06 ms │    43.16 ms │ +1.07x faster │
│ QQuery 7     │    86.63 ms │    86.83 ms │     no change │
│ QQuery 8     │  1840.94 ms │  1836.42 ms │     no change │
│ QQuery 9     │  2142.35 ms │  2141.25 ms │     no change │
│ QQuery 10    │   562.25 ms │   551.77 ms │     no change │
│ QQuery 11    │   612.86 ms │   647.94 ms │  1.06x slower │
│ QQuery 12    │  1692.83 ms │  1651.23 ms │     no change │
│ QQuery 13    │  2755.19 ms │  2699.53 ms │     no change │
│ QQuery 14    │  1571.17 ms │  1620.12 ms │     no change │
│ QQuery 15    │  1619.18 ms │  1649.79 ms │     no change │
│ QQuery 16    │  3235.65 ms │  3235.25 ms │     no change │
│ QQuery 17    │  2820.07 ms │  2857.36 ms │     no change │
│ QQuery 18    │  6210.01 ms │  5920.66 ms │     no change │
│ QQuery 19    │   159.65 ms │   158.77 ms │     no change │
│ QQuery 20    │  2013.41 ms │  1929.52 ms │     no change │
│ QQuery 21    │  2378.85 ms │  2307.42 ms │     no change │
│ QQuery 22    │  4003.49 ms │  4006.98 ms │     no change │
│ QQuery 23    │ 15582.81 ms │ 15369.75 ms │     no change │
│ QQuery 24    │   911.63 ms │   875.75 ms │     no change │
│ QQuery 25    │   749.19 ms │   748.95 ms │     no change │
│ QQuery 26    │  1002.11 ms │   990.34 ms │     no change │
│ QQuery 27    │  2852.13 ms │  2868.56 ms │     no change │
│ QQuery 28    │ 22829.31 ms │ 22958.81 ms │     no change │
│ QQuery 29    │  1192.45 ms │  1191.21 ms │     no change │
│ QQuery 30    │  1658.09 ms │  1697.88 ms │     no change │
│ QQuery 31    │  1727.92 ms │  1733.49 ms │     no change │
│ QQuery 32    │  5707.58 ms │  5705.04 ms │     no change │
│ QQuery 33    │  6534.26 ms │  6312.33 ms │     no change │
│ QQuery 34    │  6814.26 ms │  7162.10 ms │  1.05x slower │
│ QQuery 35    │  2301.72 ms │  2423.41 ms │  1.05x slower │
│ QQuery 36    │   168.31 ms │   162.38 ms │     no change │
│ QQuery 37    │    88.48 ms │    87.41 ms │     no change │
│ QQuery 38    │   178.01 ms │   175.14 ms │     no change │
│ QQuery 39    │   281.96 ms │   271.88 ms │     no change │
│ QQuery 40    │    76.88 ms │    74.45 ms │     no change │
│ QQuery 41    │    68.88 ms │    72.43 ms │  1.05x slower │
│ QQuery 42    │    63.90 ms │    68.55 ms │  1.07x slower │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 107560.26ms │
│ Total Time (branch)     │ 107441.36ms │
│ Average Time (baseline) │   2501.40ms │
│ Average Time (branch)   │   2498.64ms │
│ Queries Faster          │           2 │
│ Queries Slower          │           6 │
│ Queries with No Change  │          35 │
│ Queries with Failure    │           0 │
└─────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  baseline ┃    branch ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 220.77 ms │ 237.23 ms │ 1.07x slower │
│ QQuery 2     │  40.18 ms │  42.72 ms │ 1.06x slower │
│ QQuery 3     │  55.84 ms │  71.40 ms │ 1.28x slower │
│ QQuery 4     │  35.93 ms │  40.89 ms │ 1.14x slower │
│ QQuery 5     │  92.56 ms │ 117.69 ms │ 1.27x slower │
│ QQuery 6     │  21.60 ms │  26.50 ms │ 1.23x slower │
│ QQuery 7     │ 180.65 ms │ 291.03 ms │ 1.61x slower │
│ QQuery 8     │  39.94 ms │  44.46 ms │ 1.11x slower │
│ QQuery 9     │ 118.20 ms │ 162.73 ms │ 1.38x slower │
│ QQuery 10    │  78.63 ms │ 117.56 ms │ 1.50x slower │
│ QQuery 11    │  17.29 ms │  19.05 ms │ 1.10x slower │
│ QQuery 12    │  89.51 ms │ 108.56 ms │ 1.21x slower │
│ QQuery 13    │  59.26 ms │  66.55 ms │ 1.12x slower │
│ QQuery 14    │  17.51 ms │  21.44 ms │ 1.22x slower │
│ QQuery 15    │  34.14 ms │  40.69 ms │ 1.19x slower │
│ QQuery 16    │  39.74 ms │  39.22 ms │    no change │
│ QQuery 17    │ 146.47 ms │ 145.61 ms │    no change │
│ QQuery 18    │ 430.93 ms │ 423.05 ms │    no change │
│ QQuery 19    │  52.45 ms │  53.23 ms │    no change │
│ QQuery 20    │  74.50 ms │  79.03 ms │ 1.06x slower │
│ QQuery 21    │ 286.87 ms │ 307.48 ms │ 1.07x slower │
│ QQuery 22    │  42.56 ms │  42.50 ms │    no change │
└──────────────┴───────────┴───────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (baseline)   │ 2175.53ms │
│ Total Time (branch)     │ 2498.61ms │
│ Average Time (baseline) │   98.89ms │
│ Average Time (branch)   │  113.57ms │
│ Queries Faster          │         0 │
│ Queries Slower          │        17 │
│ Queries with No Change  │         5 │
│ Queries with Failure    │         0 │
└─────────────────────────┴───────────┘

@pepijnve pepijnve marked this pull request as draft June 10, 2025 13:08
@pepijnve
Contributor Author

Googling a bit I'm starting to get the impression that I shouldn't be thinking Tokio tasks are as lightweight as coroutines in some other ecosystems.

@pepijnve
Contributor Author

#16357 might be relevant here. I was testing on an old system with spinning disks. Going to retest with more iterations and this change to make sure I'm not measuring noise.

@pepijnve
Contributor Author

pepijnve commented Jun 11, 2025

@alamb @Dandandan I started a benchmark run with 50 iterations and the TPCH benchmark change that eliminates local filesystem access yesterday evening. Checked the results this morning... 😌 Would be great if someone else could confirm.

If you would like to reproduce, this was
baseline b41acf3c43ff259a22f9ecc2abdb17db58297fd8
vs
branch 91f2d75ced0ebd17e20006c9ad0cd75403261e3b
with the following patch applied to bench.sh https://gist.github.com/pepijnve/c5498e4762730bd68a2f6b188ed20f45

Long story short, in my environment at least, the 5 iterations setting the bench.sh script uses is just too little. Especially in the first 5-10 iterations I see way too much variability in the runs for it to be useful. It stabilizes later on.

I wonder if it would be a good idea to modify the benchmark code to always do a number of warmup iterations before we actually start measuring.
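As a rough illustration of the warmup idea, here is a small standard-library sketch. It is not the DataFusion benchmark code: `measure` and `summarize` are hypothetical helpers, the closure stands in for running a query, and the warmup count of 10 is an arbitrary assumption (only the 50 measured iterations come from the run described above).

```rust
use std::time::{Duration, Instant};

/// Runs `warmup` unmeasured iterations, then times `iterations` measured ones.
fn measure<F: FnMut()>(warmup: usize, iterations: usize, mut run_query: F) -> Vec<Duration> {
    for _ in 0..warmup {
        run_query();
    }
    (0..iterations)
        .map(|_| {
            let start = Instant::now();
            run_query();
            start.elapsed()
        })
        .collect()
}

/// Minimum, maximum and mean of the samples in milliseconds.
fn summarize(samples: &[Duration]) -> Option<(f64, f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let ms: Vec<f64> = samples.iter().map(|d| d.as_secs_f64() * 1e3).collect();
    let min = ms.iter().copied().fold(f64::INFINITY, f64::min);
    let max = ms.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let mean = ms.iter().sum::<f64>() / ms.len() as f64;
    Some((min, max, mean))
}

fn main() {
    // Placeholder workload standing in for a benchmark query.
    let samples = measure(10, 50, || {
        std::hint::black_box((0..10_000u64).sum::<u64>());
    });
    if let Some((min, max, mean)) = summarize(&samples) {
        println!("min {min:.3} ms, max {max:.3} ms, mean {mean:.3} ms");
    }
}
```

Keeping the warmup iterations out of the sample vector means the unstable first runs never reach the comparison, at the price of a longer total benchmark time.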

Comparing baseline and branch
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃    baseline ┃      branch ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2841.73 ms │  2868.70 ms │ no change │
│ QQuery 1     │  1254.60 ms │  1259.50 ms │ no change │
│ QQuery 2     │  2444.16 ms │  2417.82 ms │ no change │
│ QQuery 3     │  1075.16 ms │  1046.80 ms │ no change │
│ QQuery 4     │  2759.06 ms │  2800.21 ms │ no change │
│ QQuery 5     │ 32991.04 ms │ 33376.96 ms │ no change │
│ QQuery 6     │  3591.27 ms │  3577.60 ms │ no change │
│ QQuery 7     │  4174.48 ms │  4260.27 ms │ no change │
│ QQuery 8     │  1562.72 ms │  1541.47 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 52694.22ms │
│ Total Time (branch)     │ 53149.33ms │
│ Average Time (baseline) │  5854.91ms │
│ Average Time (branch)   │  5905.48ms │
│ Queries Faster          │          0 │
│ Queries Slower          │          0 │
│ Queries with No Change  │          9 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃    baseline ┃      branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    28.47 ms │    28.17 ms │     no change │
│ QQuery 1     │    66.76 ms │    68.23 ms │     no change │
│ QQuery 2     │   158.80 ms │   158.78 ms │     no change │
│ QQuery 3     │   156.06 ms │   147.87 ms │ +1.06x faster │
│ QQuery 4     │  1169.02 ms │  1170.60 ms │     no change │
│ QQuery 5     │  1498.30 ms │  1546.96 ms │     no change │
│ QQuery 6     │    41.48 ms │    41.42 ms │     no change │
│ QQuery 7     │    80.51 ms │    79.37 ms │     no change │
│ QQuery 8     │  1831.67 ms │  1817.91 ms │     no change │
│ QQuery 9     │  2152.33 ms │  2190.01 ms │     no change │
│ QQuery 10    │   531.87 ms │   545.53 ms │     no change │
│ QQuery 11    │   592.01 ms │   603.65 ms │     no change │
│ QQuery 12    │  1651.35 ms │  1679.20 ms │     no change │
│ QQuery 13    │  2783.70 ms │  2847.50 ms │     no change │
│ QQuery 14    │  1589.41 ms │  1597.90 ms │     no change │
│ QQuery 15    │  1620.37 ms │  1629.00 ms │     no change │
│ QQuery 16    │  3332.63 ms │  3354.37 ms │     no change │
│ QQuery 17    │  2864.46 ms │  2850.38 ms │     no change │
│ QQuery 18    │  6226.51 ms │  5993.04 ms │     no change │
│ QQuery 19    │   138.61 ms │   140.21 ms │     no change │
│ QQuery 20    │  1892.60 ms │  1910.23 ms │     no change │
│ QQuery 21    │  2292.32 ms │  2325.51 ms │     no change │
│ QQuery 22    │  3944.51 ms │  4020.94 ms │     no change │
│ QQuery 23    │ 15570.73 ms │ 15691.75 ms │     no change │
│ QQuery 24    │   862.29 ms │   875.83 ms │     no change │
│ QQuery 25    │   727.49 ms │   721.70 ms │     no change │
│ QQuery 26    │   975.89 ms │   986.73 ms │     no change │
│ QQuery 27    │  2697.31 ms │  2769.85 ms │     no change │
│ QQuery 28    │ 22911.95 ms │ 22695.04 ms │     no change │
│ QQuery 29    │  1172.68 ms │  1173.51 ms │     no change │
│ QQuery 30    │  1634.30 ms │  1637.76 ms │     no change │
│ QQuery 31    │  1718.66 ms │  1717.93 ms │     no change │
│ QQuery 32    │  5527.03 ms │  5810.12 ms │  1.05x slower │
│ QQuery 33    │  6763.33 ms │  6654.45 ms │     no change │
│ QQuery 34    │  7016.63 ms │  6971.15 ms │     no change │
│ QQuery 35    │  2308.46 ms │  2308.33 ms │     no change │
│ QQuery 36    │   167.09 ms │   170.41 ms │     no change │
│ QQuery 37    │    86.96 ms │    87.12 ms │     no change │
│ QQuery 38    │   167.31 ms │   172.42 ms │     no change │
│ QQuery 39    │   268.56 ms │   269.12 ms │     no change │
│ QQuery 40    │    75.72 ms │    76.78 ms │     no change │
│ QQuery 41    │    73.17 ms │    69.69 ms │     no change │
│ QQuery 42    │    61.68 ms │    59.77 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 107430.99ms │
│ Total Time (branch)     │ 107666.23ms │
│ Average Time (baseline) │   2498.40ms │
│ Average Time (branch)   │   2503.87ms │
│ Queries Faster          │           1 │
│ Queries Slower          │           1 │
│ Queries with No Change  │          41 │
│ Queries with Failure    │           0 │
└─────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  baseline ┃    branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 253.93 ms │ 213.68 ms │ +1.19x faster │
│ QQuery 2     │  41.31 ms │  35.04 ms │ +1.18x faster │
│ QQuery 3     │  70.88 ms │  56.32 ms │ +1.26x faster │
│ QQuery 4     │  39.62 ms │  35.32 ms │ +1.12x faster │
│ QQuery 5     │ 114.89 ms │  89.39 ms │ +1.29x faster │
│ QQuery 6     │  26.55 ms │  20.38 ms │ +1.30x faster │
│ QQuery 7     │ 190.78 ms │ 175.94 ms │ +1.08x faster │
│ QQuery 8     │  39.49 ms │  39.91 ms │     no change │
│ QQuery 9     │ 113.43 ms │ 113.57 ms │     no change │
│ QQuery 10    │  82.14 ms │  78.50 ms │     no change │
│ QQuery 11    │  18.26 ms │  17.51 ms │     no change │
│ QQuery 12    │  90.59 ms │  93.53 ms │     no change │
│ QQuery 13    │  55.18 ms │  51.19 ms │ +1.08x faster │
│ QQuery 14    │  17.07 ms │  15.67 ms │ +1.09x faster │
│ QQuery 15    │  35.12 ms │  33.25 ms │ +1.06x faster │
│ QQuery 16    │  38.82 ms │  37.71 ms │     no change │
│ QQuery 17    │ 143.93 ms │ 142.22 ms │     no change │
│ QQuery 18    │ 417.37 ms │ 442.36 ms │  1.06x slower │
│ QQuery 19    │  43.16 ms │  43.78 ms │     no change │
│ QQuery 20    │  71.70 ms │  72.96 ms │     no change │
│ QQuery 21    │ 297.31 ms │ 296.28 ms │     no change │
│ QQuery 22    │  41.19 ms │  41.35 ms │     no change │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (baseline)   │ 2242.73ms │
│ Total Time (branch)     │ 2145.86ms │
│ Average Time (baseline) │  101.94ms │
│ Average Time (branch)   │   97.54ms │
│ Queries Faster          │        10 │
│ Queries Slower          │         1 │
│ Queries with No Change  │        11 │
│ Queries with Failure    │         0 │
└─────────────────────────┴───────────┘

@pepijnve
Contributor Author

pepijnve commented Jun 11, 2025

I did a second identical run because those results seemed just too good to be true to me. This is much closer to what I was expecting: more or less status quo. That does mean I'm back to getting wildly differing results which I can't really explain. This is on an old repurposed PowerEdge R730 running an Ubuntu 24.04 VM on ESXi. It's the only VM on the machine, so it can't be noisy neighbors. Maybe the hardware is just starting to get flaky.

Fastest time comparison

Comparing baseline and branch
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃    baseline ┃      branch ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2802.58 ms │  2835.58 ms │ no change │
│ QQuery 1     │  1267.40 ms │  1254.16 ms │ no change │
│ QQuery 2     │  2425.18 ms │  2431.64 ms │ no change │
│ QQuery 3     │  1049.51 ms │  1096.84 ms │ no change │
│ QQuery 4     │  2799.22 ms │  2812.48 ms │ no change │
│ QQuery 5     │ 33146.06 ms │ 32438.14 ms │ no change │
│ QQuery 6     │  3581.51 ms │  3608.01 ms │ no change │
│ QQuery 7     │  4244.94 ms │  4316.87 ms │ no change │
│ QQuery 8     │  1549.39 ms │  1566.28 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 52865.81ms │
│ Total Time (branch)     │ 52360.00ms │
│ Average Time (baseline) │  5873.98ms │
│ Average Time (branch)   │  5817.78ms │
│ Queries Faster          │          0 │
│ Queries Slower          │          0 │
│ Queries with No Change  │          9 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃    baseline ┃      branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    28.51 ms │    28.65 ms │     no change │
│ QQuery 1     │    66.92 ms │    67.47 ms │     no change │
│ QQuery 2     │   157.60 ms │   158.95 ms │     no change │
│ QQuery 3     │   149.93 ms │   149.32 ms │     no change │
│ QQuery 4     │  1168.56 ms │  1192.96 ms │     no change │
│ QQuery 5     │  1513.84 ms │  1522.57 ms │     no change │
│ QQuery 6     │    40.48 ms │    41.31 ms │     no change │
│ QQuery 7     │    82.32 ms │    77.97 ms │ +1.06x faster │
│ QQuery 8     │  1821.87 ms │  1835.17 ms │     no change │
│ QQuery 9     │  2179.03 ms │  2154.90 ms │     no change │
│ QQuery 10    │   535.16 ms │   531.71 ms │     no change │
│ QQuery 11    │   604.14 ms │   595.35 ms │     no change │
│ QQuery 12    │  1697.18 ms │  1636.90 ms │     no change │
│ QQuery 13    │  2807.76 ms │  2802.55 ms │     no change │
│ QQuery 14    │  1569.36 ms │  1613.79 ms │     no change │
│ QQuery 15    │  1627.10 ms │  1627.11 ms │     no change │
│ QQuery 16    │  3269.69 ms │  3315.99 ms │     no change │
│ QQuery 17    │  2856.85 ms │  2846.55 ms │     no change │
│ QQuery 18    │  6091.55 ms │  5958.27 ms │     no change │
│ QQuery 19    │   140.87 ms │   138.93 ms │     no change │
│ QQuery 20    │  1887.14 ms │  1893.19 ms │     no change │
│ QQuery 21    │  2269.57 ms │  2281.36 ms │     no change │
│ QQuery 22    │  3931.31 ms │  3954.27 ms │     no change │
│ QQuery 23    │ 15449.02 ms │ 15582.96 ms │     no change │
│ QQuery 24    │   854.06 ms │   869.83 ms │     no change │
│ QQuery 25    │   737.82 ms │   728.33 ms │     no change │
│ QQuery 26    │   958.68 ms │   991.79 ms │     no change │
│ QQuery 27    │  2704.35 ms │  2695.40 ms │     no change │
│ QQuery 28    │ 22518.40 ms │ 22488.05 ms │     no change │
│ QQuery 29    │  1172.40 ms │  1178.97 ms │     no change │
│ QQuery 30    │  1647.91 ms │  1643.26 ms │     no change │
│ QQuery 31    │  1716.99 ms │  1726.10 ms │     no change │
│ QQuery 32    │  5534.67 ms │  5685.19 ms │     no change │
│ QQuery 33    │  6824.90 ms │  6887.88 ms │     no change │
│ QQuery 34    │  7005.46 ms │  7040.82 ms │     no change │
│ QQuery 35    │  2293.87 ms │  2313.10 ms │     no change │
│ QQuery 36    │   165.05 ms │   163.79 ms │     no change │
│ QQuery 37    │    87.79 ms │    84.71 ms │     no change │
│ QQuery 38    │   165.88 ms │   171.22 ms │     no change │
│ QQuery 39    │   272.32 ms │   275.14 ms │     no change │
│ QQuery 40    │    77.26 ms │    78.16 ms │     no change │
│ QQuery 41    │    67.84 ms │    67.96 ms │     no change │
│ QQuery 42    │    61.56 ms │    60.86 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 106812.96ms │
│ Total Time (branch)     │ 107158.75ms │
│ Average Time (baseline) │   2484.02ms │
│ Average Time (branch)   │   2492.06ms │
│ Queries Faster          │           1 │
│ Queries Slower          │           0 │
│ Queries with No Change  │          42 │
│ Queries with Failure    │           0 │
└─────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  baseline ┃    branch ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 213.35 ms │ 202.86 ms │    no change │
│ QQuery 2     │  35.88 ms │  36.49 ms │    no change │
│ QQuery 3     │  56.60 ms │  54.00 ms │    no change │
│ QQuery 4     │  35.63 ms │  35.54 ms │    no change │
│ QQuery 5     │  90.15 ms │  87.55 ms │    no change │
│ QQuery 6     │  20.33 ms │  20.00 ms │    no change │
│ QQuery 7     │ 192.07 ms │ 185.72 ms │    no change │
│ QQuery 8     │  38.70 ms │  39.42 ms │    no change │
│ QQuery 9     │ 115.00 ms │ 114.61 ms │    no change │
│ QQuery 10    │  77.92 ms │  79.01 ms │    no change │
│ QQuery 11    │  17.05 ms │  17.88 ms │    no change │
│ QQuery 12    │  91.61 ms │  92.90 ms │    no change │
│ QQuery 13    │  54.55 ms │  52.40 ms │    no change │
│ QQuery 14    │  15.72 ms │  16.13 ms │    no change │
│ QQuery 15    │  33.06 ms │  32.31 ms │    no change │
│ QQuery 16    │  35.86 ms │  38.55 ms │ 1.07x slower │
│ QQuery 17    │ 147.53 ms │ 142.63 ms │    no change │
│ QQuery 18    │ 461.84 ms │ 468.67 ms │    no change │
│ QQuery 19    │  43.45 ms │  43.48 ms │    no change │
│ QQuery 20    │  71.49 ms │  73.57 ms │    no change │
│ QQuery 21    │ 298.95 ms │ 284.75 ms │    no change │
│ QQuery 22    │  41.49 ms │  41.23 ms │    no change │
└──────────────┴───────────┴───────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (baseline)   │ 2188.25ms │
│ Total Time (branch)     │ 2159.69ms │
│ Average Time (baseline) │   99.47ms │
│ Average Time (branch)   │   98.17ms │
│ Queries Faster          │         0 │
│ Queries Slower          │         1 │
│ Queries with No Change  │        21 │
│ Queries with Failure    │         0 │
└─────────────────────────┴───────────┘

Average time comparison

Comparing baseline and branch
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃                                  baseline ┃                                    branch ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │    2802.58 / 2963.09 ±251.63 / 4691.94 ms │     2835.58 / 2934.12 ±47.94 / 3064.52 ms │ no change │
│ QQuery 1     │    1267.40 / 1414.57 ±262.53 / 2614.65 ms │    1254.16 / 1354.54 ±131.10 / 1709.84 ms │ no change │
│ QQuery 2     │    2425.18 / 2631.18 ±177.69 / 3527.31 ms │    2431.64 / 2659.94 ±160.09 / 3378.51 ms │ no change │
│ QQuery 3     │    1049.51 / 1198.24 ±291.07 / 3179.01 ms │     1096.84 / 1214.67 ±92.19 / 1556.04 ms │ no change │
│ QQuery 4     │    2799.22 / 3020.58 ±678.96 / 7729.45 ms │     2812.48 / 2930.29 ±68.70 / 3063.49 ms │ no change │
│ QQuery 5     │ 33146.06 / 34617.23 ±911.91 / 37087.63 ms │ 32438.14 / 33904.73 ±827.35 / 36644.43 ms │ no change │
│ QQuery 6     │  3581.51 / 3868.17 ±1263.59 / 12702.02 ms │    3608.01 / 3826.89 ±666.62 / 8443.93 ms │ no change │
│ QQuery 7     │    4244.94 / 4507.18 ±155.52 / 4910.67 ms │    4316.87 / 4619.75 ±230.84 / 5297.05 ms │ no change │
│ QQuery 8     │    1549.39 / 1685.38 ±166.38 / 2652.95 ms │     1566.28 / 1674.93 ±87.68 / 2071.30 ms │ no change │
└──────────────┴───────────────────────────────────────────┴───────────────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 55905.60ms │
│ Total Time (branch)     │ 55119.86ms │
│ Average Time (baseline) │  6211.73ms │
│ Average Time (branch)   │  6124.43ms │
│ Queries Faster          │          0 │
│ Queries Slower          │          0 │
│ Queries with No Change  │          9 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃                                   baseline ┃                                     branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │             28.51 / 29.89 ±1.27 / 36.56 ms │           28.65 / 31.36 ±10.27 / 103.01 ms │     no change │
│ QQuery 1     │             66.92 / 70.21 ±2.25 / 80.92 ms │            67.47 / 71.60 ±6.13 / 112.58 ms │     no change │
│ QQuery 2     │          157.60 / 165.04 ±4.49 / 182.03 ms │          158.95 / 165.73 ±4.68 / 179.32 ms │     no change │
│ QQuery 3     │         149.93 / 174.33 ±74.25 / 688.45 ms │         149.32 / 174.21 ±36.04 / 407.10 ms │     no change │
│ QQuery 4     │      1168.56 / 1199.94 ±16.32 / 1257.49 ms │      1192.96 / 1236.15 ±38.04 / 1470.41 ms │     no change │
│ QQuery 5     │      1513.84 / 1633.65 ±47.17 / 1790.47 ms │      1522.57 / 1644.26 ±68.90 / 1955.04 ms │     no change │
│ QQuery 6     │             40.48 / 43.91 ±5.54 / 81.38 ms │             41.31 / 44.17 ±4.56 / 72.76 ms │     no change │
│ QQuery 7     │            82.32 / 86.93 ±4.39 / 111.41 ms │             77.97 / 87.38 ±3.57 / 95.62 ms │     no change │
│ QQuery 8     │      1821.87 / 1900.89 ±45.92 / 2001.93 ms │      1835.17 / 1898.51 ±39.20 / 1992.75 ms │     no change │
│ QQuery 9     │     2179.03 / 2339.45 ±105.69 / 2647.09 ms │     2154.90 / 2369.20 ±107.72 / 2644.63 ms │     no change │
│ QQuery 10    │         535.16 / 580.19 ±26.31 / 693.60 ms │         531.71 / 571.51 ±20.44 / 631.13 ms │     no change │
│ QQuery 11    │         604.14 / 643.68 ±20.76 / 697.50 ms │         595.35 / 644.37 ±20.77 / 684.46 ms │     no change │
│ QQuery 12    │      1697.18 / 1798.76 ±62.88 / 1931.04 ms │      1636.90 / 1794.33 ±65.68 / 1923.18 ms │     no change │
│ QQuery 13    │     2807.76 / 3041.73 ±226.03 / 3915.84 ms │      2802.55 / 2950.12 ±74.95 / 3140.34 ms │     no change │
│ QQuery 14    │      1569.36 / 1733.54 ±95.18 / 1932.89 ms │      1613.79 / 1737.33 ±67.77 / 1904.11 ms │     no change │
│ QQuery 15    │      1627.10 / 1695.71 ±40.49 / 1826.53 ms │      1627.11 / 1688.73 ±31.56 / 1769.50 ms │     no change │
│ QQuery 16    │     3269.69 / 3469.53 ±102.04 / 3837.76 ms │      3315.99 / 3444.72 ±64.53 / 3608.20 ms │     no change │
│ QQuery 17    │     2856.85 / 2980.43 ±117.00 / 3357.01 ms │      2846.55 / 2926.42 ±53.44 / 3145.16 ms │     no change │
│ QQuery 18    │     6091.55 / 6319.86 ±258.78 / 7569.19 ms │     5958.27 / 6190.21 ±189.22 / 7217.12 ms │     no change │
│ QQuery 19    │         140.87 / 155.08 ±12.33 / 194.49 ms │          138.93 / 153.28 ±7.48 / 181.61 ms │     no change │
│ QQuery 20    │     1887.14 / 2104.13 ±982.87 / 8971.27 ms │      1893.19 / 1949.39 ±31.69 / 2057.54 ms │ +1.08x faster │
│ QQuery 21    │      2269.57 / 2416.94 ±83.76 / 2696.47 ms │     2281.36 / 2428.54 ±127.15 / 2981.81 ms │     no change │
│ QQuery 22    │     3931.31 / 4209.84 ±667.99 / 8797.19 ms │     3954.27 / 4107.10 ±109.18 / 4530.78 ms │     no change │
│ QQuery 23    │ 15449.02 / 16371.15 ±1439.53 / 26073.33 ms │ 15582.96 / 16394.40 ±1437.94 / 23730.98 ms │     no change │
│ QQuery 24    │        854.06 / 924.57 ±33.87 / 1063.17 ms │        869.83 / 930.07 ±38.06 / 1040.90 ms │     no change │
│ QQuery 25    │         737.82 / 792.69 ±38.67 / 920.59 ms │         728.33 / 796.36 ±29.59 / 865.62 ms │     no change │
│ QQuery 26    │       958.68 / 1057.55 ±48.26 / 1142.72 ms │       991.79 / 1050.74 ±38.57 / 1167.79 ms │     no change │
│ QQuery 27    │     2704.35 / 2935.20 ±175.93 / 3555.23 ms │      2695.40 / 2879.40 ±72.44 / 3031.48 ms │     no change │
│ QQuery 28    │  22518.40 / 23686.71 ±730.27 / 26633.39 ms │  22488.05 / 23441.76 ±643.62 / 25994.75 ms │     no change │
│ QQuery 29    │      1172.40 / 1218.68 ±75.71 / 1553.79 ms │      1178.97 / 1200.93 ±25.99 / 1291.04 ms │     no change │
│ QQuery 30    │      1647.91 / 1722.73 ±42.27 / 1813.07 ms │      1643.26 / 1773.18 ±74.69 / 1979.09 ms │     no change │
│ QQuery 31    │      1716.99 / 1786.66 ±43.92 / 1894.28 ms │      1726.10 / 1826.32 ±50.00 / 1950.78 ms │     no change │
│ QQuery 32    │     5534.67 / 5820.60 ±218.84 / 6989.35 ms │     5685.19 / 6000.02 ±400.88 / 8277.02 ms │     no change │
│ QQuery 33    │     6824.90 / 7282.43 ±295.14 / 8580.30 ms │     6887.88 / 7262.52 ±211.72 / 8010.74 ms │     no change │
│ QQuery 34    │     7005.46 / 7299.12 ±188.16 / 8074.88 ms │     7040.82 / 7296.55 ±212.93 / 7995.89 ms │     no change │
│ QQuery 35    │      2293.87 / 2411.59 ±90.03 / 2719.31 ms │      2313.10 / 2434.75 ±78.32 / 2795.40 ms │     no change │
│ QQuery 36    │         165.05 / 186.82 ±13.85 / 241.42 ms │          163.79 / 187.45 ±9.87 / 219.34 ms │     no change │
│ QQuery 37    │            87.79 / 99.94 ±8.32 / 132.57 ms │           84.71 / 101.09 ±6.30 / 113.19 ms │     no change │
│ QQuery 38    │         165.88 / 186.07 ±11.11 / 221.26 ms │          171.22 / 183.28 ±8.06 / 215.94 ms │     no change │
│ QQuery 39    │         272.32 / 299.01 ±17.41 / 386.48 ms │         275.14 / 303.11 ±21.58 / 401.13 ms │     no change │
│ QQuery 40    │           77.26 / 94.63 ±14.71 / 175.02 ms │            78.16 / 93.84 ±8.34 / 117.48 ms │     no change │
│ QQuery 41    │            67.84 / 82.90 ±6.63 / 102.68 ms │            67.96 / 81.97 ±6.47 / 100.79 ms │     no change │
│ QQuery 42    │             61.56 / 69.46 ±5.49 / 93.13 ms │             60.86 / 69.14 ±5.02 / 86.23 ms │     no change │
└──────────────┴────────────────────────────────────────────┴────────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 113122.12ms │
│ Total Time (branch)     │ 112615.51ms │
│ Average Time (baseline) │   2630.75ms │
│ Average Time (branch)   │   2618.97ms │
│ Queries Faster          │           1 │
│ Queries Slower          │           0 │
│ Queries with No Change  │          42 │
│ Queries with Failure    │           0 │
└─────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃                           baseline ┃                             branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 213.35 / 233.36 ±39.56 / 505.52 ms │ 202.86 / 228.40 ±21.40 / 372.45 ms │     no change │
│ QQuery 2     │     35.88 / 42.48 ±5.34 / 73.45 ms │     36.49 / 39.68 ±2.13 / 47.68 ms │ +1.07x faster │
│ QQuery 3     │     56.60 / 61.36 ±2.47 / 67.91 ms │     54.00 / 61.27 ±2.57 / 68.00 ms │     no change │
│ QQuery 4     │     35.63 / 39.46 ±2.45 / 48.66 ms │     35.54 / 40.02 ±2.62 / 46.40 ms │     no change │
│ QQuery 5     │    90.15 / 96.80 ±5.15 / 114.22 ms │    87.55 / 96.56 ±5.00 / 109.80 ms │     no change │
│ QQuery 6     │     20.33 / 21.73 ±3.23 / 42.72 ms │     20.00 / 21.56 ±2.78 / 34.98 ms │     no change │
│ QQuery 7     │ 192.07 / 217.62 ±13.44 / 249.42 ms │ 185.72 / 210.55 ±12.24 / 238.83 ms │     no change │
│ QQuery 8     │     38.70 / 42.91 ±3.50 / 53.08 ms │     39.42 / 41.89 ±2.39 / 52.00 ms │     no change │
│ QQuery 9     │  115.00 / 125.20 ±5.39 / 138.33 ms │  114.61 / 121.78 ±4.36 / 141.53 ms │     no change │
│ QQuery 10    │     77.92 / 86.79 ±3.93 / 98.53 ms │     79.01 / 85.09 ±3.83 / 95.31 ms │     no change │
│ QQuery 11    │     17.05 / 19.27 ±1.34 / 23.65 ms │     17.88 / 20.25 ±1.32 / 24.47 ms │  1.05x slower │
│ QQuery 12    │   91.61 / 103.12 ±5.13 / 117.49 ms │   92.90 / 100.37 ±4.42 / 107.90 ms │     no change │
│ QQuery 13    │     54.55 / 58.94 ±2.66 / 66.49 ms │     52.40 / 57.94 ±2.79 / 65.33 ms │     no change │
│ QQuery 14    │     15.72 / 17.86 ±1.46 / 22.60 ms │     16.13 / 17.94 ±1.84 / 24.08 ms │     no change │
│ QQuery 15    │     33.06 / 35.80 ±2.17 / 48.02 ms │     32.31 / 36.61 ±2.03 / 42.38 ms │     no change │
│ QQuery 16    │     35.86 / 42.12 ±2.67 / 49.40 ms │     38.55 / 43.11 ±2.40 / 47.93 ms │     no change │
│ QQuery 17    │  147.53 / 155.81 ±4.46 / 170.19 ms │  142.63 / 150.01 ±4.00 / 164.66 ms │     no change │
│ QQuery 18    │ 461.84 / 568.46 ±29.48 / 622.19 ms │ 468.67 / 576.62 ±33.56 / 657.56 ms │     no change │
│ QQuery 19    │    43.45 / 47.74 ±8.66 / 103.29 ms │     43.48 / 46.56 ±6.18 / 87.64 ms │     no change │
│ QQuery 20    │     71.49 / 84.43 ±6.07 / 97.98 ms │     73.57 / 82.47 ±4.76 / 94.35 ms │     no change │
│ QQuery 21    │ 298.95 / 327.64 ±13.72 / 361.05 ms │ 284.75 / 319.27 ±14.01 / 364.19 ms │     no change │
│ QQuery 22    │     41.49 / 47.43 ±5.02 / 66.52 ms │     41.23 / 46.38 ±4.95 / 62.58 ms │     no change │
└──────────────┴────────────────────────────────────┴────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (baseline)   │ 2476.32ms │
│ Total Time (branch)     │ 2444.34ms │
│ Average Time (baseline) │  112.56ms │
│ Average Time (branch)   │  111.11ms │
│ Queries Faster          │         1 │
│ Queries Slower          │         1 │
│ Queries with No Change  │        20 │
│ Queries with Failure    │         0 │
└─────────────────────────┴───────────┘

@pepijnve
Contributor Author

🤦‍♂️ That's not very useful now, is it? I need a better machine to test on.

min / avg ±stddev / max for QQuery 20 (clickbench_partitioned):

baseline: 1887.14 / 2104.13 ±982.87 / 8971.27 ms
branch: 1893.19 / 1949.39 ±31.69 / 2057.54 ms
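To illustrate how a few slow iterations can distort these figures, here is a small sketch that computes the same kind of min / mean ± standard deviation / max summary from raw per-iteration timings. The `Summary` type and the function names are hypothetical, and the use of the population standard deviation is an assumption; the comparison script may compute it differently.

```rust
/// Summary statistics over per-iteration timings in milliseconds.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct Summary {
    pub min: f64,
    pub mean: f64,
    pub stddev: f64,
    pub max: f64,
}

/// Computes min, mean, population standard deviation and max.
/// Returns `None` for an empty slice.
pub fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let min = samples.iter().copied().fold(f64::INFINITY, f64::min);
    let max = samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let mean = samples.iter().sum::<f64>() / n;
    // Population variance: mean of squared deviations from the mean.
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some(Summary { min, mean, stddev: variance.sqrt(), max })
}

/// Formats a summary in the "min / mean ±stddev / max ms" layout.
pub fn format_summary(s: &Summary) -> String {
    format!("{:.2} / {:.2} ±{:.2} / {:.2} ms", s.min, s.mean, s.stddev, s.max)
}
```

A single outlier among otherwise stable timings raises the mean, the standard deviation and the max while leaving the min untouched, which is consistent with the baseline figures for this query.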

pepijnve force-pushed the issue_16318 branch 2 times, most recently from c3c7918 to b708c11 on June 22, 2025 09:57
Labels
common (Related to common crate), physical-plan (Changes to the physical-plan crate)
Development

Successfully merging this pull request may close these issues.

Reduce busy-waiting when query contains pipeline blocking operators
3 participants