-
Notifications
You must be signed in to change notification settings - Fork 473
compute: time based linear join yielding #22391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Very excited to try this.
Yes! Absolutely agree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change itself looks fine to me; it is, though, appreciated that this is gated by a feature flag and disabled by default.
One of the risks we should probably get a feeling for is how much consolidation potential is left on the table when we force flushes to happen based on time. An initial evaluation of this risk can be done for example with either the scenario SkewedJoin
of the feature benchmark or the internal repro from which it took inspiration (perhaps scaling the data size with the former is the easiest). It would be interesting to see the total number of messages as well as its distribution among workers. Using unbilled replicas to check the impact on similar scenarios after this initial validation is also a great idea.
Note about the above: In both SkewedJoin
and the internal repro, it's important to have the entire schema created first and only then feed data through persist. This is to get repeatable message counts due to our consistency guarantees.
Good callout! In general I would expect a switch to time-based yielding to (a) reduce throughput and (b) reduce consolidation of the join output. Of course that depends on all sort of things. For example, I assume that with time-based yielding we will yield more often than with work-based yielding, but that doesn't always need to be true. I ran both the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting these additional numbers! It's indeed not guaranteed that we'd yield more often with the time-based policy, only when we are able to "recoup" a lot of fuel due to consolidation (because then the fuel-based strategy can catch more air to run, while time-based imposes a wall). So there should be a relationship between skew, data size, and time-to-yield (e.g., reducing the time-to-yield increases the risk that you hit the wall more often).
I think we have some evidence now, though, that hitting the time-based wall is not so easy. So I'd be fine with merging this PR, but I suggest that we keep to proceeding with caution (leaving this off by default, testing with unbilled replicas, etc).
Recording an insight from a discussion with @antiguru: We should consider yielding by both time and work:
I don't plan to make any changes to this PR, as we need to do some testing anyway, but a follow-up should extend the |
I think we can go ahead with this PR (I'll review in a jiffy), but I don't think we should make this the default just yet. At the moment, the join implementation yields to give downstream operators the chance to consumer their inputs, avoiding OOMs because the join produced too much data. At the same time, the join needs to consume its inputs ASAP because every time it yields, more inputs might appear. Yielding more frequently can cause a different OOM behavior than what we've seen until now. For this reason, I'm not convinced that we should use time-based yielding by default. Besides this, the fact that the join takes a long time to consume its inputs is clearly a problem. Changing the yield behavior mitigates the symptom, but does not solve the problem:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good. Left some comments!
b232bb0
to
6407410
Compare
This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having this change reviewed, adequate tests should be considered and it may be useful to add observability and/or a feature flag. What's This?
Buggy File Hotspots:
|
We should consider all these things, but I think we'll still need time-based yielding (in addition to work-based) because we won't be able to rule out long-running joins entirely. Cross-joins are the easiest way to produce them, but any equi-join will become a cross-join if the data is just skewed enough. So we'd also need to think about preventing data skew in the join inputs. And even if we manage that, a join will still be slow if it the amount of data it has to crunch through is just large enough. |
6407410
to
b667fa4
Compare
This commit provides `mz_join_core` with a yield functions, inspired by `persist_source` and DD's half join operator, that allows the caller to control the yield behavior by time or by amount of work performed.
This commit introduces the plumbing required to allow users of Compute to specify the yielding behavior of linear join operators via `UpdateConfiguration` commands.
This commit adds a new `SystemVar` called `linear_join_yielding` that can be used to control the yielding behavior of linear joins rendered by the compute layer.
b667fa4
to
a8c2cba
Compare
TFTRs! |
This PR adds support for configuring compute rendering so that it employs time-based (rather than work-based) yielding for linear joins implemented by
mz_join_core
.The changes made here include:
mz_join_core
with ayield_fn
that allows configuring the yielding strategy from outside (1st commit).mz_join_core
yielding behavior can be configured throughUpdateConfiguration
commands (2nd commit).linear_join_yielding
, to allow changing the yield spec from LD (3rd commit).So far the default is still the previous behavior of yielding after 1 million updates produced (i.e.
work:1000000
). Once we were able to validate that time-based yielding doesn't produce significant regressions, we can make that the default.Example
As a data point that time-based yielding can improve things, consider the example from MaterializeInc/database-issues#6761:
On my system, the join's duration histogram looks like this:
work:1000000
:time:100
:Motivation
Addresses MaterializeInc/database-issues#6761 and touches MaterializeInc/database-issues#6497.
Tips for reviewer
Changing yielding behavior of operators is always scary because it can have implications on flow control. That's why this PR introduces time-based yielding in the most careful way possible, by gating it behind a system var and keeping it off by default. We need to run further tests to validate the new yielding strategy. I assume that unbilled replicas will be very helpful for this!
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.