compute: time based linear join yielding #22391

teskje · 2023-10-14T13:28:34Z

This PR adds support for configuring compute rendering so that it employs time-based (rather than work-based) yielding for linear joins implemented by mz_join_core.

The changes made here include:

Providing mz_join_core with a yield_fn that allows configuring the yielding strategy from outside (1st commit).
Plumbing a yield spec through compute so the mz_join_core yielding behavior can be configured through UpdateConfiguration commands (2nd commit).
Adding a new system var, linear_join_yielding, to allow changing the yield spec from LaunchDarkly (LD) (3rd commit).

So far the default is still the previous behavior of yielding after producing 1 million updates (i.e. work:1000000). Once we have validated that time-based yielding doesn't produce significant regressions, we can make that the default.
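
To make the spec strings concrete, here is a minimal sketch of how values like work:1000000 or time:100 could be parsed and turned into a yield decision. The names `YieldSpec`, `parse`, and `should_yield` are hypothetical and chosen for illustration; they are not taken from the PR, and the sketch assumes the time value is in milliseconds and that the decision depends only on when the current activation started and how many updates it has produced.

```rust
use std::time::{Duration, Instant};

/// Hypothetical yield specification (illustrative names, not the PR's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum YieldSpec {
    /// Yield once this many updates have been produced.
    ByWork(usize),
    /// Yield once the operator has been running for this long.
    ByTime(Duration),
}

impl YieldSpec {
    /// Parses strings such as `work:1000000` or `time:100`.
    /// Assumption: the `time` value is given in milliseconds.
    pub fn parse(s: &str) -> Result<Self, String> {
        let (kind, value) = s
            .split_once(':')
            .ok_or_else(|| format!("expected `<kind>:<value>`, got `{s}`"))?;
        match kind {
            "work" => value
                .parse::<usize>()
                .map(YieldSpec::ByWork)
                .map_err(|e| format!("invalid work amount `{value}`: {e}")),
            "time" => value
                .parse::<u64>()
                .map(|ms| YieldSpec::ByTime(Duration::from_millis(ms)))
                .map_err(|e| format!("invalid time `{value}`: {e}")),
            other => Err(format!("unknown yield kind `{other}`")),
        }
    }

    /// Returns true if the operator should yield, given the start of the
    /// current activation and the number of updates produced so far.
    pub fn should_yield(&self, start: Instant, work: usize) -> bool {
        match *self {
            YieldSpec::ByWork(limit) => work >= limit,
            YieldSpec::ByTime(limit) => start.elapsed() >= limit,
        }
    }
}
```

An operator loop would record `Instant::now()` when it is scheduled, count the updates it emits, and stop for the current activation as soon as `should_yield` returns true.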

Example

As a data point that time-based yielding can improve things, consider the example from MaterializeInc/database-issues#6761:

CREATE TABLE t (a int);
CREATE MATERIALIZED VIEW mv AS SELECT (t1.a + t2.a) % 2 FROM t t1, t t2;
INSERT INTO t SELECT generate_series(1, 10000);

On my system, the join's duration histogram looks like this:

with work:1000000:

  id  | name |          dataflow_name          | count |    duration     | duration_ns
------+------+---------------------------------+-------+-----------------+-------------
 3126 | Join | Dataflow: materialize.public.mv |     1 | 00:00:34.359738 | 34359738368

with time:100:

  id  | name |          dataflow_name          | count |    duration     | duration_ns
------+------+---------------------------------+-------+-----------------+-------------
 5347 | Join | Dataflow: materialize.public.mv |   295 | 00:00:00.134217 | 134217728

Motivation

This PR fixes a recognized bug.

Addresses MaterializeInc/database-issues#6761 and touches MaterializeInc/database-issues#6497.

Tips for reviewer

Changing yielding behavior of operators is always scary because it can have implications on flow control. That's why this PR introduces time-based yielding in the most careful way possible, by gating it behind a system var and keeping it off by default. We need to run further tests to validate the new yielding strategy. I assume that unbilled replicas will be very helpful for this!

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:

benesch · 2023-10-14T17:14:38Z

Very excited to try this.

I assume that unbilled replicas will be very helpful for this!

Yes! Absolutely agree.

src/compute/src/render/join/mz_join_core.rs

vmarcos

The change itself looks fine to me, and I appreciate that it is gated by a feature flag and disabled by default.

One of the risks we should probably get a feeling for is how much consolidation potential is left on the table when we force flushes to happen based on time. An initial evaluation of this risk can be done for example with either the scenario SkewedJoin of the feature benchmark or the internal repro from which it took inspiration (perhaps scaling the data size with the former is the easiest). It would be interesting to see the total number of messages as well as its distribution among workers. Using unbilled replicas to check the impact on similar scenarios after this initial validation is also a great idea.

Note about the above: In both SkewedJoin and the internal repro, it's important to have the entire schema created first and only then feed data through persist. This is to get repeatable message counts due to our consistency guarantees.

teskje · 2023-10-16T12:29:40Z

One of the risks we should probably get a feeling for is how much consolidation potential is left on the table when we force flushes to happen based on time.

Good callout! In general I would expect a switch to time-based yielding to (a) reduce throughput and (b) reduce consolidation of the join output. Of course that depends on all sorts of things. For example, I assume that with time-based yielding we will yield more often than with work-based yielding, but that doesn't always need to be true.

I ran both the SkewedJoin feature benchmark and your internal repro with a yielding strategy of time:100 (yield at least every 100ms). These are the results:

`SkewedJoin` feature benchmark

NAME                                | TYPE      |      THIS       |      OTHER      |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
SkewedJoin                          | wallclock |           4.305 |           4.585 |      no       | 6.1 pct   less/faster
SkewedJoin                          | messages  |     8890511.000 |     8891858.000 |      no       | 0.0 pct   less/faster
SkewedJoin                          | memory    |        1086.235 |        1057.625 |      no       | 2.7 pct   more/slower

Higher memory usage is something I'd have expected (due to less consolidation), less time spent is surprising.
But given the high variability that the feature benchmarks usually have, I don't think the differences are significant here.

Internal repro

 worker_id | sum_sent
-----------+----------
 2         |    33250
 7         |    12060
 4         |    10120
 1         |     9830
 6         |     8980
 3         |     8880
 0         |     8600
 5         |     8280
(8 rows)

If I compare this to the results in your Slack message, it looks very close to what the DD join (or the fixed mz_join_core) produces.

vmarcos

Thanks for getting these additional numbers! It's indeed not guaranteed that we'd yield more often with the time-based policy, only when we are able to "recoup" a lot of fuel due to consolidation (because then the fuel-based strategy can catch more air to run, while time-based imposes a wall). So there should be a relationship between skew, data size, and time-to-yield (e.g., reducing the time-to-yield increases the risk that you hit the wall more often).

I think we have some evidence now, though, that hitting the time-based wall is not so easy. So I'd be fine with merging this PR, but I suggest that we keep proceeding with caution (leaving this off by default, testing with unbilled replicas, etc).

teskje · 2023-10-16T14:33:49Z

Recording an insight from a discussion with @antiguru: We should consider yielding by both time and work:

Yielding by time ensures responsiveness of the replica.
Yielding by work ensures that downstream operators get an opportunity to consume (and hopefully reduce) join outputs, and thereby can prevent OOMs.

I don't plan to make any changes to this PR, as we need to do some testing anyway, but a follow-up should extend the linear_join_yielding syntax to make it possible to configure both yielding strategies (e.g. time:100,work:1000000). The yield_fn already supports that.
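
A rough sketch of what such a combined configuration could look like follows. The struct `CombinedYield`, its fields, and `into_yield_fn` are hypothetical names, and the closure signature (activation start time plus work done) is an assumption about the shape of the yield function rather than the actual `mz_join_core` interface. The operator yields as soon as either limit is reached.

```rust
use std::time::{Duration, Instant};

/// Hypothetical combined spec: either limit may be absent.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct CombinedYield {
    pub after_work: Option<usize>,
    pub after_time: Option<Duration>,
}

impl CombinedYield {
    /// Parses a comma-separated list such as `time:100,work:1000000`.
    /// Assumption: the `time` value is given in milliseconds.
    pub fn parse(s: &str) -> Result<Self, String> {
        let mut spec = CombinedYield::default();
        for part in s.split(',') {
            let (kind, value) = part
                .trim()
                .split_once(':')
                .ok_or_else(|| format!("malformed entry `{part}`"))?;
            match kind {
                "work" => {
                    let limit = value
                        .parse::<usize>()
                        .map_err(|e| format!("invalid work amount `{value}`: {e}"))?;
                    spec.after_work = Some(limit);
                }
                "time" => {
                    let ms = value
                        .parse::<u64>()
                        .map_err(|e| format!("invalid time `{value}`: {e}"))?;
                    spec.after_time = Some(Duration::from_millis(ms));
                }
                other => return Err(format!("unknown yield kind `{other}`")),
            }
        }
        Ok(spec)
    }

    /// Builds a closure that reports whether to yield, given the start of
    /// the current activation and the number of updates produced so far.
    pub fn into_yield_fn(self) -> impl Fn(Instant, usize) -> bool {
        move |start, work| {
            let work_hit = self.after_work.map_or(false, |limit| work >= limit);
            let time_hit = self.after_time.map_or(false, |limit| start.elapsed() >= limit);
            // Yield when either budget is exhausted.
            work_hit || time_hit
        }
    }
}
```

Combining the two limits with a logical OR keeps both guarantees listed above: the time limit bounds how long a single activation can block the worker, and the work limit bounds how much output accumulates before downstream operators run.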

antiguru · 2023-10-17T14:02:18Z

Once we have validated that time-based yielding doesn't produce significant regressions, we can make that the default.

I think we can go ahead with this PR (I'll review in a jiffy), but I don't think we should make this the default just yet. At the moment, the join implementation yields to give downstream operators the chance to consume their inputs, avoiding OOMs because the join produced too much data. At the same time, the join needs to consume its inputs ASAP because every time it yields, more inputs might appear. Yielding more frequently can cause a different OOM behavior than what we've seen until now. For this reason, I'm not convinced that we should use time-based yielding by default.

Besides this, the fact that the join takes a long time to consume its inputs is clearly a problem. Changing the yield behavior mitigates the symptom, but does not solve the problem:

Why do we compute a cross-join? Can we offer help to rewrite the query?
Is MFP evaluation too slow? Should we invest in making it faster?
🌶️-take: we could only allow cross-joins if explicitly requested.

antiguru

Thanks, looks good. Left some comments!

src/compute/src/render/join/linear_join.rs

src/compute/src/render/join/mz_join_core.rs

shepherdlybot · 2023-10-17T15:34:05Z

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having this change reviewed, adequate tests should be considered and it may be useful to add observability and/or a feature flag.

 Risk Score  | Probability | Buggy File Hotspots
-------------+-------------+---------------------
 🔴 80 / 100 | 60%         | 1

Buggy File Hotspots:

 File               | Percentile
--------------------+------------
 ../session/vars.rs | 98

teskje · 2023-10-17T15:46:39Z

Besides this, the fact that the join takes a long time to consume its inputs is clearly a problem. Changing the yield behavior mitigates the symptom, but does not solve the problem:

Why do we compute a cross-join? Can we offer help to rewrite the query?

Is MFP evaluation too slow? Should we invest in making it faster?

🌶️-take: we could only allow cross-joins if explicitly requested.

We should consider all these things, but I think we'll still need time-based yielding (in addition to work-based) because we won't be able to rule out long-running joins entirely. Cross-joins are the easiest way to produce them, but any equi-join will become a cross-join if the data is just skewed enough. So we'd also need to think about preventing data skew in the join inputs. And even if we manage that, a join will still be slow if the amount of data it has to crunch through is just large enough.
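
To illustrate the point about skew, the following sketch counts the output rows of an equi-join on a single key column. The function name `equi_join_output_size` is made up for this example and nothing here reflects how the join is actually implemented. The output size is the sum, over all keys, of the product of the matching row counts on both sides, so when every row shares one key the result has the size of a full cross-join.

```rust
use std::collections::HashMap;

/// Counts the output rows of an equi-join between two key columns:
/// each left row contributes one output row per right row with the same key.
pub fn equi_join_output_size(left: &[u64], right: &[u64]) -> u64 {
    // Count how often each key occurs on the right side.
    let mut right_counts: HashMap<u64, u64> = HashMap::new();
    for key in right {
        *right_counts.entry(*key).or_insert(0) += 1;
    }
    // Sum the number of matches for every left row.
    left.iter()
        .map(|key| right_counts.get(key).copied().unwrap_or(0))
        .sum()
}
```

With 1,000 distinct keys on each side the join produces 1,000 rows, while 1,000 rows that all share a single key on each side produce 1,000,000 rows, the same as a cross-join of the two inputs.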

teskje · 2023-10-18T10:05:56Z

TFTRs!

teskje marked this pull request as ready for review October 14, 2023 13:38

teskje requested review from a team, antiguru and vmarcos October 14, 2023 13:38

teskje mentioned this pull request Oct 14, 2023

storage: make decode_and_mfp yield more often #22358

Merged

antiguru reviewed Oct 15, 2023

vmarcos reviewed Oct 16, 2023

vmarcos approved these changes Oct 16, 2023

antiguru approved these changes Oct 17, 2023

teskje force-pushed the time-based-join-fueling branch from b232bb0 to 6407410 Compare October 17, 2023 15:32

teskje force-pushed the time-based-join-fueling branch from 6407410 to b667fa4 Compare October 17, 2023 16:27

vmarcos self-assigned this Oct 17, 2023

teskje added 3 commits October 18, 2023 10:37

compute: controllable mz_join_core yielding

53db52f

This commit provides `mz_join_core` with a yield function, inspired by `persist_source` and DD's half join operator, that allows the caller to control the yield behavior by time or by amount of work performed.

compute: allow specification of join yielding

52827e0

This commit introduces the plumbing required to allow users of Compute to specify the yielding behavior of linear join operators via `UpdateConfiguration` commands.

adapter: add SystemVar linear_join_yielding

a8c2cba

This commit adds a new `SystemVar` called `linear_join_yielding` that can be used to control the yielding behavior of linear joins rendered by the compute layer.

teskje force-pushed the time-based-join-fueling branch from b667fa4 to a8c2cba Compare October 18, 2023 08:38

teskje merged commit 2fdb728 into MaterializeInc:main Oct 18, 2023

teskje deleted the time-based-join-fueling branch October 18, 2023 10:05

vmarcos removed their assignment Oct 18, 2023

teskje mentioned this pull request Oct 24, 2023

Allow join yielding by work and time simultaneously #22610

Merged
