Skip to content

compute: time based linear join yielding #22391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 18, 2023

Conversation

teskje
Copy link
Contributor

@teskje teskje commented Oct 14, 2023

This PR adds support for configuring compute rendering so that it employs time-based (rather than work-based) yielding for linear joins implemented by mz_join_core.

The changes made here include:

  • Providing mz_join_core with a yield_fn that allows configuring the yielding strategy from outside (1st commit).
  • Plumbing a yield spec through compute so the mz_join_core yielding behavior can be configured through UpdateConfiguration commands (2nd commit).
  • Adding a new system var, linear_join_yielding, to allow changing the yield spec from LD (3rd commit).

So far the default is still the previous behavior of yielding after 1 million updates produced (i.e. work:1000000). Once we were able to validate that time-based yielding doesn't produce significant regressions, we can make that the default.

Example

As a data point that time-based yielding can improve things, consider the example from MaterializeInc/database-issues#6761:

CREATE TABLE t (a int);
CREATE MATERIALIZED VIEW mv AS SELECT (t1.a + t2.a) % 2 FROM t t1, t t2;
INSERT INTO t SELECT generate_series(1, 10000);

On my system, the join's duration histogram looks like this:

  • with work:1000000:
      id  | name |          dataflow_name          | count |    duration     | duration_ns
    ------+------+---------------------------------+-------+-----------------+-------------
     3126 | Join | Dataflow: materialize.public.mv |     1 | 00:00:34.359738 | 34359738368
    
  • with time:100:
      id  | name |          dataflow_name          | count |    duration     | duration_ns
    ------+------+---------------------------------+-------+-----------------+-------------
     5347 | Join | Dataflow: materialize.public.mv |   295 | 00:00:00.134217 | 134217728
    

Motivation

  • This PR fixes a recognized bug.

Addresses MaterializeInc/database-issues#6761 and touches MaterializeInc/database-issues#6497.

Tips for reviewer

Changing yielding behavior of operators is always scary because it can have implications on flow control. That's why this PR introduces time-based yielding in the most careful way possible, by gating it behind a system var and keeping it off by default. We need to run further tests to validate the new yielding strategy. I assume that unbilled replicas will be very helpful for this!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@teskje teskje marked this pull request as ready for review October 14, 2023 13:38
@teskje teskje requested review from a team, antiguru and vmarcos October 14, 2023 13:38
@benesch
Copy link
Contributor

benesch commented Oct 14, 2023

Very excited to try this.

I assume that unbilled replicas will be very helpful for this!

Yes! Absolutely agree.

Copy link
Contributor

@vmarcos vmarcos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change itself looks fine to me; it is, though, appreciated that this is gated by a feature flag and disabled by default.

One of the risks we should probably get a feeling for is how much consolidation potential is left on the table when we force flushes to happen based on time. An initial evaluation of this risk can be done for example with either the scenario SkewedJoin of the feature benchmark or the internal repro from which it took inspiration (perhaps scaling the data size with the former is the easiest). It would be interesting to see the total number of messages as well as its distribution among workers. Using unbilled replicas to check the impact on similar scenarios after this initial validation is also a great idea.

Note about the above: In both SkewedJoin and the internal repro, it's important to have the entire schema created first and only then feed data through persist. This is to get repeatable message counts due to our consistency guarantees.

@teskje
Copy link
Contributor Author

teskje commented Oct 16, 2023

One of the risks we should probably get a feeling for is how much consolidation potential is left on the table when we force flushes to happen based on time.

Good callout! In general I would expect a switch to time-based yielding to (a) reduce throughput and (b) reduce consolidation of the join output. Of course that depends on all sort of things. For example, I assume that with time-based yielding we will yield more often than with work-based yielding, but that doesn't always need to be true.

I ran both the SkewedJoin feature benchmark and your internal repro with a yielding strategy of time:100 (yield at least every 100ms). These are the results:

SkewedJoin feature benchmark

NAME                                | TYPE      |      THIS       |      OTHER      |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
SkewedJoin                          | wallclock |           4.305 |           4.585 |      no       | 6.1 pct   less/faster
SkewedJoin                          | messages  |     8890511.000 |     8891858.000 |      no       | 0.0 pct   less/faster
SkewedJoin                          | memory    |        1086.235 |        1057.625 |      no       | 2.7 pct   more/slower

Higher memory usage is something I'd have expected (due to less consolidation), less time spent is surprising.
But given the high variability that the feature benchmarks usually have, I don't think the differences are significant here.

Internal repro

 worker_id | sum_sent
-----------+----------
 2         |    33250
 7         |    12060
 4         |    10120
 1         |     9830
 6         |     8980
 3         |     8880
 0         |     8600
 5         |     8280
(8 rows)

If I compare this to the results in your Slack message, it looks very close to what the DD join (or the fixed mz_join_core) produces.

Copy link
Contributor

@vmarcos vmarcos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting these additional numbers! It's indeed not guaranteed that we'd yield more often with the time-based policy, only when we are able to "recoup" a lot of fuel due to consolidation (because then the fuel-based strategy can catch more air to run, while time-based imposes a wall). So there should be a relationship between skew, data size, and time-to-yield (e.g., reducing the time-to-yield increases the risk that you hit the wall more often).

I think we have some evidence now, though, that hitting the time-based wall is not so easy. So I'd be fine with merging this PR, but I suggest that we keep to proceeding with caution (leaving this off by default, testing with unbilled replicas, etc).

@teskje
Copy link
Contributor Author

teskje commented Oct 16, 2023

Recording an insight from a discussion with @antiguru: We should consider yielding by both time and work:

  • Yielding by time ensures responsiveness of the replica.
  • Yielding by work ensures that downstream operators get opportunity to consume (and hopefully reduce) join outputs, and thereby can prevent OOMs.

I don't plan to make any changes to this PR, as we need to do some testing anyway, but a follow-up should extend the linear_join_yielding syntax to make it possible to configure both yielding strategies (e.g. time:100,work:1000000). The yield_fn already supports that.

@antiguru
Copy link
Member

Once we were able to validate that time-based yielding doesn't produce significant regressions, we can make that the default.

I think we can go ahead with this PR (I'll review in a jiffy), but I don't think we should make this the default just yet. At the moment, the join implementation yields to give downstream operators the chance to consumer their inputs, avoiding OOMs because the join produced too much data. At the same time, the join needs to consume its inputs ASAP because every time it yields, more inputs might appear. Yielding more frequently can cause a different OOM behavior than what we've seen until now. For this reason, I'm not convinced that we should use time-based yielding by default.

Besides this, the fact that the join takes a long time to consume its inputs is clearly a problem. Changing the yield behavior mitigates the symptom, but does not solve the problem:

  • Why do we compute a cross-join? Can we offer help to rewrite the query?
  • Is MFP evaluation too slow? Should we invest in making it faster?
  • 🌶️-take: we could only allow cross-joins if explicitly requested.

Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good. Left some comments!

@teskje teskje force-pushed the time-based-join-fueling branch from b232bb0 to 6407410 Compare October 17, 2023 15:32
@shepherdlybot
Copy link

shepherdlybot bot commented Oct 17, 2023

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having this change reviewed, adequate tests should be considered and it may be useful to add observability and/or a feature flag. What's This?

Risk Score Probability Buggy File Hotspots
🔴 80 / 100 60% 1
Buggy File Hotspots:
File Percentile
../session/vars.rs 98

@teskje
Copy link
Contributor Author

teskje commented Oct 17, 2023

Besides this, the fact that the join takes a long time to consume its inputs is clearly a problem. Changing the yield behavior mitigates the symptom, but does not solve the problem:

  • Why do we compute a cross-join? Can we offer help to rewrite the query?
  • Is MFP evaluation too slow? Should we invest in making it faster?
  • 🌶️-take: we could only allow cross-joins if explicitly requested.

We should consider all these things, but I think we'll still need time-based yielding (in addition to work-based) because we won't be able to rule out long-running joins entirely. Cross-joins are the easiest way to produce them, but any equi-join will become a cross-join if the data is just skewed enough. So we'd also need to think about preventing data skew in the join inputs. And even if we manage that, a join will still be slow if it the amount of data it has to crunch through is just large enough.

@teskje teskje force-pushed the time-based-join-fueling branch from 6407410 to b667fa4 Compare October 17, 2023 16:27
@vmarcos vmarcos self-assigned this Oct 17, 2023
This commit provides `mz_join_core` with a yield functions, inspired by
`persist_source` and DD's half join operator, that allows the caller to
control the yield behavior by time or by amount of work performed.
This commit introduces the plumbing required to allow users of Compute
to specify the yielding behavior of linear join operators via
`UpdateConfiguration` commands.
This commit adds a new `SystemVar` called `linear_join_yielding` that
can be used to control the yielding behavior of linear joins rendered by
the compute layer.
@teskje teskje force-pushed the time-based-join-fueling branch from b667fa4 to a8c2cba Compare October 18, 2023 08:38
@teskje teskje merged commit 2fdb728 into MaterializeInc:main Oct 18, 2023
@teskje
Copy link
Contributor Author

teskje commented Oct 18, 2023

TFTRs!

@teskje teskje deleted the time-based-join-fueling branch October 18, 2023 10:05
@vmarcos vmarcos removed their assignment Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants