doc/design: A Small Coordinator for a More Scalable and Isolated Materialize #33082
has to be involved when absolutely necessary. A good analogy might be CISC vs
RISC instruction sets, where CISC has fewer, more complex opcodes and RISC has
possibly more, but simpler opcodes.
Nit: I think you mixed up the analogy here! CISC ISAs usually have more opcodes than RISC ISAs.
I think you do want the coordinator to be "RISC" in terms of the complexity (less) and number (fewer) of commands. But RISC machines also have to execute more commands than CISC machines to implement the same logic, whereas I think we also want the coordinator to execute fewer commands than it does now.
Yeah, I messed up that one part. CISC is actually more opcodes and the opcodes are more complicated. So the analogy works quite well.
Where the frontend previously sent one EXECUTE SELECT, it would now send the smaller GET CATALOG, GET THIS CLIENT, GET THAT CLIENT messages. But the frontend would also keep those clients around, so it doesn't have to run those commands for every peek.
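To make the "smaller commands plus caching" idea concrete, here is a minimal sketch of a per-session frontend that fetches handles from the coordinator once and reuses them for every later peek. All names here (`Command`, `Handle`, `Coordinator`, `Frontend`, and the compute/storage split of "this client" and "that client") are hypothetical stand-ins for illustration, not Materialize's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical fine-grained commands a frontend could send instead of one
/// monolithic "execute SELECT" command. The names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Command {
    GetCatalog,
    GetComputeClient,
    GetStorageClient,
}

/// Stand-in for whatever handle the coordinator would hand back.
#[derive(Debug, Clone, PartialEq)]
struct Handle(String);

/// Toy coordinator that only counts how many commands it had to serve.
#[derive(Default)]
struct Coordinator {
    commands_served: usize,
}

impl Coordinator {
    fn handle(&mut self, cmd: &Command) -> Handle {
        self.commands_served += 1;
        Handle(format!("{:?}", cmd))
    }
}

/// Per-session frontend that keeps handles around between peeks.
#[derive(Default)]
struct Frontend {
    cache: HashMap<Command, Handle>,
}

impl Frontend {
    /// Return a cached handle, asking the coordinator only on a cache miss.
    fn get(&mut self, coord: &mut Coordinator, cmd: Command) -> Handle {
        if let Some(h) = self.cache.get(&cmd) {
            return h.clone();
        }
        let h = coord.handle(&cmd);
        self.cache.insert(cmd, h.clone());
        h
    }

    /// Sequence one peek on the frontend using (possibly cached) handles.
    fn peek(&mut self, coord: &mut Coordinator, query: &str) -> String {
        let catalog = self.get(coord, Command::GetCatalog);
        let compute = self.get(coord, Command::GetComputeClient);
        let storage = self.get(coord, Command::GetStorageClient);
        format!("{} via {} / {} / {}", query, catalog.0, compute.0, storage.0)
    }
}
```

In this sketch the coordinator serves three small commands for the first peek of a session and none for subsequent peeks, which is the sense in which it executes fewer commands overall even though each command is simpler. A real design would also need some way to invalidate cached handles, which this sketch leaves out.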
- `controller_ready(compute)`: the compute controller signaling that a peek
  result is ready and the Coordinator needs to act.
Both controllers also become ready when frontiers advance. Probably doesn't show up here because it gets drowned out by the peek responses, but it is also something we should fix. The reason we wake up on all frontier changes is that there might have been watches registered for query lifecycle tracking and we need to check those somewhere. But we shouldn't check them on the coordinator main loop.
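One way to picture checking watches away from the coordinator main loop is a small, self-contained watch set that a separate task could own and feed with frontier updates. The sketch below is an assumption-laden illustration: `FrontierWatches` is a hypothetical type, frontiers are simplified to a single totally ordered `u64`, and a watch at timestamp `t` is treated as satisfied once the frontier is strictly greater than `t`.

```rust
use std::collections::BTreeMap;

/// Hypothetical set of pending watches, keyed by the timestamp they wait for.
#[derive(Default)]
struct FrontierWatches {
    pending: BTreeMap<u64, Vec<&'static str>>,
}

impl FrontierWatches {
    /// Register a watch that should fire once `ts` is complete.
    fn register(&mut self, ts: u64, label: &'static str) {
        self.pending.entry(ts).or_default().push(label);
    }

    /// Process a frontier advance and return the watches that fired.
    /// Only watches with a timestamp strictly below `frontier` fire;
    /// everything else stays pending.
    fn advance(&mut self, frontier: u64) -> Vec<&'static str> {
        // `split_off` returns the entries with key >= frontier.
        let still_pending = self.pending.split_off(&frontier);
        let fired = std::mem::replace(&mut self.pending, still_pending);
        fired.into_values().flatten().collect()
    }
}
```

A task owning such a structure would only need to notify anyone when `advance` returns a non-empty list, so frontier changes that satisfy no watch would not wake the coordinator at all.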
controller](https://github.com/MaterializeInc/materialize/pull/29559). With
that work fully realized, both for the storage controller and the remaining
compute controller moments, a visualization of the workflow would look like
this:
I think it already basically looks like this for the compute controller. At least it doesn't really do anything on `process`, just maybe sends a "run maintenance" command to the instance tasks and then returns a stashed response, if it has any.
Sending the "run maintenance" command is cheap. But we can also remove it, I think, by giving each instance task its own maintenance ticker.
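As a rough illustration of an instance task with its own maintenance ticker, the sketch below uses a plain thread and `std::sync::mpsc` so that it stays self-contained; the real instance tasks are presumably async, and `InstanceCommand`, `InstanceStats`, and `spawn_instance` are hypothetical names. The point is only that the task decides when to run maintenance itself, so nobody has to send it a "run maintenance" command.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical commands for an instance task. Note that there is no
/// "run maintenance" variant: the task schedules maintenance on its own.
enum InstanceCommand {
    Work(u64),
    Shutdown,
}

#[derive(Debug, Default, PartialEq)]
struct InstanceStats {
    work_items: u64,
    maintenance_runs: u64,
}

/// Spawn an instance task that runs maintenance roughly every `tick`.
fn spawn_instance(tick: Duration) -> (Sender<InstanceCommand>, JoinHandle<InstanceStats>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut stats = InstanceStats::default();
        let mut next_tick = Instant::now() + tick;
        loop {
            // Check the ticker first so steady command traffic cannot
            // starve maintenance.
            let now = Instant::now();
            if now >= next_tick {
                stats.maintenance_runs += 1; // placeholder for real maintenance
                next_tick = now + tick;
            }
            // Wait for a command, but no longer than until the next tick.
            match rx.recv_timeout(next_tick.saturating_duration_since(now)) {
                Ok(InstanceCommand::Work(_)) => stats.work_items += 1,
                Ok(InstanceCommand::Shutdown) | Err(RecvTimeoutError::Disconnected) => break,
                Err(RecvTimeoutError::Timeout) => {} // loop around; the tick check fires
            }
        }
        stats
    });
    (tx, handle)
}
```

With this shape, the controller's `process` path would have nothing left to do for maintenance, at the cost of one timer per instance task.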
layer.
- We don't want to improve throughput benchmark numbers, only remove the
  Coordinator as a bottleneck. Our work might increase throughput numbers or
  reveal similar bottlenecks in other parts of the system.
I think moving the current work of peek sequencing to the frontend could increase QPS even without horizontal scaling, because:
- Doing most of the peek sequencing work on the per-session frontend task would allow us to at least vertically scale envd inside a single machine when a particular user wants a bit more QPS.
- I imagine that the current staged execution has some overhead, which would disappear if we had straight-line code instead.
Rendered: https://github.com/aljoscha/materialize/blob/design-small-coordinator/doc/developer/design/20250717_a_small_coordinator_more_scalable_isolated_materialize.md