How is a data flow partitioned? #1828

ydewit · 2025-04-09T19:40:57Z

Partitioning is the first compilation step performed on a data flow graph by the Hydroflow compiler. Its primary purpose appears to be identifying how much of the data flow can be executed efficiently through compilation techniques (e.g., function calls, iterators, inlining, monomorphization) rather than explicit scheduling. One of the architecture documents in the repository describes this concept as a continuum between “more scheduled” and “more compiled” data flow execution. At one extreme, each partition contains a single operator, resulting in higher overhead but greater flexibility. At the other extreme, the entire flow is compiled into a single sequence of Rust iterators, offering lower overhead but reduced flexibility.

The first dimension of partitioning is likely the Location, as it defines structural boundaries within the flow.

The second dimension appears to convert the data flow into a set of trees, characterized by unique paths between operators. Interestingly, these trees can take several forms: in-trees, out-trees, in-out trees (trees with a common root), or even poly-trees.

Another dimension is the concept of strata, which further divides the flow into strictly ordered phases: all sub-flows within stratum 0 must complete before those in stratum 1 begin execution.

Questions:

1. Are there additional dimensions of partitioning beyond these?
2. Within a single stratum, it seems possible to have multiple independent sub-flows in the same Process. Given that a Process is single-threaded, can these independent sub-flows execute in parallel?

MingweiSamuel (Maintainer) · 2025-04-09T19:57:55Z

> Are there additional dimensions of partitioning beyond these?

Yes, there are (at least) three layers now.

1. At the hydro_lang layer, the graph is divided into different locations. This is fully outside of dfir_rs/lang.

Then once we are on a single node there are two layers, within dfir_rs/lang:

1. The "scheduled layer" schedules subgraphs (compiled components).
2. The "compiled layer" combines multiple operators together within a single subgraph.

The continuum is between these two layers. At the dynamic extreme, you could make each "subgraph" just a single operator (like naive Timely dataflow). At the compiled extreme, you could compile the graph down into a single "subgraph" consisting of a bunch of sequential or nested for-loops which do everything for the entire graph (though this is non-trivial to actually do).
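
To illustrate the two extremes, here is a minimal sketch in plain Rust. It does not use any dfir_rs API; `run_scheduled` and `run_compiled` are hypothetical names, and the two-operator pipeline (double, then keep multiples of 3) is made up for the example.

```rust
/// "Scheduled" extreme: every operator is its own unit of work, and
/// operators exchange data through explicit handoff buffers.
/// (Hypothetical illustration, not dfir_rs code.)
fn run_scheduled(input: &[i32]) -> Vec<i32> {
    // Each operator consumes one buffer and produces a new one.
    let operators: Vec<Box<dyn Fn(Vec<i32>) -> Vec<i32>>> = vec![
        Box::new(|buf| buf.into_iter().map(|x| x * 2).collect()),
        Box::new(|buf| buf.into_iter().filter(|x| x % 3 == 0).collect()),
    ];
    let mut handoff = input.to_vec();
    // The "scheduler": run one operator at a time, materializing each handoff.
    for op in &operators {
        handoff = op(handoff);
    }
    handoff
}

/// "Compiled" extreme: the same operators fused into one iterator chain,
/// which the Rust compiler can inline with no intermediate buffers.
fn run_compiled(input: &[i32]) -> Vec<i32> {
    input.iter().map(|x| x * 2).filter(|x| x % 3 == 0).collect()
}

fn main() {
    let input = [1, 2, 3, 4, 5, 6];
    // Both strategies compute the same result; only the execution shape differs.
    assert_eq!(run_scheduled(&input), run_compiled(&input));
    println!("{:?}", run_compiled(&input));
}
```

The scheduled version pays for dynamic dispatch and a buffer per operator but could reorder or pause operators at run time; the compiled version is fixed at compile time.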

We landed somewhere in the middle, where the graph is divided into the largest possible "in-out trees," and those become the subgraphs.
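
As a rough illustration of the "in-out tree" shape, the sketch below checks whether a small directed graph is an in-tree and an out-tree joined at one root. The definition used here (upstream nodes funnel into the root, downstream nodes fan out from it) is an assumption based on the description earlier in this thread, and `is_in_out_tree` is a hypothetical helper, not something taken from dfir_rs.

```rust
use std::collections::{HashSet, VecDeque};

/// Returns true if the directed graph over nodes `0..n` is an in-out tree
/// rooted at `root`. (Hypothetical helper, not part of dfir_rs.)
fn is_in_out_tree(n: usize, edges: &[(usize, usize)], root: usize) -> bool {
    // A tree over n nodes has exactly n - 1 edges.
    if n == 0 || root >= n || edges.len() != n - 1 {
        return false;
    }
    let mut in_deg = vec![0usize; n];
    let mut out_deg = vec![0usize; n];
    for &(src, dst) in edges {
        if src >= n || dst >= n {
            return false;
        }
        out_deg[src] += 1;
        in_deg[dst] += 1;
    }
    let mut seen = HashSet::from([root]);

    // Walk edges backwards from the root: these nodes form the in-tree,
    // so each must have exactly one outgoing edge (the one toward the root).
    let mut queue = VecDeque::from([root]);
    while let Some(node) = queue.pop_front() {
        for &(src, dst) in edges {
            if dst == node && seen.insert(src) {
                if out_deg[src] != 1 {
                    return false;
                }
                queue.push_back(src);
            }
        }
    }

    // Walk edges forwards from the root: these nodes form the out-tree,
    // so each must have exactly one incoming edge (the one from its parent).
    let mut queue = VecDeque::from([root]);
    while let Some(node) = queue.pop_front() {
        for &(src, dst) in edges {
            if src == node && seen.insert(dst) {
                if in_deg[dst] != 1 {
                    return false;
                }
                queue.push_back(dst);
            }
        }
    }

    // Every node must belong to one of the two trees.
    seen.len() == n
}

fn main() {
    // Two inputs merge at node 2, which then fans out to two outputs.
    let edges = [(0, 2), (1, 2), (2, 3), (2, 4)];
    println!("in-out tree: {}", is_in_out_tree(5, &edges, 2));
    // A diamond has two paths from node 0 to node 3.
    let diamond = [(0, 1), (0, 2), (1, 3), (2, 3)];
    println!("diamond: {}", is_in_out_tree(4, &diamond, 0));
}
```

Under this definition a diamond is rejected because it has two paths between the same pair of nodes, so it could not form a single subgraph.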

Strata are a bit of a cross-cutting concern. Because each stratum needs to run before the next, we need to be able to schedule them in that order, and therefore different strata must be in different subgraphs. So you could think of strata as another layer in between. That being said, with flo semantics we're moving away from strata, and they may be removed as a feature/concept soon. The replacement is `loop { ... }` blocks, which can serve a similar role and can also be thought of as an in-between layer.

> Within a single stratum, it seems possible to have multiple independent sub-flows in the same Process. Given that a Process is single-threaded, can these independent sub-flows execute in parallel?

No, all "processes" are single-threaded, which is an intentional design choice to support a shared-nothing architecture. Multiple hydro "processes" can run on the same machine (or even in the same OS process, at least we want to be able to support that in the future), but must always communicate via channels. That being said, we may loosen that requirement if we have good reason to.
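
As an analogy for the shared-nothing model, the sketch below uses two OS threads from the Rust standard library to stand in for two single-threaded "processes" that communicate only through a channel. This is not how Hydro deploys processes; `run_two_processes` is a hypothetical name and the multiply-by-10 step is made up for the example.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs two stand-in "processes" as threads that share no state and
/// communicate only through a channel. (Analogy only, not Hydro code.)
fn run_two_processes(input: Vec<i32>) -> Vec<i32> {
    let (tx, rx) = mpsc::channel::<i32>();

    // Process A: a single-threaded producer that owns its input outright.
    let producer = thread::spawn(move || {
        for x in input {
            tx.send(x * 10).expect("receiver hung up");
        }
        // `tx` is dropped when this closure returns, which closes the channel.
    });

    // Process B: a single-threaded consumer that sees data only via the channel.
    let consumer = thread::spawn(move || rx.iter().collect::<Vec<i32>>());

    producer.join().expect("producer panicked");
    consumer.join().expect("consumer panicked")
}

fn main() {
    println!("{:?}", run_two_processes(vec![1, 2, 3]));
}
```

Because each side owns its data and nothing is shared, no locks are needed; any parallelism comes from running more processes, not from threads inside one.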

ydewit Apr 9, 2025
Author

> At the hydro_lang layer, the graph is divided into different locations. This is fully outside of dfir_rs/lang

I see: hydro_lang is layering the "distributed" part on top of dfir.

This may come at the expense of some more global optimizations, such as placing operators closer to data sources or sinks, should that use case arise. But I digress...

> Strata are a bit of a cross-cutting concern. Because each stratum needs to run before the next, we need to be able to schedule them in that order, and therefore different strata must be in different subgraphs. So you could think of strata as another layer in between. That being said, with flo semantics we're moving away from strata, and they may be removed as a feature/concept soon. The replacement is `loop { ... }` blocks, which can serve a similar role and can also be thought of as an in-between layer.

One example I have in mind for the need for strata is a filter followed by a sort operator (assuming the stream is unordered). Since the sort operator must wait for all input before it can produce any output, it makes sense to introduce a partition at that point. But this seems to rely on an explicit list of operators that trigger partitioning. Is the new loop {} block essentially making this partition point explicit instead? I need to revisit the Flo paper.
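
The filter-then-sort example can be sketched in plain Rust as two stages separated by a buffer. The function names and the "keep even numbers" predicate are hypothetical, and the stratum labels in the comments only mirror the reasoning above; they are not taken from the compiler.

```rust
/// Stratum 0 in this sketch: a streaming operator. Each item can be
/// passed along as soon as it arrives.
fn filter_stage(input: impl Iterator<Item = i32>) -> impl Iterator<Item = i32> {
    input.filter(|x| x % 2 == 0)
}

/// Stratum 1 in this sketch: a blocking operator. Nothing can be emitted
/// until the whole input has been seen.
fn sort_stage(handoff: Vec<i32>) -> Vec<i32> {
    let mut items = handoff;
    items.sort_unstable();
    items
}

fn filter_then_sort(input: Vec<i32>) -> Vec<i32> {
    // The handoff buffer is the partition point: the filter must finish
    // filling it before the sort is allowed to start.
    let handoff: Vec<i32> = filter_stage(input.into_iter()).collect();
    sort_stage(handoff)
}

fn main() {
    println!("{:?}", filter_then_sort(vec![5, 4, 1, 8, 2]));
}
```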

Possibly related, though I'm not sure: I was intrigued by the filter-then-sort scenario. If a sort operator triggers a stratum partition and thus forces a handoff between filter and sort, what about a case where the sort operator maintains internal state and performs incremental sorting, only releasing outputs once all inputs have been received? I may be making too many assumptions here without a better understanding of the internals.
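
For concreteness, the incremental-sort scenario could look like the sketch below: a stateful operator that orders items as they arrive and releases them only when told the input has ended. `IncrementalSort` and `fused_filter_sort` are hypothetical names, and nothing here is a claim about how dfir_rs implements or schedules its sort operator.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stateful sort operator: it does the ordering work as items
/// arrive, but emits nothing until the input is known to be complete.
struct IncrementalSort {
    // `Reverse` turns the max-heap into a min-heap.
    heap: BinaryHeap<Reverse<i32>>,
}

impl IncrementalSort {
    fn new() -> Self {
        Self { heap: BinaryHeap::new() }
    }

    /// Called once per arriving item; O(log n) work, no output.
    fn push(&mut self, item: i32) {
        self.heap.push(Reverse(item));
    }

    /// Called once at end of input; drains the state in ascending order.
    fn finish(mut self) -> Vec<i32> {
        let mut out = Vec::with_capacity(self.heap.len());
        while let Some(Reverse(item)) = self.heap.pop() {
            out.push(item);
        }
        out
    }
}

/// Filter and sort run in one loop with no buffer between them; the only
/// barrier left is the end-of-input signal that triggers `finish`.
fn fused_filter_sort(input: &[i32]) -> Vec<i32> {
    let mut sort = IncrementalSort::new();
    for &x in input.iter().filter(|x| **x % 2 == 0) {
        sort.push(x);
    }
    sort.finish()
}

fn main() {
    println!("{:?}", fused_filter_sort(&[5, 4, 1, 8, 2]));
}
```

Even in this form the operator still needs to learn that its input is complete, so some ordering constraint between "all input received" and "output released" remains.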
