README.md: 213 additions & 2 deletions
@@ -1,6 +1,6 @@
# Fork Union 🍴

-Fork Union is arguably the lowest-latency OpenMP-style NUMA-aware minimalistic scoped thread-pool designed for 'Fork-Join' parallelism in C++, C, and Rust, avoiding × [mutexes & system calls](#locks-and-mutexes), × [dynamic memory allocations](#memory-allocations), × [CAS-primitives](#atomics-and-cas), and × [false-sharing](#alignment--false-sharing) of CPU cache-lines on the hot path 🍴
+Fork Union is arguably the lowest-latency OpenMP-style NUMA-aware minimalistic scoped thread-pool designed for 'Fork-Join' parallelism in C++, C, Rust, and Zig, avoiding × [mutexes & system calls](#locks-and-mutexes), × [dynamic memory allocations](#memory-allocations), × [CAS-primitives](#atomics-and-cas), and × [false-sharing](#alignment--false-sharing) of CPU cache-lines on the hot path 🍴

## Motivation
@@ -13,7 +13,7 @@ OpenMP, however, is not ideal for fine-grained parallelism and is less portable
-It's a C++ 17 library with C 99 and Rust bindings ([previously Rust implementation was standalone in v1](#why-not-reimplement-it-in-rust)).
+It's a C++ 17 library with C 99, Rust, and Zig bindings ([previously Rust implementation was standalone in v1](#why-not-reimplement-it-in-rust)).
It supports pinning threads to specific [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes or individual CPU cores, making it much easier to ensure data locality and halving the latency of individual loads in Big Data applications.
## Basic Usage
@@ -179,13 +179,199 @@ int main() {
For advanced usage, refer to the [NUMA section below](#non-uniform-memory-access-numa).
NUMA detection on Linux defaults to AUTO. Override with `-D FORK_UNION_ENABLE_NUMA=ON` or `OFF`.

+### Intro in Zig
+
+To integrate into your Zig project, add Fork Union to your `build.zig.zon`:
+Unlike the `std.Thread.Pool` task queue for async work, Fork Union is designed for __data parallelism__
+and __tight parallel loops__: think OpenMP's `#pragma omp parallel for` with zero allocations on the hot path.
+
+### Intro in C
+
+Fork Union provides a pure C99 API via `fork_union.h`, wrapping the C++ implementation in pre-compiled libraries: `fork_union_static.a` or `fork_union_dynamic.so`.
+The C API uses opaque `fu_pool_t` handles and function pointers for callbacks, making it compatible with any C99+ compiler.
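
The opaque-handle-plus-callback pattern described above can be sketched in plain C99. This is a minimal illustration, not the `fork_union.h` API: every name here (`demo_pool_t`, `demo_task_t`, `demo_pool_new`, `demo_pool_for_n`, `demo_pool_free`, `demo_sum_of_squares`) is hypothetical, and the loop runs sequentially as a stand-in for a real parallel dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout: this is NOT the `fork_union.h` API. */
typedef struct demo_pool demo_pool_t;                     /* opaque to callers */
typedef void (*demo_task_t)(void *context, size_t index); /* callback type */

/* In a real library this definition would be hidden in the implementation file. */
struct demo_pool {
    size_t threads;
};

/* Creates a pool handle, or returns NULL on zero threads or allocation failure. */
demo_pool_t *demo_pool_new(size_t threads) {
    if (threads == 0) return NULL;
    demo_pool_t *pool = malloc(sizeof *pool);
    if (pool) pool->threads = threads;
    return pool;
}

void demo_pool_free(demo_pool_t *pool) { free(pool); }

/* Calls `task` once per index in [0, n). Sequential stand-in for a parallel loop. */
void demo_pool_for_n(demo_pool_t *pool, size_t n, demo_task_t task, void *context) {
    assert(pool != NULL && task != NULL);
    for (size_t i = 0; i < n; ++i) task(context, i);
}

/* Example callback: adds index^2 to the counter behind `context`.
 * With real threads this shared update would need per-thread partial sums. */
static void add_square(void *context, size_t index) {
    size_t *sum = context;
    *sum += index * index;
}

/* Sum of i*i for i in [0, n), or 0 if the pool cannot be created. */
size_t demo_sum_of_squares(size_t n) {
    demo_pool_t *pool = demo_pool_new(4);
    if (!pool) return 0;
    size_t sum = 0;
    demo_pool_for_n(pool, n, add_square, &sum);
    demo_pool_free(pool);
    return sum;
}
```

Callers only ever hold a `demo_pool_t *` and pass state through the `void *context` argument, which is what keeps such an interface usable from any C99 compiler or foreign-function binding.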
Those are not designed for the same OpenMP-like use cases as __`fork_union`__.
Instead, they primarily focus on task queuing, which requires significantly more work.
@@ -461,6 +647,17 @@ Rust benchmarking results for $N=128$ bodies and $I=1e6$ iterations:
> ² When a combination of performance and efficiency cores is used, dynamic stealing may be more efficient than static slicing. It's also fair to say, that OpenMP is not optimized for AppleClang.
> 🔄 Rotation emoji stands for iterators, the default way to use Rayon and the opt-in slower, but more convenient variant for Fork Union.
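
For readers unfamiliar with the "static slicing" mentioned in footnote ², a minimal C sketch of one common scheme follows. It assumes a balanced split where the first `n % threads` threads receive one extra item; the names `demo_slice_t` and `demo_static_slice` are hypothetical, and this is not necessarily how Fork Union partitions work internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: a contiguous range [first, first + count). */
typedef struct {
    size_t first;
    size_t count;
} demo_slice_t;

/* Returns the slice of [0, n) assigned to `thread_index` out of `threads`.
 * Slices are contiguous, disjoint, and differ in size by at most one item. */
demo_slice_t demo_static_slice(size_t n, size_t threads, size_t thread_index) {
    assert(threads > 0 && thread_index < threads);
    size_t quotient = n / threads;
    size_t remainder = n % threads;
    demo_slice_t slice;
    /* The first `remainder` threads take one extra item each. */
    slice.count = quotient + (thread_index < remainder ? 1 : 0);
    /* Skip the full shares of earlier threads plus their extra items. */
    slice.first = quotient * thread_index +
                  (thread_index < remainder ? thread_index : remainder);
    return slice;
}
```

Each thread computes its own range from its index alone, so no shared counter or queue is touched; the trade-off, as the footnote notes, is that uneven core speeds can leave fast cores idle.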

+Zig benchmarking results for $N=128$ bodies and $I=1e6$ iterations:
+
+| Machine | Standard (S) | Fork Union (D) | Fork Union (S) |
+> The benchmarking suite also includes [Spice](https://github.com/judofyr/spice) and [libXEV](https://github.com/mitchellh/libxev), two popular Zig libraries for async processing, but those don't provide comparable bulk-synchronous APIs.
+> Thus, typically, all of the submitted tasks are executed on a single thread, making results not comparable.
+
You can rerun those benchmarks with the following commands: