Commit 6955c34
Merge: Iterators & Docs
2 parents 5e0d35d + ef0ae95
File tree

5 files changed (+1233, −30 lines)

5 files changed

+1233
-30
lines changed

README.md

Lines changed: 71 additions & 7 deletions
@@ -106,6 +106,7 @@ fn main() -> Result<(), Box<dyn Error>> {
```

For advanced usage, refer to the [NUMA section below](#non-uniform-memory-access-numa).
For Rayon-style parallel iterators, pull in the `prelude` module and check out the [related examples](#rayon-style-parallel-iterators).

### Intro in C++

@@ -369,6 +370,68 @@ No kernel calls.
No futexes.
Works in tight loops.

### Rayon-style Parallel Iterators

For Rayon-style ergonomics, use the parallel iterator API with the `prelude`.
Unlike Rayon, Fork Union's parallel iterators don't depend on global state and give you explicit control over the thread pool and the scheduling strategy.
For statically shaped workloads, the default static scheduling is more efficient:

```rust
use fork_union as fu;
use fork_union::prelude::*;

let mut pool = fu::spawn(4);
let mut data: Vec<usize> = (0..1000).collect();

(&data[..])
    .into_par_iter()
    .with_pool(&mut pool)
    .for_each(|value| {
        println!("Value: {}", value);
    });
```
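
Conceptually, static scheduling cuts the index range into one contiguous chunk per thread before any work starts, so no coordination is needed at runtime. Below is a minimal std-only sketch of that idea; `static_sum` is a hypothetical helper for illustration, not Fork Union's implementation:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Hypothetical sketch of static scheduling: slice the input into one
// contiguous chunk per thread up front, sum each chunk locally, and
// merge the partial sums at the end.
fn static_sum(threads: usize, data: &[usize]) -> usize {
    let total = AtomicUsize::new(0);
    let chunk = data.len().div_ceil(threads).max(1);
    thread::scope(|s| {
        for part in data.chunks(chunk) {
            let total = &total;
            s.spawn(move || {
                let local: usize = part.iter().sum(); // no sharing in the hot loop
                total.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    total.into_inner()
}

fn main() {
    let data: Vec<usize> = (0..1000).collect();
    assert_eq!(static_sum(4, &data), 499_500);
}
```

Because the split is fixed up front, a chunk that lands on a slower core delays the whole loop, which is exactly the case where dynamic scheduling helps.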

For dynamic work-stealing, use `with_schedule` with the `DynamicScheduler`:

```rust
(&mut data[..])
    .into_par_iter()
    .with_schedule(&mut pool, DynamicScheduler)
    .for_each(|value| {
        *value *= 2;
    });
```
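
Dynamic scheduling hands out work on demand instead: conceptually, each idle thread claims the next unprocessed index from a shared atomic counter, so faster cores naturally take more of the work. A hedged std-only sketch of that idea; `dynamic_sum` is hypothetical, and it claims one element at a time for simplicity, while real schedulers usually claim larger chunks:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Hypothetical sketch of dynamic scheduling: workers repeatedly claim
// the next unprocessed index from a shared counter, so fast threads
// end up processing more elements than slow ones.
fn dynamic_sum(threads: usize, data: &[usize]) -> usize {
    let next = AtomicUsize::new(0);
    let total = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            let (next, total) = (&next, &total);
            s.spawn(move || loop {
                let i = next.fetch_add(1, Ordering::Relaxed); // claim one index
                if i >= data.len() {
                    break;
                }
                total.fetch_add(data[i], Ordering::Relaxed);
            });
        }
    });
    total.into_inner()
}

fn main() {
    let data: Vec<usize> = (0..1000).collect();
    assert_eq!(dynamic_sum(4, &data), 499_500);
}
```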

This easily composes with other iterator adaptors, like `map`, `filter`, and `zip`:

```rust
(&data[..])
    .into_par_iter()
    .filter(|&x| x % 2 == 0)
    .map(|x| x * x)
    .with_pool(&mut pool)
    .for_each(|value| {
        println!("Squared even: {}", value);
    });
```

Moreover, each thread can maintain its own scratch space to avoid contention during reductions.
Cache-line alignment via `CacheAligned` prevents false sharing:

```rust
// Cache-line-aligned wrapper to prevent false sharing
let mut scratch: Vec<CacheAligned<usize>> =
    (0..pool.threads()).map(|_| CacheAligned(0)).collect();

(&data[..])
    .into_par_iter()
    .with_pool(&mut pool)
    .fold_with_scratch(scratch.as_mut_slice(), |acc, value, _prong| {
        acc.0 += *value;
    });
let total: usize = scratch.iter().map(|a| a.0).sum();
```
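
A wrapper like this is usually just an alignment-annotated newtype. Here is a sketch under the assumption of 64-byte cache lines; the actual `CacheAligned` may choose the alignment differently (Apple Silicon, for instance, uses 128-byte lines):

```rust
// Hypothetical sketch of a cache-line-aligned wrapper: `#[repr(align(64))]`
// pads each value to its own 64-byte cache line, so neighboring per-thread
// accumulators never share a line and concurrent writes don't false-share.
#[repr(align(64))]
#[derive(Clone, Copy, Default, Debug)]
struct Aligned64<T>(T);

fn main() {
    // One accumulator per "thread": each lands on a distinct cache line.
    let scratch = vec![Aligned64(0usize); 4];
    assert_eq!(std::mem::align_of::<Aligned64<usize>>(), 64);
    assert_eq!(std::mem::size_of::<Aligned64<usize>>(), 64);
    assert_eq!(scratch.len(), 4);
}
```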

## Performance

One of the most common parallel workloads is the N-body simulation ¹.
@@ -383,19 +446,20 @@ C++ benchmarking results for $N=128$ bodies and $I=1e6$ iterations:
| Machine        | OpenMP (D) | OpenMP (S) | Fork Union (D) | Fork Union (S) |
| :------------- | ---------: | ---------: | -------------: | -------------: |
| 16x Intel SPR  |      20.3s |      16.0s |          18.1s |          10.3s |
| 12x Apple M2   | 1m:34.8s ² | 1m:25.9s ² |          31.5s |          20.3s |
| 96x Graviton 4 |      32.2s |      20.8s |          39.8s |          26.0s |

Rust benchmarking results for $N=128$ bodies and $I=1e6$ iterations:

| Machine        | Rayon (D)   | Rayon (S)   | Fork Union (D)  | Fork Union (S)  |
| :------------- | ----------: | ----------: | --------------: | --------------: |
| 16x Intel SPR  | 🔄 51.4s    | 🔄 38.1s    | 15.9s           | 9.8s            |
| 12x Apple M2   | 🔄 1m:47.8s | 🔄 1m:07.1s | 24.5s, 🔄 26.8s | 11.0s, 🔄 11.8s |
| 96x Graviton 4 | 🔄 2m:13.9s | 🔄 1m:35.6s | 18.9s           | 10.1s           |

> ¹ Another common workload is "Parallel Reductions", covered in a separate [repository](https://github.com/ashvardanian/ParallelReductionsBenchmark).
> ² When a combination of performance and efficiency cores is used, dynamic stealing may be more efficient than static slicing. It's also fair to say that OpenMP is not optimized for AppleClang.
> 🔄 The rotation emoji marks iterator-based runs: the default way to use Rayon, and an opt-in variant for Fork Union that is slower but more convenient.

You can rerun those benchmarks with the following commands:
