Description
On this very hacked branch, https://github.com/CliMA/ClimaCore.jl/tree/ck/drop_field_dimension (PR #1929), we can reach good bandwidth efficiency for pointwise kernels by combining linear indexing with dropped field dimensions on ClimaCore broadcasted objects. The results below come from the `thermo_bench_bw.jl` benchmark script:
Main branch (Clima A100):
[ Info: device = ClimaComms.CUDADevice()
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs │ time per call │ bw % │ achieved bw │ n-reads/writes │ n-reps │
├────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ TBB.thermo_func_bc!(x, us; nreps=100, bm) │ 796 microseconds, 877 nanoseconds │ 12.4798 │ 254.462 │ 10 │ 100 │
└────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
[ Info: device = ClimaComms.CUDADevice()
Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs │ time per call │ bw % │ achieved bw │ n-reads/writes │ n-reps │
├────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ TBB.thermo_func_bc!(x, us; nreps=100, bm) │ 1 millisecond, 43 microseconds │ 19.0568 │ 388.569 │ 10 │ 100 │
└────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
Branch with dropped field dimension + linear indexing (Clima A100):
julia> using Revise; include(joinpath("benchmarks", "scripts", "thermo_bench_bw.jl"))
WARNING: replacing module ThermoBenchBandwidth.
[ Info: device = ClimaComms.CUDADevice()
[ Info: Success!
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌───────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs │ time per call │ bw % │ achieved bw │ n-reads/writes │ n-reps │
├───────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ TBB.thermo_func_bc!(x, us; nreps=100, bm) │ 131 microseconds, 503 nanoseconds │ 75.6249 │ 1541.99 │ 10 │ 100 │
└───────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌───────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs │ time per call │ bw % │ achieved bw │ n-reads/writes │ n-reps │
├───────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ TBB.thermo_func_bc!(x, us; nreps=100, bm) │ 256 microseconds, 379 nanoseconds │ 77.5791 │ 1581.84 │ 10 │ 100 │
└───────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
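As a sanity check on the tables above, the `achieved bw` and `bw %` columns can be reproduced from the problem size, `n-reads/writes`, and the time per call. This is a minimal sketch, not the benchmark's own code: the helper `achieved_bw` is hypothetical, and it assumes the reported bandwidth is bytes moved divided by 2^30 per second, which is the convention that matches the reported Float32 numbers to within rounding.

```julia
# Hypothetical helper, not part of thermo_bench_bw.jl.
# Assumption: bandwidth = (points * reads/writes * bytes per value) / 2^30 / time.
problem_size = (63, 4, 4, 1, 5400)
n_reads_writes = 10
device_bandwidth_GBs = 2039

function achieved_bw(::Type{FT}, time_s) where {FT}
    bytes = prod(problem_size) * n_reads_writes * sizeof(FT)
    return bytes / 2^30 / time_s
end

# Float32 timings taken from the tables above (in seconds).
t_main = 796.877e-6
t_branch = 131.503e-6

bw_main = achieved_bw(Float32, t_main)      # table reports 254.462
bw_branch = achieved_bw(Float32, t_branch)  # table reports 1541.99

pct_branch = 100 * bw_branch / device_bandwidth_GBs  # table reports 75.6249
speedup = t_main / t_branch                          # roughly 6x for Float32
```

The same calculation with `Float64` and the Float64 timings gives a speedup of roughly 4x.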
One future-proofing complication of this branch is that we will need to continue to support the field dimension being present (perhaps inside `TupleOfArrays`, or whatever we decide to call this new layer's struct) in order to still work reasonably with on the order of 100 tracers.
Just to note: dropping the field dimension roughly doubled the performance, and linear indexing accounted for the rest of the gain. As discussed with @tapios, applying only linear indexing seems to improve performance for broadcasts over a single variable, but seems to degrade performance for broadcasts over multiple variables. So it seems that both changes are needed in tandem to improve performance.
cc @tapios
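To make the two changes concrete, here is a minimal CPU-only sketch of the indexing patterns, using plain `Array`s rather than ClimaCore DataLayouts. The function `f` and both kernels are hypothetical stand-ins, not ClimaCore's implementation, and the sketch shows only the access pattern; it does not reproduce the GPU performance difference.

```julia
# Hypothetical pointwise function standing in for a thermodynamics call.
f(x, y) = x * y + one(x)

# Cartesian indexing over a 5-D array that keeps the singleton field dimension.
function pointwise_cartesian!(out, a, b)
    @inbounds for I in CartesianIndices(out)
        out[I] = f(a[I], b[I])
    end
    return out
end

# Linear indexing over arrays whose field dimension has been dropped.
function pointwise_linear!(out, a, b)
    @inbounds for i in eachindex(out, a, b)
        out[i] = f(a[i], b[i])
    end
    return out
end

FT = Float32
dims = (63, 4, 4, 1, 16)  # same layout as the benchmark, with fewer columns
a5 = rand(FT, dims)
b5 = rand(FT, dims)
out5 = similar(a5)

# Drop the singleton field dimension (dimension 4).
drop(x) = dropdims(x; dims = 4)
a4 = drop(a5)
b4 = drop(b5)
out4 = similar(a4)

pointwise_cartesian!(out5, a5, b5)
pointwise_linear!(out4, a4, b4)
```

Both kernels compute the same values; only the index arithmetic per element differs.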
### Tasks
- [x] Refactor DataLayout internals to use `parent` less
- [x] Refactor DSS to use `parent` less, and leverage `UniversalSize`
- [x] Define `ArraySize` (similar to `UniversalSize`), that includes the field dimension
- [x] Define DataLayouts type parameter utilities? (e.g., `type_params`)
- [ ] Define a new layer, beneath DataLayouts, to store tuples of arrays
- [ ] Bypass `Base.Broadcast`'s indexing to allow for linear indexing for pointwise kernels
- [ ] https://github.com/CliMA/ClimaCore.jl/pull/1948
- [ ] https://github.com/CliMA/ClimaCore.jl/pull/1946
- [ ] https://github.com/CliMA/ClimaCore.jl/pull/1944
- [ ] https://github.com/CliMA/ClimaCore.jl/pull/1943
- [ ] https://github.com/CliMA/ClimaCore.jl/pull/1920
- [ ] https://github.com/CliMA/ClimaCore.jl/pull/1898
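
For the "tuples of arrays" task, one possible shape for the new layer is sketched below. Everything here is an assumption for illustration: the names `TupleOfArraysSketch`, `split_fields`, and `scale_fields!` are hypothetical, and the real design may differ. The idea is that each field gets its own array without a field dimension, so pointwise kernels can index each one linearly while the number of fields (for example, many tracers) stays unrestricted.

```julia
# Hypothetical container: one array per field, none of which has a field dimension.
struct TupleOfArraysSketch{N, A <: AbstractArray}
    arrays::NTuple{N, A}
end

# Split a 5-D array with the field dimension in position 4 into per-field arrays.
function split_fields(data::AbstractArray{T, 5}) where {T}
    nf = size(data, 4)
    return TupleOfArraysSketch(ntuple(f -> copy(selectdim(data, 4, f)), nf))
end

# Pointwise update of every field, using linear indexing within each array.
function scale_fields!(toa::TupleOfArraysSketch, c)
    for arr in toa.arrays
        @inbounds for i in eachindex(arr)
            arr[i] *= c
        end
    end
    return toa
end

dims = (3, 2, 2, 2, 4)  # small example with two fields
data = reshape(collect(Float32, 1:prod(dims)), dims)
toa = split_fields(data)
scale_fields!(toa, 2.0f0)
```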