This benchmark compares the memory bandwidth (BW) of several access patterns using Kokkos:
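// Coalesced copy: contiguous read, contiguous write
KOKKOS_LAMBDA(const _SIZE_ i) {
  write(i) = read(i);
}
// Uncoalesced read (gather): contiguous write, indirect read
KOKKOS_LAMBDA(const _SIZE_ i) {
  _SIZE_ rindex = indirections(i);
  write(i) = read(rindex);
}
// Uncoalesced write (scatter): contiguous read, indirect write
KOKKOS_LAMBDA(const _SIZE_ i) {
  _SIZE_ rindex = indirections(i);
  write(rindex) = read(i);
}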
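To make the difference between the three kernels concrete, here is a minimal host-side sketch in plain C++ (no Kokkos), assuming `std::vector` stands in for the views. The function names `copy_coalesced`, `copy_gather` and `copy_scatter` are illustrative and do not come from the benchmark sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coalesced copy: reads and writes both walk memory contiguously.
void copy_coalesced(const std::vector<double>& read, std::vector<double>& write) {
    assert(write.size() == read.size());
    for (std::size_t i = 0; i < read.size(); ++i) write[i] = read[i];
}

// Uncoalesced read (gather): contiguous writes, reads go through the table.
void copy_gather(const std::vector<double>& read, std::vector<double>& write,
                 const std::vector<std::size_t>& indirections) {
    assert(write.size() == read.size() && indirections.size() == read.size());
    for (std::size_t i = 0; i < read.size(); ++i) write[i] = read[indirections[i]];
}

// Uncoalesced write (scatter): contiguous reads, writes go through the table.
void copy_scatter(const std::vector<double>& read, std::vector<double>& write,
                  const std::vector<std::size_t>& indirections) {
    assert(write.size() == read.size() && indirections.size() == read.size());
    for (std::size_t i = 0; i < read.size(); ++i) write[indirections[i]] = read[i];
}
```

When `indirections` is a permutation, gather and scatter both move every element exactly once, so the three variants transfer the same amount of payload data and differ only in their access pattern.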
The benchmark varies the size of the random indirection from 0 to the full vector size. The vector size is 2^28 doubles, which should fully utilize the GPU memory bandwidth: increasing the size beyond this point no longer improves throughput.
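One possible way to build an indirection table with a tunable randomization size is sketched below. This is an assumption about the construction, not the benchmark's actual code: the hypothetical helper `make_indirections` shuffles indices inside consecutive blocks of `window` elements, so `window = 0` gives the identity (fully coalesced) and `window = n` gives a random permutation of the whole vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns a permutation of 0..n-1 in which every index
// stays inside its own block of `window` consecutive elements.
std::vector<std::size_t> make_indirections(std::size_t n, std::size_t window,
                                           unsigned seed = 42) {
    assert(window <= n);
    std::vector<std::size_t> ind(n);
    std::iota(ind.begin(), ind.end(), std::size_t{0});  // identity mapping
    if (window < 2) return ind;  // nothing to shuffle: access stays coalesced

    std::mt19937 rng(seed);  // fixed seed so runs are reproducible
    for (std::size_t start = 0; start < n; start += window) {
        const std::size_t stop = std::min(start + window, n);
        std::shuffle(ind.begin() + static_cast<std::ptrdiff_t>(start),
                     ind.begin() + static_cast<std::ptrdiff_t>(stop), rng);
    }
    return ind;
}
```

Because the result is always a permutation, every element is still read and written exactly once. Only the distance between consecutive accesses changes with `window`.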
All plots generated by the benchmark are saved in the results/ folder. Each plot corresponds to one GPU and visualizes:
- Uncoalesced Read / Coalesced ratio
- Uncoalesced Write / Coalesced ratio
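The ratios above can be computed from kernel timings roughly as follows. This is a sketch under two assumptions that the benchmark may not share: the ratio is taken between effective bandwidths, and only the payload traffic (one read and one write per double) is counted, not the reads of the indirection table. The function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s for a copy of n doubles that took `seconds`.
// Assumption: each element costs one 8-byte read plus one 8-byte write;
// traffic for the indirection table is deliberately ignored.
double effective_bandwidth_gbs(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    const double bytes = 2.0 * static_cast<double>(n) * sizeof(double);
    return bytes / seconds / 1e9;
}

// Ratio plotted per GPU under this sketch: uncoalesced BW over coalesced BW.
// A value of 1 means the indirection costs nothing; smaller is worse.
double bandwidth_ratio(double uncoalesced_gbs, double coalesced_gbs) {
    assert(coalesced_gbs > 0.0);
    return uncoalesced_gbs / coalesced_gbs;
}
```

With 2^28 doubles, one pass moves 2 × 2^28 × 8 B ≈ 4.29 GB of payload, so an uncoalesced kernel that takes twice as long as the coalesced one yields a ratio of 0.5.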
Note: the measured coalesced BW exceeds the H100 BW (2 TB/s) because I was unable to flush the cache between runs. I suspect the A100 BW is skewed too. Great caches! :)
For all architectures, uncoalesced write operations are more expensive than uncoalesced read operations. This is not surprising, as writes are generally known to be more expensive than reads. However, we observe here that the ratio gets worse as the indirection grows.
The MI300A is an exception: for very large indirections, reads become more expensive than writes.
The way the performance decreases with the size of the indirection is a signature of the cache structure. Overall, we observe two plateaus that I think correspond to L1 cache misses and then L2 cache misses, respectively. The compressible gas dynamics enthusiast will see the striking resemblance with shock tube profiles!
Florent Duguet @ Nvidia: During non-coalesced read access, the various caches are populated with data, but different SMs accessing different data from the same cache line only replicate the read (from L2 cache to L1 cache). In the case of a write access, the L1 caches must be merged into the L2 cache to maintain coherence (i.e. L1 cache is write through), which requires additional processing by the memory units.
- Python 3 with numpy and matplotlib
- CMake
- Kokkos-enabled GPU (NVIDIA or AMD)
mkdir build
cd build
cmake -DKokkos_ENABLE_CUDA=ON .. # for NVIDIA GPUs
# OR
cmake -DKokkos_ENABLE_HIP=ON .. # for AMD GPUs
make
./../run-bench.sh
The plot will be saved as results/GPU_NAME_ratios.png.