GPU Uncoalesced Bandwidth Benchmark

1. Benchmark Explanation

This benchmark compares the memory bandwidth (BW) of several access patterns using Kokkos:

Coalesced Read/Write (Baseline)

KOKKOS_LAMBDA(const _SIZE_ i) {
    // Consecutive threads read and write consecutive elements.
    write(i) = read(i);
}

Uncoalesced Read

KOKKOS_LAMBDA(const _SIZE_ i) {
    // The read location comes from the indirection array; the write stays contiguous.
    _SIZE_ rindex = indirections(i);
    write(i) = read(rindex);
}

Uncoalesced Write

KOKKOS_LAMBDA(const _SIZE_ i) {
    // The write location comes from the indirection array; the read stays contiguous.
    _SIZE_ rindex = indirections(i);
    write(rindex) = read(i);
}

The benchmark varies the size of the random indirection from 0 to the full vector size. The vector holds 2^28 doubles (2 GiB), which is large enough to saturate GPU memory bandwidth: increasing the size beyond this point no longer improves throughput.
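To make the notion of "indirection size" concrete, here is a minimal host-side sketch of how such an index array could be built. It is an assumption for illustration only: `make_indirections`, its `window` parameter, and the wrap-around scheme are hypothetical and are not taken from compute_rindex.h, which may generate indices differently.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not the repository's implementation).
// Entry i is i plus a random offset drawn from [0, window], wrapped into [0, n).
// window == 0 yields the identity map, i.e. the coalesced pattern;
// window == n - 1 lets every entry land anywhere in the vector.
std::vector<std::size_t> make_indirections(std::size_t n, std::size_t window,
                                           unsigned seed = 42) {
    assert(n > 0 && window < n);
    std::mt19937_64 gen(seed);
    std::uniform_int_distribution<std::size_t> offset(0, window);
    std::vector<std::size_t> indirections(n);
    for (std::size_t i = 0; i < n; ++i) {
        indirections[i] = (i + offset(gen)) % n;
    }
    return indirections;
}
```

With this scheme the array is not a permutation, so in the uncoalesced write kernel several threads may target the same element. That is harmless for a bandwidth measurement but would matter if the output values were checked.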

2. Benchmark Results

All plots generated from the benchmark are saved in the results/ folder. Each plot corresponds to a GPU and visualizes:

Uncoalesced Read / Coalesced ratio
Uncoalesced Write / Coalesced ratio
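As a rough sketch of how these ratios can be derived from kernel timings, the snippet below computes an effective bandwidth and divides the uncoalesced value by the coalesced one. The function names are hypothetical, and two assumptions are made: each element counts as one double read plus one double written (the index array traffic is ignored), and the plotted ratio is a ratio of bandwidths rather than of run times.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s for a copy kernel over n doubles.
// Assumption: 2 * n * sizeof(double) bytes are moved; index traffic is ignored.
double effective_bandwidth_gbs(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    const double bytes = 2.0 * static_cast<double>(n) * sizeof(double);
    return bytes / seconds / 1.0e9;
}

// Ratio of uncoalesced to coalesced bandwidth (assumed definition of the plots).
double bandwidth_ratio(double uncoalesced_gbs, double coalesced_gbs) {
    assert(coalesced_gbs > 0.0);
    return uncoalesced_gbs / coalesced_gbs;
}
```

Because both kernels move the same number of bytes under this accounting, the bandwidth ratio equals the inverse of the run-time ratio.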

Note: the measured coalesced BW exceeds the H100 BW (2 TB/s) because I was unable to flush the caches. I suspect the A100 BW is skewed too. Great caches! :)

Intel results: courtesy of Daniel Arndt!

3. Observation / Interpretation

For all architectures, uncoalesced write operations are more expensive than uncoalesced read operations. This is not surprising, as writes are well known to be more expensive than reads. However, we observe here that the ratio gets worse as the indirection grows.

There is an exception for MI300A with the read becoming more expensive than the write for very large indirections.

The way performance decreases with the size of the indirection is a signature of the cache structure. Overall, we observe two plateaus that I think correspond to L1 cache misses and then L2 cache misses. The compressible gas dynamics enthusiast will see the striking resemblance with shock tube profiles!
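Under the hypothesis above, each drop in performance should appear once the indirection window no longer fits in a given cache. The sketch below is a deliberately crude model of that idea: it compares the footprint of a window of doubles against cache capacities supplied by the caller. `CacheLevel` and `level_for_window` are hypothetical names, and the capacities must be looked up for the GPU at hand; real behavior also depends on cache line size, associativity, and per-SM versus shared caches.

```cpp
#include <cassert>
#include <cstddef>

enum class CacheLevel { L1, L2, DRAM };

// Smallest memory level able to hold an indirection window of `window` doubles.
// l1_bytes and l2_bytes are caller-supplied capacities, not values measured
// by the benchmark.
CacheLevel level_for_window(std::size_t window, std::size_t l1_bytes,
                            std::size_t l2_bytes) {
    assert(l1_bytes <= l2_bytes);
    const std::size_t footprint = window * sizeof(double);
    if (footprint <= l1_bytes) return CacheLevel::L1;
    if (footprint <= l2_bytes) return CacheLevel::L2;
    return CacheLevel::DRAM;
}
```

For example, with placeholder capacities of 256 KiB for L1 and 50 MiB for L2, the model places the transitions at windows of 32768 and 6553600 doubles. Comparing such thresholds with the positions of the plateaus in the plots is one way to test the interpretation.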

Florent Duguet @ Nvidia: During non-coalesced read access, the various caches are populated with data, but different SMs accessing different data from the same cache line only replicate the read (from L2 cache to L1 cache). In the case of a write access, the L1 caches must be merged into the L2 cache to maintain coherence (i.e. L1 cache is write through), which requires additional processing by the memory units.

4. Compilation and Run Instructions

Requirements

Python 3 with numpy and matplotlib
CMake
Kokkos-enabled GPU (NVIDIA or AMD)

Build Steps

mkdir build
cd build
cmake -DKokkos_ENABLE_CUDA=ON ..        # for NVIDIA GPUs
# OR
cmake -DKokkos_ENABLE_HIP=ON ..         # for AMD GPUs
make

Run the Benchmark

./../run-bench.sh

The plot will be saved as results/GPU_NAME_ratios.png.
