This is an implementation of threadgroup-wide bitonic sort in HLSL.
Sometimes you need to sort elements within a single thread group on the GPU. The threadgroup_bitonic_sort.hlsli
header file provides multiple variants of bitonic sort that support any power-of-2 threadgroup size and up to 4096 sortable elements.
- It is agnostic of wave/warp sizes
- It automatically switches to sorting and shuffling within waves/warps by utilising wave intrinsics when the sizes of sorted/shuffled blocks become smaller than the size of waves/warps in a threadgroup (check out AMD RGA codegen on godbolt.org)
- It supports GPUs without wave intrinsic support
- It supports sorting of up to 4096 elements within a thread group (sorting 4096 elements requires a thread group of 1024 threads)
- For a thread group with N threads, it supports sorting of N, N * 2 or N * 4 elements
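
As a rough illustration of the wave-level switch described above, here is a minimal CPU-side sketch of the textbook bitonic sorting network in C++. It is not the code from threadgroup_bitonic_sort.hlsli. The names `bitonicSortReference`, `StepCounts` and `waveSize` are hypothetical, and the sketch assumes one element per thread and a power-of-2 wave size. It counts how many compare-exchange steps use a stride smaller than the wave size, which are the steps where both partners would sit in the same wave.

```cpp
#include <cassert>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper type for this sketch; not part of the header.
struct StepCounts {
    std::uint32_t crossWave = 0;   // stride >= waveSize: partner may be in another wave
    std::uint32_t withinWave = 0;  // stride <  waveSize: partner is in the same wave
};

// Sorts `keys` ascending in place with the classic bitonic network.
// Assumptions: keys.size() and waveSize are powers of two, and element i
// is owned by thread i (one element per thread).
StepCounts bitonicSortReference(std::vector<std::uint32_t>& keys,
                                std::size_t waveSize) {
    const std::size_t n = keys.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    assert(waveSize != 0 && (waveSize & (waveSize - 1)) == 0);

    StepCounts counts;
    // k is the size of the bitonic sequences being merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j is the distance between the two elements of each compare-exchange.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            if (j < waveSize) {
                ++counts.withinWave;
            } else {
                ++counts.crossWave;
            }
            // On the GPU each i would be one thread; here it is a plain loop.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    // Blocks of size k alternate between ascending and descending.
                    const bool ascending = (i & k) == 0;
                    if ((keys[i] > keys[partner]) == ascending) {
                        std::swap(keys[i], keys[partner]);
                    }
                }
            }
        }
    }
    return counts;
}
```

In this sketch, sorting 1024 keys with an assumed wave size of 32 takes 55 compare-exchange steps, of which 40 have a stride below the wave size. This suggests why handling small strides with wave intrinsics can be worthwhile, though the actual header may organise its steps differently.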
To build demo.cpp, run build.bat from Visual Studio Command Prompt. The batch file should automatically download the required packages (D3D12, DXC), build and run all shader variants as benchmarks.
The header file can be compiled with the DirectX Shader Compiler (DXC) release for February 2025 or earlier.
This header file is available to anybody free of charge, under the terms of the MIT License (see LICENSE.md).