GemmKernels.jl is shaping up to be usable as a replacement for the GPUArrays fallback matmul implementation, with much better performance. For example, here is a 2048x2048x2048 Float32 matmul.
GPUArrays:
julia> @benchmark CUDA.@sync GPUArrays.generic_matmatmul!(dC, dA, dB, true, false)
BenchmarkTools.Trial: 576 samples with 1 evaluation.
Range (min … max): 6.001 ms … 9.519 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.698 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.677 ms ± 352.550 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ █▅▁
▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▂▃▃▃▄▅▅█▇████▅▅▄▃▃▃ ▃
6 ms Histogram: frequency by time 9.22 ms <
Memory estimate: 4.81 KiB, allocs estimate: 88.
GemmKernels:
julia> @benchmark CUDA.@sync GemmKernels.mul!(dC, dA, dB)
BenchmarkTools.Trial: 7421 samples with 1 evaluation.
Range (min … max): 579.075 μs … 7.580 ms ┊ GC (min … max): 0.00% … 81.60%
Time (median): 670.235 μs ┊ GC (median): 0.00%
Time (mean ± σ): 669.440 μs ± 81.284 μs ┊ GC (mean ± σ): 0.12% ± 0.95%
▁ ▂█▅▂
▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▂▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▂▂▃▂▃▆▆█▄▅████▆▄▃▃▃▂▂▂▂ ▃
579 μs Histogram: frequency by time 692 μs <
Memory estimate: 6.34 KiB, allocs estimate: 115.
CUBLAS:
julia> @benchmark CUDA.@sync mul!(dC, dA, dB)
BenchmarkTools.Trial: 9659 samples with 1 evaluation.
Range (min … max): 404.287 μs … 593.765 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 513.306 μs ┊ GC (median): 0.00%
Time (mean ± σ): 512.696 μs ± 10.767 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁▆▇██▇▅▃▃▂▁ ▂
▄██▆▃▁▁▁▁▃▁▁▁▁▁▃▅▁▁▁▁▁▃▃▁▁▁▅█▄▁▃▁▁▃▁▅█▅▄▄▆███████████████████ █
404 μs Histogram: log(frequency) by time 530 μs <
Memory estimate: 3.81 KiB, allocs estimate: 84.
And for completeness, OpenBLAS:
julia> @benchmark mul!(C, A, B)
BenchmarkTools.Trial: 376 samples with 1 evaluation.
Range (min … max): 12.956 ms … 15.188 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.197 ms ┊ GC (median): 0.00%
Time (mean ± σ): 13.294 ms ± 356.341 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▅▄▇███▇▅▅▃▂▅▂▂▁
████████████████▆▁▆▁▅▆▆▅▅▅▁▇▁█▅▅▁▅▅▁▁▅▁▁▁▁▁▁▅▁▅▅▅▁▁▁▁▁▅▅▁▁▁▅ ▇
13 ms Histogram: log(frequency) by time 15.1 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
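To put the timings above on a common scale, here is a rough back-of-the-envelope conversion to throughput. It assumes the conventional count of 2·M·N·K floating-point operations for a matmul and uses the median times reported above; the symbols F and t are introduced here for illustration only, and the results are rounded estimates rather than measured values.

$$
\begin{aligned}
F &= 2 \cdot 2048^3 = 17{,}179{,}869{,}184 \approx 1.72 \times 10^{10}\ \text{FLOP} \\
\text{GPUArrays:}\quad & F / t = 1.72 \times 10^{10} / 8.698\ \text{ms} \approx 1.98\ \text{TFLOP/s} \\
\text{GemmKernels:}\quad & F / t = 1.72 \times 10^{10} / 670.235\ \mu\text{s} \approx 25.6\ \text{TFLOP/s} \\
\text{CUBLAS:}\quad & F / t = 1.72 \times 10^{10} / 513.306\ \mu\text{s} \approx 33.5\ \text{TFLOP/s} \\
\text{OpenBLAS:}\quad & F / t = 1.72 \times 10^{10} / 13.197\ \text{ms} \approx 1.30\ \text{TFLOP/s}
\end{aligned}
$$

By these medians, GemmKernels is roughly 13x faster than the GPUArrays fallback (8.698 ms / 670.235 μs) and about 1.3x slower than CUBLAS (670.235 μs / 513.306 μs), i.e. around 77% of CUBLAS throughput for this problem size.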