Use GemmKernels.jl for fallback mul! #1990

Open
@maleadt

GemmKernels.jl is shaping up to be usable as a replacement for the GPUArrays fallback matmul implementation, with much better performance. For example, here are timings for a 2048x2048x2048 Float32 matmul.

GPUArrays:

julia> @benchmark CUDA.@sync GPUArrays.generic_matmatmul!(dC, dA, dB, true, false)
BenchmarkTools.Trial: 576 samples with 1 evaluation.
 Range (min … max):  6.001 ms …   9.519 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.698 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.677 ms ± 352.550 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                 ▃ █▅▁
  ▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▂▃▃▃▄▅▅█▇████▅▅▄▃▃▃ ▃
  6 ms            Histogram: frequency by time        9.22 ms <

 Memory estimate: 4.81 KiB, allocs estimate: 88.

GemmKernels:

julia> @benchmark CUDA.@sync GemmKernels.mul!(dC, dA, dB)
BenchmarkTools.Trial: 7421 samples with 1 evaluation.
 Range (min … max):  579.075 μs …  7.580 ms  ┊ GC (min … max): 0.00% … 81.60%
 Time  (median):     670.235 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   669.440 μs ± 81.284 μs  ┊ GC (mean ± σ):  0.12% ±  0.95%

                                              ▁  ▂█▅▂
  ▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▂▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▂▂▃▂▃▆▆█▄▅████▆▄▃▃▃▂▂▂▂ ▃
  579 μs          Histogram: frequency by time          692 μs <

 Memory estimate: 6.34 KiB, allocs estimate: 115.

CUBLAS:

julia> @benchmark CUDA.@sync mul!(dC, dA, dB)
BenchmarkTools.Trial: 9659 samples with 1 evaluation.
 Range (min … max):  404.287 μs … 593.765 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     513.306 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   512.696 μs ±  10.767 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                            ▁    ▁ ▁▆▇██▇▅▃▃▂▁  ▂
  ▄██▆▃▁▁▁▁▃▁▁▁▁▁▃▅▁▁▁▁▁▃▃▁▁▁▅█▄▁▃▁▁▃▁▅█▅▄▄▆███████████████████ █
  404 μs        Histogram: log(frequency) by time        530 μs <

 Memory estimate: 3.81 KiB, allocs estimate: 84.

And for completeness, OpenBLAS:

julia> @benchmark mul!(C, A, B)
BenchmarkTools.Trial: 376 samples with 1 evaluation.
 Range (min … max):  12.956 ms …  15.188 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.197 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.294 ms ± 356.341 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▄▇███▇▅▅▃▂▅▂▂▁
  ████████████████▆▁▆▁▅▆▆▅▅▅▁▇▁█▅▅▁▅▅▁▁▅▁▁▁▁▁▁▅▁▅▅▅▁▁▁▁▁▅▅▁▁▁▅ ▇
  13 ms         Histogram: log(frequency) by time      15.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
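
To make the four results easier to compare, here is a rough sketch that converts the median times above into throughput and ratios. It assumes the usual 2*M*N*K operation count for a dense matmul (not stated in the issue), and the variable names are only illustrative.

```julia
# Median times copied from the benchmark outputs above, in seconds.
medians = [
    "GPUArrays"   => 8.698e-3,
    "GemmKernels" => 670.235e-6,
    "CUBLAS"      => 513.306e-6,
    "OpenBLAS"    => 13.197e-3,
]

# Assumption: a dense M×N×K matmul costs 2*M*N*K floating-point operations
# (one multiply and one add per inner-product term).
M = N = K = 2048
flops = 2 * M * N * K

# Throughput in TFLOP/s for a run that took `t` seconds.
tflops(t) = flops / t / 1e12

for (name, t) in medians
    println(rpad(name, 12), round(tflops(t); digits=2), " TFLOP/s")
end

# Ratios between the median times of the GPU implementations.
times = Dict(medians)
speedup_over_gpuarrays = times["GPUArrays"] / times["GemmKernels"]
slowdown_vs_cublas = times["GemmKernels"] / times["CUBLAS"]

println("GemmKernels vs GPUArrays fallback: ", round(speedup_over_gpuarrays; digits=2), "x faster")
println("GemmKernels vs CUBLAS: ", round(slowdown_vs_cublas; digits=2), "x slower")
```

Going by the medians, GemmKernels is roughly 13x faster than the current GPUArrays fallback and about 1.3x slower than CUBLAS.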
