Hi,
this is not a big deal, but I noticed that Julia's CUFFT is ~50% slower than PyTorch when performing an FFT on a real-valued array. If the array is complex, both run at the same speed. `rfft` also seems to perform fine, and it is the recommended transform for real-valued arrays anyway. But sometimes you don't care and just write `plan_fft` because you don't want to assume anything about the input type.
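A minimal sketch of a workaround, assuming one still wants a single type-agnostic entry point (the `plan_for` helper is a hypothetical name of mine, not part of CUDA.jl): dispatch on the element type and hand real input to `plan_rfft`:

```julia
using CUDA, CUDA.CUFFT

# Hypothetical helper: pick a plan based on the input element type.
plan_for(x::CuArray{<:Complex}, dims) = plan_fft(x, dims)
plan_for(x::CuArray{<:Real},    dims) = plan_rfft(x, dims)  # real input: r2c plan

xr = CUDA.rand(Float32, 256, 256, 256)
p  = plan_for(xr, (1, 2))
y  = p * xr  # note: the r2c result is half-size along the first transformed dim
```

The downside is that the r2c plan returns a half-size spectrum, so downstream code has to know which kind of plan it got.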
```julia
julia> using CUDA, CUDA.CUFFT, BenchmarkTools

julia> function ff(sz)
           xc = CUDA.rand(Float32, sz...)
           p = plan_fft(xc, (1, 2))
           @benchmark CUDA.@sync $p * $xc
       end
ff (generic function with 1 method)

julia> ff((256, 256, 256))
BenchmarkTools.Trial: 1577 samples with 1 evaluation.
 Range (min … max):  3.132 ms … 24.120 ms  ┊ GC (min … max): 0.00% … 98.37%
 Time  (median):     3.152 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.167 ms ± 528.090 μs ┊ GC (mean ± σ):  1.17% ± 3.55%
 [histogram omitted]
 Memory estimate: 4.09 KiB, allocs estimate: 161.

julia> ff((150, 150, 150))
BenchmarkTools.Trial: 7474 samples with 1 evaluation.
 Range (min … max):  654.434 μs … 37.787 ms  ┊ GC (min … max): 0.00% … 98.53%
 Time  (median):     659.324 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   666.257 μs ± 429.687 μs ┊ GC (mean ± σ):  1.40% ± 4.80%
 [histogram omitted]
 Memory estimate: 4.09 KiB, allocs estimate: 161.
```
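For reference, the recommended real-to-complex path can be benchmarked the same way. A minimal sketch (the `ff_r` name is mine; I haven't included its numbers here):

```julia
# Same benchmark, but with a real-to-complex plan (the recommended path
# for real-valued input mentioned above).
function ff_r(sz)
    xr = CUDA.rand(Float32, sz...)
    p = plan_rfft(xr, (1, 2))
    @benchmark CUDA.@sync $p * $xr
end
```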
```python
In [26]: xc = torch.rand(256, 256, 256).cuda()

In [27]: %%timeit
    ...: torch.fft.fft2(xc)
    ...: torch.cuda.synchronize()
    ...:
2.23 ms ± 783 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
```python
In [28]: xc = torch.rand(150, 150, 150).cuda()

In [29]: %%timeit
    ...: torch.fft.fft2(xc)
    ...: torch.cuda.synchronize()
    ...:
466 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Complex input:
```julia
julia> function ff(sz)
           xc = CUDA.rand(ComplexF32, sz...)
           p = plan_fft(xc, (1, 2))
           @benchmark CUDA.@sync $p * $xc
       end
ff (generic function with 1 method)

julia> ff((150, 150, 150))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  345.697 μs … 46.356 ms  ┊ GC (min … max): 0.00% … 98.87%
 Time  (median):     347.790 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   353.966 μs ± 460.525 μs ┊ GC (mean ± σ):  1.65% ± 3.04%
 [histogram omitted]
 Memory estimate: 672 bytes, allocs estimate: 21.
```
```python
In [30]: xc = torch.rand(150, 150, 150).cuda() + 0j

In [31]: %%timeit
    ...: torch.fft.fft2(xc)
    ...: torch.cuda.synchronize()
    ...:
343 µs ± 168 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
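One unverified guess: applying a complex-to-complex plan to real input requires an extra `Float32 → ComplexF32` conversion (and allocation) somewhere, which could account for part of the gap. A minimal sketch to time just that conversion:

```julia
# Hedged sketch: time only the eltype conversion that a c2c plan would need
# for real input. This is a guess at the overhead source, not a confirmed cause.
using CUDA, BenchmarkTools
xr = CUDA.rand(Float32, 150, 150, 150)
@benchmark CUDA.@sync ComplexF32.($xr)
```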
Versions:
```julia
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = 24
  JULIA_MAX_NUM_PRECOMPILE_FILES = 100

julia> CUDA.versioninfo()
CUDA runtime 12.5, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries:
- CUBLAS: 12.5.2
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.2
- CUSPARSE: 12.4.1
- CUPTI: 23.0.0
- NVML: 12.0.0+545.23.8

Julia packages:
- CUDA: 5.4.2
- CUDA_Driver_jll: 0.9.0+0
- CUDA_Runtime_jll: 0.14.0+1

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 3060 (sm_86, 5.010 GiB / 12.000 GiB available)
```
```python
In [32]: torch.__version__
Out[32]: '2.0.1+cu117'
```