ggml: adds CONV_2D op and direct GEMM Vulkan implementation #14316
Conversation
Very cool!
On my RX 470 the indirect op is faster as well. IMO it's worth testing with more input and kernel sizes like what we have for im2col, and the real test is to get this set up with stablediffusion.cpp (though that thing hasn't been updated for months) to see how it does with an actual model.
|
Sure, older models might introduce other bottlenecks that cause the shader to slow down, but the memory saving is still a considerable advantage. I'm thinking about reimplementing the shader in CUDA so I can profile it with Nsight to see what causes the issue (hopefully it still supports ancient cards). |
Out of curiosity I have tested it on a Mali GPU:
./test-backend-ops -o CONV_2D_DIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices
Backend 1/2: Vulkan0
Device description: Mali-G715
Device memory: 11229 MB (11229 MB free)
CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): 1 runs - 9295123.00 us/run - 137.42 GFLOP/run - 14.78 GFLOPS
Backend Vulkan0: OK
Backend 2/2: CPU
Skipping
2/2 backends passed
OK
./test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices
Backend 1/2: Vulkan0
Device description: Mali-G715
Device memory: 11229 MB (11229 MB free)
CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): 1 runs - 4126959.00 us/run - 137.42 GFLOP/run - 33.30 GFLOPS
Backend Vulkan0: OK
Backend 2/2: CPU
Skipping
2/2 backends passed
OK |
Hi, thanks for testing! Please disable coopmats for a fair comparison, because my algorithm is currently fp32 scalar while the indirect one is mixed precision and uses matrix cores. Anyway, I have already found cases where my algorithm is slower and I will update it soon. |
Here without coopmat:
GGML_VK_DISABLE_COOPMAT=1 ./test-backend-ops -o CONV_2D_DIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 1 | matrix cores: none
Testing 2 devices
Backend 1/2: Vulkan0
Device description: Mali-G715
Device memory: 11229 MB (11229 MB free)
CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): 1 runs - 12637915.00 us/run - 137.42 GFLOP/run - 10.87 GFLOPS
Backend Vulkan0: OK
Backend 2/2: CPU
Skipping
2/2 backends passed
OK
GGML_VK_DISABLE_COOPMAT=1 ./test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 1 | matrix cores: none
Testing 2 devices
Backend 1/2: Vulkan0
Device description: Mali-G715
Device memory: 11229 MB (11229 MB free)
CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): 0: 0x7b4d6d8378
1: 0x7b4d6d8284 ggml_print_backtrace
2: 0x7b4d6e8f3c
3: 0x7b4d74864c
4: 0x7b4d762c30 __cxa_get_exception_ptr
5: 0x7b4d762c0c
6: 0x7b4dad988c
7: 0x7b4dabae48
8: 0x7b4dab57d0
9: 0x7b4d6ebb54 ggml_backend_graph_compute
10: 0x5a933e51a8
11: 0x5a933d7ce8
12: 0x7b48712218 __libc_init
libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Device::getFenceStatus: ErrorDeviceLost
Aborted
For some reason it crashes when running the indirect conv2d without coopmat. |
This is cool. Here are results from my hardware (with coopmat/coopmat2 disabled):
|
Unfortunately this branch contains a logical error, and so does the code in this pull request, so the improvement is smaller. I will push the corrected version today (it still has an edge over the indirect impl on my 2060, so it might be worth testing). I have not updated it recently because I am working on better shared memory handling, as the bank conflicts slow down the kernel too much. Edit: deleted the branch to prevent further confusion. |
Fair enough, I'll redo the test once you publish the fixed version. |
@0cc4m I've fixed some trivial errors in my out-of-tree update: etasnadi@50a29f4. I am curious how fast it is on your devices. My experiments show that it is 15% faster than im2col+SGEMM (the indirect implementation) on my Pascal device for a large matrix and 40% faster on my Turing desktop GPU on a large problem (4096x4096x4096), while using far less memory (my alg does not store the im2col matrix consuming as much space as
@netrunnereve I added a few test cases for performance measurements of shapes that are common in convolutional neural networks. I simplified my ifs to minimize branch divergence so I do not need macros anymore (I falsely assumed that the compiler would do this). This made my kernel as fast as the indirect op on my old device. The kernel executes many non-const divisions when loading data, which seriously affected performance on my older device, so I added support for collective ops (warp shuffle) to mitigate this issue, probably caused by the limited number of SFUs. Such ops were introduced with Kepler, but they can be disabled with a macro if we want to support even older hardware. My code still has serious bank conflicts that I chose not to eliminate yet, because the fix would not be compatible with coopmats.
|
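A minimal sketch of what the collective-ops (warp shuffle) idea looks like in a Vulkan compute shader; the names, bindings and overall structure below are assumptions for illustration, not the PR's actual code. Each invocation performs the division/modulo decomposition for only one crs index of the current tile, and the decompositions of all other elements are then read from neighbouring lanes with subgroupShuffle instead of being recomputed per element:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_shuffle : require

layout(local_size_x = 32) in;

// Hypothetical push constants: kernel width/height and the first crs index of the tile.
layout(push_constant) uniform P { uint KW, KH, crs_base; } p;

// Stand-in consumer so the sketch is complete; the real shader would use the
// decomposed indices to address the input tensor inside the GEMM inner loop.
layout(std430, binding = 0) writeonly buffer Out { uvec4 decomp[]; };

void main() {
    // One division/modulo chain per lane, for this lane's own crs index...
    uint my_crs = p.crs_base + gl_SubgroupInvocationID;
    uint my_c   = my_crs / (p.KH * p.KW);
    uint my_ky  = (my_crs / p.KW) % p.KH;
    uint my_kx  = my_crs % p.KW;

    // ...then every lane obtains the (c, ky, kx) of all other tile elements via
    // shuffles instead of repeating the divisions gl_SubgroupSize times.
    for (uint i = 0; i < gl_SubgroupSize; ++i) {
        uint c  = subgroupShuffle(my_c,  i);
        uint ky = subgroupShuffle(my_ky, i);
        uint kx = subgroupShuffle(my_kx, i);
        if (gl_SubgroupInvocationID == 0u) {
            decomp[p.crs_base + i] = uvec4(c, ky, kx, 0u);
        }
    }
}
```

This corresponds to the use_collectives toggle discussed later in the thread, which allows the shuffle path to be disabled on drivers or hardware where subgroup shuffles are unsupported or problematic.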
Here are updated values using your new branch:
Performance looks good. |
I guess these are the mean FLOPS over all test cases. I don't know what the issue with Intel is; can you attach a log so I can see which tests are failing? |
Here are my numbers with etasnadi@50a29f4 on my 470. The last test is running a bit slower but otherwise everything looks good.
@0cc4m did you run your tests using the specific commit etasnadi@50a29f4? The ggml/conv_2d branch is outdated and only has a single test. |
* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with gemm (no need for im2col)
* test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations having different graphs, adds tests
* eliminate redundant calculation, macros removed
* Kernel shared memory size check
* Updates test-backend-ops to support graphs for performance measurement.
I used the new branch, but only copied the TFLOPS number from the first test, the large one. I made sure the others were improved as well, though. |
* Subgroup size used to determine tile size -> fixes llvmpipe errors.
7f9b659 might work on Intel too (at least it is now working in llvmpipe, as the previous commit that you tested contained a bug dependent on the subgroup size). |
@0cc4m I refactored
|
Here are full results:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
The tests now pass on Intel, but performance is terrible. There might be some subgroup size tweaks we can try to fix this. |
The new commit a09e8f5 disables subgroup ops completely by default (can be enabled with |
@0cc4m Please also report the string " --> BS_CRS=%d use_collectives=%d" printed to stderr on Intel; that might be useful to make sure that the subgroup sizes are properly configured. |
What does help is forcing the subgroup size to BS_CRS=16 on Intel:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 2eb7415c5..e1bba0f84 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3054,10 +3054,12 @@ static void ggml_vk_load_shaders(vk_device& device) {
conv2d_BS_CRS = std::min(device->subgroup_size, conv2d_BS_CRS);
}
}
-
+
std::cerr << " --> BS_CRS=" << conv2d_BS_CRS << " use_collectives=" << use_collectives << std::endl;
- if(device->subgroup_shuffle){
+ if(device->subgroup_shuffle && device->subgroup_size_control && device->subgroup_min_size <= 16 && device->subgroup_max_size >= 16){
+ ggml_vk_create_pipeline(device, device->pipeline_conv2d_f32, "conv2d_f32", conv2d_f32_len, conv2d_f32_data, "main", 3, sizeof(vk_op_conv2d_push_constants), {conv2d_BS_K, conv2d_BS_NPQ, 1}, {conv2d_WG_SIZE, conv2d_BS_K, conv2d_BS_CRS, conv2d_BS_NPQ, conv2d_TS_K, use_collectives}, 1, true, true, 16);
+ }else if(device->subgroup_shuffle){
ggml_vk_create_pipeline(device, device->pipeline_conv2d_f32, "conv2d_f32", conv2d_f32_len, conv2d_f32_data, "main", 3, sizeof(vk_op_conv2d_push_constants), {conv2d_BS_K, conv2d_BS_NPQ, 1}, {conv2d_WG_SIZE, conv2d_BS_K, conv2d_BS_CRS, conv2d_BS_NPQ, conv2d_TS_K, use_collectives}, 1, true, true);
}else{
ggml_vk_create_pipeline(device, device->pipeline_conv2d_f32, "conv2d_f32", conv2d_f32_len, conv2d_f32_data, "main", 3, sizeof(vk_op_conv2d_push_constants), {conv2d_BS_K, conv2d_BS_NPQ, 1}, {conv2d_WG_SIZE, conv2d_BS_K, conv2d_BS_CRS, conv2d_BS_NPQ, conv2d_TS_K, use_collectives}, 1, true);
But this is still worse than the indirect op, and also worse than your (incorrect) earlier attempt. If you can think of something, we can give it a shot, but if not it's fine. |
There might be one more thing worth a shot if you are still motivated. I have an
What we have not tried yet is to disable collectives and set
Can you set
Please use a09e8f5, as the most recent version disables Intel support. |
Yeah, when not forcing full subgroups and collectives, it works correctly and is fast:
But we still want to use them on Nvidia and AMD since they make a measurable positive difference there. As a side note, you might have triggered a MoltenVK shader compiler bug:
|
Great, I did not expect a 2.5-3x speedup (and correct output at the same time). No problem with the collectives, because we can disable them for Intel and enable them otherwise in a new commit. The Apple bug is entirely new info for me; isn't there a GitHub pipeline to check which versions of my code crashed? I do not have access to any Apple devices currently, and this is probably not enough info to locate the error. A debug-mode execution log and the validation layers' output might show more info to better locate the error. Do all other kernels pass the tests on MoltenVK? Could you please tweak the usual hotspots, e.g. enable/disable the shuffle op and full_subgroups? If the problem is with the subgroups, then I might be doing something illegal in the kernel that does not pop up on Nvidia/AMD, or both MoltenVK and the Intel driver have some bug. Thanks! |
It happens regardless of the collectives or full_subgroups setting. Something else is triggering it. Some other shaders are failing to produce correct results on Apple, but your conv2d is the only one that currently crashes on build. Without the hardware there's nothing you can do to debug it. I'll try to figure out how to get MoltenVK to provide more info, but it's not overly important. Apple users should be using Metal in most cases; I only know of some niche docker/VM cases where that is not possible but Vulkan is. I just tried it out of curiosity. |
I guess you mean pipeline creation on build, don't you? One guess: my pipeline configures the workgroup size using a spec constant (
It crashes on pipeline creation during runtime, which is the final compile step from SPIR-V to device-specific code, in this case SPIR-V to Metal. |
Then there is a chance that local_size_x_id is actually the problem.
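For reference, this is roughly how a spec-constant-driven workgroup size appears on the shader side; this is a generic GLSL example, not the PR's exact declaration, and the constant IDs here are assumptions. If MoltenVK's SPIR-V-to-MSL translation mishandles local_size_x_id, replacing it with a fixed local_size_x would be the workaround to try:

```glsl
#version 450

// Workgroup width comes from specialization constant 0, chosen by the host at
// pipeline-creation time; tile parameters such as BS_CRS can also be spec
// constants (the ID and default used here are illustrative).
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
layout(constant_id = 1) const uint BS_CRS = 16;

void main() {
    // shader body elided
}
```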
The last commit a672803 should be fast on all tested archs, but it might need a test to check whether the op is successfully disabled on Apple and collectives are turned off on Intel. |
Apple looks correct:
AMD:
Nvidia too:
However, when you reenable coopmat:
And coopmat2:
Is the direct path disabled already when coopmat or coopmat2 are available? |
What do I need to see here? The
The |
I hadn't looked at the op details yet. In that case, when we start using the CONV_2D op in models, both direct and indirect options should be possible. |
Yes. There are advantages to the im2col version too: if a highly optimized linear algebra library is available that is already tuned for each device (as is the case in the CUDA backend with cuBLAS), then the indirect op can be more competitive, at the cost of wasting memory. |
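To put rough numbers on that memory tradeoff, using the benchmark shape from the logs above (input ne=[19,19,256,16], kernel ne=[4,4,256,4096], stride 1, no padding): the output is 16x16 per image, so the im2col matrix has CRS = 256*4*4 = 4096 rows and NPQ = 16*16*16 = 4096 columns, i.e. roughly 4096*4096*4 bytes, about 64 MiB of fp32 scratch per graph evaluation that the direct op never materializes.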
@0cc4m Do I need to add anything to this PR to get it approved? One pipeline fails sometimes, but it does not seem to be affected by this PR. |
This patch adds support for direct computation of 2D convolution on the Vulkan backend: it takes the form of a custom GEMM that loads the relevant data from the kernel and the input into shared memory, so it does not need to materialize the convolution matrix in global memory with im2col, which saves a lot of memory - similar to how the op is implemented in cuDNN. This logic can theoretically also result in faster kernels than im2col->matmul, because the full helper matrix does not have to be transferred between GMEM and registers, and the repeating elements of the (virtual) helper matrix can be pulled from L2.
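As a concrete illustration of the idea, here is a deliberately untiled GLSL sketch; it is not the PR's shader, and the bindings, push-constant names and tensor layouts below are assumptions chosen to mirror ggml's ne ordering. Each invocation computes one element of the Cout x NPQ output GEMM and fetches the elements of the virtual im2col matrix directly from the input tensor via index math:

```glsl
#version 450
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

layout(std430, binding = 0) readonly  buffer Knl { float knl[]; };  // ne = [KW, KH, Cin, Cout]
layout(std430, binding = 1) readonly  buffer Src { float src[]; };  // ne = [W, H, Cin, N]
layout(std430, binding = 2) writeonly buffer Dst { float dst[]; };  // ne = [OW, OH, Cout, N]

layout(push_constant) uniform P {
    uint Cout, Cin, N;
    uint KW, KH, W, H, OW, OH;
    uint s0, s1, p0, p1, d0, d1;   // stride, padding, dilation
} p;

void main() {
    uint k   = gl_GlobalInvocationID.y;   // GEMM row    = output channel
    uint npq = gl_GlobalInvocationID.x;   // GEMM column = flattened (n, oy, ox)
    if (k >= p.Cout || npq >= p.N * p.OH * p.OW) { return; }

    uint n  = npq / (p.OH * p.OW);
    uint oy = (npq / p.OW) % p.OH;
    uint ox = npq % p.OW;

    uint  CRS = p.Cin * p.KH * p.KW;      // GEMM inner dimension
    float acc = 0.0;
    for (uint crs = 0; crs < CRS; ++crs) {
        uint c  = crs / (p.KH * p.KW);
        uint ky = (crs / p.KW) % p.KH;
        uint kx = crs % p.KW;

        // B[crs][npq] of the virtual im2col matrix, fetched straight from the input.
        int iy = int(oy * p.s1 + ky * p.d1) - int(p.p1);
        int ix = int(ox * p.s0 + kx * p.d0) - int(p.p0);
        float b = 0.0;
        if (iy >= 0 && iy < int(p.H) && ix >= 0 && ix < int(p.W)) {
            b = src[((n * p.Cin + c) * p.H + uint(iy)) * p.W + uint(ix)];
        }
        // A[k][crs] is the kernel viewed as a Cout x CRS matrix.
        acc += knl[((k * p.Cin + c) * p.KH + ky) * p.KW + kx] * b;
    }
    dst[((n * p.Cout + k) * p.OH + oy) * p.OW + ox] = acc;
}
```

The actual implementation additionally tiles the A and B blocks into shared memory per workgroup, which is where the blocktile-size and bank-conflict considerations discussed above come in.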
The performance is 2x compared to im2col->matmul on an RTX 2060 (2.15 TFLOPS vs. 4.10 TFLOPS according to test-backend-ops; the theoretical max is ~6 TFLOPS). As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs. 1.21 TFLOPS; theoretical max is ~3 TFLOPS), which might be because the block tile sizes are too big for this older hardware.
The PR also adds support for comparing ops with different implementation graphs in test-backend-ops, so one can compare/test the actual op under development (potentially fused and optimized) against a reference op, even when the new op does not have a direct CPU implementation yet, which makes op development faster.