Vulkan: Tuning warptile for Mali GPU Performance #13483

rmatif · 2025-05-12T15:20:30Z

rmatif
May 12, 2025
Collaborator

I'm working on Local Diffusion, using stable-diffusion.cpp on Android. Vulkan performance on Mali GPUs is currently very poor.

Disabling mul_mat_l in ggml-vulkan.cpp helped a bit. I then tried modifying the m_warptile and s_warptile values. Reducing the first element (m tile?) from 128 to 64 gave a ~3x inference speedup, but the output images were garbage/noisy.

Questions:

How can I correctly tune m_warptile and s_warptile for Mali GPUs to get both performance and correct output?
Are there specific alignment requirements for these values on Mali?
Do the matmul shaders need to be adapted if these warptile values are changed?

I'm looking for guidance on improving Vulkan matmul performance on Mali without breaking correctness.

jeffbolznv · 2025-05-15T13:19:10Z

jeffbolznv
May 15, 2025
Collaborator

What is the warp size and shared memory size for this GPU? These should be printed out on startup.

The first value is the workgroup size. I'm surprised this broke things unless the workgroup size is smaller than the warp size.

Which is currently faster, m_warptile or s_warptile?

rmatif · 2025-05-15T13:59:12Z

rmatif
May 15, 2025
Collaborator Author

@jeffbolznv

Thanks for the response. Here's the warp size and shared memory size of the GPU:

I pretty much brute-forced all possible combinations while tuning m_warptile and s_warptile, but the output was always broken. That said, I only tested with stable-diffusion.cpp and not with llama.cpp. In theory it should behave the same, since it's just mul_mat under the hood, unless the im2col op is somehow affecting it. I'll try it with llama.cpp later today to confirm.

In my case, m_warptile turned out to be significantly faster than s_warptile. I don't remember the exact numbers, but the difference was very noticeable.

6 replies

rmatif May 15, 2025
Collaborator Author

@jeffbolznv

You were right. I just tested this in llama.cpp, and changing the first element of m_warptile to 64 doesn't break anything, but it also doesn't provide a performance gain.

The issue seems specific to stable-diffusion.cpp. Since most of the workload involves conv_2d, I thought im2col might require some sort of alignment. Other than that, I'm not sure why it works fine in llama.cpp but not in sd.cpp. Any hints?

jeffbolznv May 16, 2025
Collaborator

I don't know why it would work in llama but not sd.

0cc4m May 16, 2025
Collaborator

@rmatif Instead of testing with stable-diffusion.cpp or llama.cpp inference, first make sure the MUL_MAT unit tests (test-backend-ops binary) pass. There are also more ways to debug the Vulkan backend internally, if required, but as a first step the unit tests should be enough.

rmatif May 16, 2025
Collaborator Author

@0cc4m Thanks for the suggestion. I ran test-backend-ops with the m_warptile workgroup size set to both 128 and 64. There are two mul_mat ops that fail with size 64 but pass with 128. I believe this is the root of the problem. Here are the details of the two op failures with size 64:

m_warptile = { 64, 64, 64, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };

  MUL_MAT(type_a=f16,type_b=f32,m=64,n=45,k=128,bs=[8,1],nr=[4,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.408776376 > 0.000500000 FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=45,k=64,bs=[8,1],nr=[4,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.418035493 > 0.000500000 FAIL

I was wondering what happens when an op fails its test but is still invoked during inference. Is it simply skipped, and is that why I saw a speedup in inference?
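For context on the failing lines above: NMSE is a normalized mean squared error, and the test compares it against a threshold of 0.0005. A value around 0.41 means the result is far off from the reference, not just slightly imprecise. The sketch below uses the usual definition (squared error divided by the squared magnitude of the reference). The function name `nmse` and the exact choice of which side is the reference are assumptions here, not taken from test-backend-ops.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Normalized mean squared error: sum((ref - act)^2) / sum(ref^2).
// Hypothetical helper for illustration; the real test harness may differ.
double nmse(const std::vector<float>& reference, const std::vector<float>& actual) {
    const double inf = std::numeric_limits<double>::infinity();
    if (reference.size() != actual.size()) {
        return inf; // shapes must match for the comparison to be meaningful
    }
    double err = 0.0; // accumulated squared difference
    double ref = 0.0; // accumulated squared reference magnitude
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double r = reference[i];
        const double d = r - static_cast<double>(actual[i]);
        err += d * d;
        ref += r * r;
    }
    if (ref == 0.0) {
        return err == 0.0 ? 0.0 : inf; // avoid dividing by zero
    }
    return err / ref;
}
```

With this definition, identical vectors give 0, and an output that is all zeros gives exactly 1, so 0.41 is closer to "mostly wrong" than to rounding noise.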

0cc4m May 16, 2025
Collaborator

No, the tests exist only to pinpoint issues that were accidentally introduced, or to check newly implemented ops for correctness. They have no influence on inference.

rmatif · 2025-05-17T06:42:05Z

rmatif
May 17, 2025
Collaborator Author

@0cc4m On my GTX 1060 I was able to reproduce both the ops failure and the broken sd.cpp inference when switching to a workgroup size of 64, so it's not a Mali-specific issue. I think you can reproduce it on your end as well. I'm not sure how to dig deeper into this; if you could help or take a look yourself, that would be greatly appreciated. I believe that a properly tuned Vulkan backend on Mali could already match or even surpass CPU performance on sd.cpp.

3 replies

0cc4m May 17, 2025
Collaborator

You need to look into the meaning of the warptile parameters; they are not independent. I'll try to summarize what I remember:

The 11 parameters are: BLOCK_SIZE, BM, BN, BK, WM, WN, WMITER, TM, TN, TK and WARP.
They originate from this CUDA article, look at the kernel 10 information: https://siboehm.com/articles/22/CUDA-MMM
Especially the diagram is helpful.

For your problem: You need to make sure that the number of warps in the workgroup (BLOCK_SIZE / WARP) is identical to the number of warptiles, (BM / WM) * (BN / WN). For example, in the Nvidia case (warps of size 32) we have a workgroup of size BLOCK_SIZE=128, meaning 4 warps. BM=64, BN=64 and WM=32, WN=32 means we have 2 * 2 = 4 tiles. This is why it works.

In your WARP=16 case, by default you have BLOCK_SIZE=128, meaning 8 warps. BM=64, BN=64 and WM=32, WN=16 means 2 * 4 = 8 tiles. If you reduce the workgroup size to 64, you only have 4 warps and it doesn't work out. You need to adapt WM and WN so that you end up with 4 tiles as well.
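The rule above can be written down as a small compile-time check. This is only a sketch: the `Warptile` struct and `warps_match_tiles` are hypothetical names that do not exist in ggml-vulkan.cpp, and the struct keeps just the six parameters that the rule involves.

```cpp
#include <cstdint>

// Hypothetical subset of the 11 warptile parameters (not a ggml type).
struct Warptile {
    std::uint32_t block_size; // BLOCK_SIZE: threads per workgroup
    std::uint32_t bm, bn;     // BM, BN: output tile handled by one workgroup
    std::uint32_t wm, wn;     // WM, WN: output tile handled by one warp
    std::uint32_t warp;       // WARP: threads per warp
};

// True when BLOCK_SIZE / WARP equals (BM / WM) * (BN / WN),
// with all divisions required to be exact.
constexpr bool warps_match_tiles(const Warptile& w) {
    if (w.warp == 0 || w.wm == 0 || w.wn == 0) return false;
    if (w.block_size % w.warp != 0) return false;
    if (w.bm % w.wm != 0 || w.bn % w.wn != 0) return false;
    const std::uint32_t warps = w.block_size / w.warp;
    const std::uint32_t tiles = (w.bm / w.wm) * (w.bn / w.wn);
    return warps == tiles;
}

// The cases from the explanation above.
static_assert(warps_match_tiles(Warptile{128, 64, 64, 32, 32, 32}),
              "WARP=32: 4 warps, 4 tiles");
static_assert(warps_match_tiles(Warptile{128, 64, 64, 32, 16, 16}),
              "WARP=16: 8 warps, 8 tiles");
static_assert(!warps_match_tiles(Warptile{64, 64, 64, 32, 16, 16}),
              "WARP=16, BLOCK_SIZE=64: 4 warps but still 8 tiles");
// One possible repair for BLOCK_SIZE=64 with WARP=16: WM=32, WN=32.
static_assert(warps_match_tiles(Warptile{64, 64, 64, 32, 32, 16}),
              "4 warps, 4 tiles");
```

Passing this check says nothing about speed. It only rules out combinations where warps and tiles do not line up, which is the mismatch described above.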

The remaining values WMITER, TM and TN are less error-prone; you can try a few (low) values to see if it makes a difference. They only affect the distribution of work for each thread. TK doesn't do anything without coopmat.
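To illustrate how WMITER, TM and TN distribute work, here is a sketch based on the kernel 10 formulation in the linked CUDA article, where each thread computes a TM x TN patch and a warp covers its WM x WN tile in WMITER x WNITER passes. That the Vulkan mul_mm shader derives WNITER in exactly this way is an assumption. `WarpSplit` and `split_warptile` are hypothetical names, and the TM=4, TN=2 values in the usage note are placeholders rather than the actual tm_m and tn_m.

```cpp
#include <cstdint>

// Quantities derived from a warptile, per the article's kernel 10.
struct WarpSplit {
    std::uint32_t wniter; // passes of a warp along N
    std::uint32_t wsubm;  // rows of the sub-tile covered in one pass
    std::uint32_t wsubn;  // columns of the sub-tile covered in one pass
    bool valid;           // false if the parameters do not divide evenly
};

constexpr WarpSplit split_warptile(std::uint32_t wm, std::uint32_t wn,
                                   std::uint32_t wmiter, std::uint32_t tm,
                                   std::uint32_t tn, std::uint32_t warp) {
    // Outputs produced by one warp in one pass along N, over all M passes.
    const std::uint32_t per_pass = warp * tm * tn * wmiter;
    if (per_pass == 0 || wm % wmiter != 0 || (wm * wn) % per_pass != 0) {
        return {0, 0, 0, false};
    }
    const std::uint32_t wniter = (wm * wn) / per_pass;
    if (wniter == 0 || wn % wniter != 0) {
        return {0, 0, 0, false};
    }
    return {wniter, wm / wmiter, wn / wniter, true};
}
```

For example, `split_warptile(32, 32, 2, 4, 2, 32)` yields WNITER=2 with 16 x 16 sub-tiles, which is exactly 32 threads times 4 x 2 outputs each. Combinations where the warptile is too small for WARP * TM * TN * WMITER come back as invalid.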

Usually you want to keep WARP identical to the physical warp/wave size of the GPU, but because the mul_mm shader doesn't use any subgroup intrinsics, you can change it as well without any immediate issues. On AMD GCN there were some tuning results where reducing this below the actual warp size helped performance.

I hope this helps. Let me know if you have further questions.

Answer selected by rmatif

rmatif May 22, 2025
Collaborator Author

@0cc4m

Thank you very much! It was really helpful in understanding what the values I was trying to tweak actually refer to, although it's a bit difficult to get into at first.

Unfortunately, I didn’t see any performance gains from tuning the warptile. I even tested it on a more recent and powerful GPU (the ARM Immortalis-G715 MC10), but regardless of what I tried, the performance consistently lagged behind the CPU.

I'm not sure if there's anything else I can try that just involves tuning, but hopefully one day we'll get better support on Android.

0cc4m May 23, 2025
Collaborator

Thank you for trying it. I think mobile GPUs might need a different kind of matmul shader. This one relies on large caches, which dedicated GPUs have, but mobile GPUs might be different. I don't know what the best way to optimize for mobile is, though.
