[ET-VK] Better work group sizes for matmul #13378

pytorchbot · 2025-08-13T17:52:41Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #13185 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/272/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/272/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/272/orig
@diff-train-skip-merge

Pull Request resolved: #13185 ## Context Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like ``` shader globalwg size localwg size =========== ===================== ==================== ============= linear_qga4w_tiled_texture3d_texture3d_texture2d_float {256, 29, 1} {32, 2, 1} 1487 matmul_naive_texture3d_float {29, 115, 32} {4, 2, 8} 712 ``` for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409. However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor. If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded. This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders. ## Changes * Introduce `pick_hw_square_wg_size` * Use the new local work group size determination function for Quantized Linear, Matmul, and Linear ghstack-source-id: 302703877 Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

pytorch-bot · 2025-08-13T17:52:45Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13378

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 87 Pending

As of commit 97c9673 with merge base b36d6b6 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2025-08-13T17:53:26Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

## Context Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like ``` shader globalwg size localwg size =========== ===================== ==================== ============= linear_qga4w_tiled_texture3d_texture3d_texture2d_float {256, 29, 1} {32, 2, 1} 1487 matmul_naive_texture3d_float {29, 115, 32} {4, 2, 8} 712 ``` for matrix multiplication shaders. This behaviour was introduced in D64418632 / pytorch#6409. However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor. If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded. This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders. ## Changes * Introduce `pick_hw_square_wg_size` * Use the new local work group size determination function for Quantized Linear, Matmul, and Linear Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

pytorchbot requested a review from SS-JIA as a code owner August 13, 2025 17:52

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 13, 2025

SS-JIA approved these changes Aug 13, 2025

View reviewed changes

SS-JIA merged commit f95a3f7 into main Aug 13, 2025
96 of 104 checks passed

SS-JIA deleted the gh/SS-JIA/272/orig branch August 13, 2025 18:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ET-VK] Better work group sizes for matmul #13378

[ET-VK] Better work group sizes for matmul #13378

Uh oh!

pytorchbot commented Aug 13, 2025

Uh oh!

pytorch-bot bot commented Aug 13, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Aug 13, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[ET-VK] Better work group sizes for matmul #13378

[ET-VK] Better work group sizes for matmul #13378

Uh oh!

Conversation

pytorchbot commented Aug 13, 2025

Uh oh!

pytorch-bot bot commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13378

⏳ No Failures, 87 Pending

Uh oh!

github-actions bot commented Aug 13, 2025

This PR needs a release notes: label

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pytorch-bot bot commented Aug 13, 2025 •

edited

Loading

This PR needs a `release notes:` label