
OpenCL: add mul_mat_f16_f32_image kernel #14635


Open
wants to merge 3 commits into master

Conversation

rmatif (Collaborator) commented on Jul 11, 2025

This PR adds a new mul_mat_f16_f32_image kernel that uses image2d_t to take advantage of the L1 texture cache. For now, it completely overrides my previous generic tiled kernel, but I think the tiled one may still be useful when targeting other vendors, since this approach is very specific to mobile GPUs.
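For illustration, here is a minimal sketch of the idea (not the exact kernel in this PR; the image packing and output layout here are assumptions made for the example):

```c
// Minimal sketch, not the PR kernel: the f16 weight matrix is packed as a
// CL_RGBA / CL_HALF_FLOAT image (4 weights per texel), so each read_imagef
// is served through the L1 texture cache and returns a float4.
__kernel void mul_mat_f16_f32_image_sketch(
        __read_only image2d_t src0,   // M x K weights, K/4 texels per row
        __global const float *src1,   // activations, assumed column-contiguous
        __global float *dst,
        int M, int N, int K) {
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    int row = get_global_id(0);       // output row in [0, M)
    int col = get_global_id(1);       // output column in [0, N)
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K / 4; ++k) {
        float4 w = read_imagef(src0, smp, (int2)(k, row)); // one cached texel fetch
        float4 x = vload4(k, src1 + (size_t)col * K);      // 4 matching activations
        acc += dot(w, x);
    }
    dst[(size_t)col * M + row] = acc; // hypothetical layout for the sketch
}
```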

I think we could gain an additional 10% in performance by avoiding branch creation in the kernel and handling the case where K is divisible by 4 in a separate path. For now, we can keep it as is.
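The split could look roughly like this (hypothetical kernel, shown only to illustrate the two compile paths; the real kernel would keep its tiling):

```c
// Hypothetical illustration: build the program twice, once with
// -DK_MULTIPLE_OF_4, so the common case carries no tail branch in the hot loop.
__kernel void dot_rows_sketch(__global const float *a,
                              __global const float *b,
                              __global float *out, int K) {
    int row = get_global_id(0);
    __global const float *ar = a + (size_t)row * K;
    float acc = 0.0f;
#ifdef K_MULTIPLE_OF_4
    for (int k = 0; k < K; k += 4)       // fast path, branch-free body
        acc += dot(vload4(0, ar + k), vload4(0, b + k));
#else
    int k4 = K & ~3;
    for (int k = 0; k < k4; k += 4)      // vectorized main loop
        acc += dot(vload4(0, ar + k), vload4(0, b + k));
    for (int k = k4; k < K; ++k)         // scalar tail for K % 4 != 0
        acc += ar[k] * b[k];
#endif
    out[row] = acc;
}
```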

I will try to use a similar approach for conv2d, though it requires additional tricks. I think we're in good shape for large matrix multiplies for now; next I will look for improvements in GEMV and the cases this kernel does not address.

Performance on Adreno 830:

| Version | m    | n   | k     | Performance |
|---------|------|-----|-------|-------------|
| Master  | 4096 | 512 | 14336 | 375 GFLOPS  |
| This PR | 4096 | 512 | 14336 | 1.27 TFLOPS |

Master:

model size params backend ngl test t/s
llama 1B F16 2.30 GiB 1.24 B OpenCL 99 pp512 168.17 ± 0.41
llama 1B F16 2.30 GiB 1.24 B OpenCL 99 tg128 22.61 ± 0.02

This PR:

model size params backend ngl test t/s
llama 1B F16 2.30 GiB 1.24 B OpenCL 99 pp512 193.79 ± 2.68
llama 1B F16 2.30 GiB 1.24 B OpenCL 99 tg128 22.67 ± 0.04

rmatif assigned max-krasnyansky and lhez, then unassigned them, on Jul 11, 2025
rmatif requested review from lhez and max-krasnyansky on Jul 11, 2025
The github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels on Jul 11, 2025
rmatif (Collaborator, Author) commented on Jul 11, 2025

The fact that benchmarking this single op shows a 3.4x speedup, while end-to-end pp performance only improved by 1.14x, indicates that the kernel spends more time waiting for data than on computation. This suggests the pipeline is currently I/O-bound rather than compute-bound, and that we are approaching the architectural limits of what OpenCL can offer on this hardware.

Once a kernel finishes, its __local memory is discarded. There is no way to chain operations, so every subsequent step forces a round-trip to global memory. Our only remaining path to a significant performance leap is to implement FA (flash attention) and to aggressively fuse multiple ops into a single, monolithic kernel. For me this is the only way to overcome the memory bandwidth bottleneck.
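As a toy illustration of the fusion idea (hypothetical elementwise kernel; the real wins would come from fusing much heavier ops):

```c
// Hypothetical fusion sketch: applying the activation in the same kernel
// that produced the value keeps the intermediate in registers, saving one
// global-memory write plus one read versus two separate kernel launches.
__kernel void mul_silu_fused_sketch(__global const float *a,
                                    __global const float *b,
                                    __global float *out, int n) {
    int i = get_global_id(0);
    if (i >= n) return;
    float x = a[i] * b[i];          // op 1: elementwise multiply
    out[i] = x / (1.0f + exp(-x));  // op 2: SiLU, fused, no round-trip
}
```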

lhez (Collaborator) commented on Jul 15, 2025

Qwen2.5-3B F16 gives me gibberish output with this PR (on Adreno 750 and 830). The tiled version (in master) works well. Does Llama 1B F16 produce proper output with this PR?

rmatif (Collaborator, Author) commented on Jul 15, 2025

> Qwen2.5-3B F16 gives me gibberish output with this PR (on Adreno 750 and 830). The tiled version (in master) works well. Does Llama 1B F16 produce proper output with this PR?

@lhez Well, that's really annoying. It happened on all models with long prompt sizes, which I had omitted from my testing. It was due to an incorrect global work size calculation that caused the kernel to leave part of the output uncomputed.
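For reference, the class of bug looks like this on the host side (hypothetical helper, not the actual code): the global work size has to be rounded up to a multiple of the local size, with the kernel bounds-checking the overshoot; truncating it instead silently drops the last partial tiles.

```c
#include <CL/cl.h>

// Hypothetical helper illustrating the fix: round the global work size UP
// to a multiple of the local size and let the kernel bounds-check,
// instead of rounding down and skipping part of the output.
static void enqueue_mul_mat(cl_command_queue queue, cl_kernel kernel,
                            size_t M, size_t N) {
    const size_t local[2]  = { 16, 16 };           // example work-group shape
    const size_t global[2] = {
        (M + local[0] - 1) / local[0] * local[0],  // ceil to multiple of 16
        (N + local[1] - 1) / local[1] * local[1],
    };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                           0, NULL, NULL);
}
```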

Now that it's fixed, the performance is on par with the existing tiled kernel. The fact that the two are converging to the same performance ceiling strongly suggests we're hitting the memory bandwidth bottleneck.

model size params backend ngl test t/s
llama 1B F16 2.30 GiB 1.24 B OpenCL 99 pp512 166.45 ± 11.52
llama 1B F16 2.30 GiB 1.24 B OpenCL 99 tg128 22.44 ± 0.09

What do you think? Is there anything we can try, or do you think this new kernel is now redundant and we should drop it and close the PR?
