OpenCL: add mul_mat_f16_f32_image kernel #14635
Conversation
The fact that benchmarking this single op shows a 3.4x speedup, while end-to-end pp performance only improved by 1.14x, indicates that the kernel spends more time waiting for data than it does on computation. This suggests that the pipeline is currently I/O-bound rather than compute-bound, and that we are approaching the architectural limits of what OpenCL can offer on this hardware. Once a kernel finishes, its …
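A back-of-the-envelope Amdahl's-law check (my arithmetic, not from the discussion) is consistent with this reading: if this op accounts for a fraction $f$ of end-to-end pp time, then

$$
S_{\text{e2e}} = \frac{1}{(1-f) + f/3.4} = 1.14 \quad\Rightarrow\quad f \approx 0.17
$$

i.e. only about 17% of end-to-end time was spent in this op to begin with, so even a 3.4x per-op win can only move pp throughput modestly.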
@lhez Well, that's really annoying; this happened on all models for long prompt sizes, which I had omitted in my testing. It was due to an incorrect global work size calculation that caused the kernel not to compute the full output. Now that it's fixed, performance is on par with the existing tiled kernel. The fact that the two converge to the same performance ceiling strongly suggests we're hitting the memory bandwidth bottleneck.

What do you think? Is there anything we can try, or do you think this new kernel is irrelevant and we should drop it and close the PR?
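For context, a hypothetical illustration of the bug class described above (illustrative names, not the PR's actual code): computing the global work size with truncating division leaves a tail of the output uncovered, while rounding up to a multiple of the local work size covers everything.

```c
#include <stdio.h>

// Round x up to the next multiple of `multiple` -- the usual fix for a
// global work size that must cover N outputs with a fixed local size.
static size_t round_up(size_t x, size_t multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

int main(void) {
    const size_t N   = 1000; // outputs to cover in this dimension
    const size_t lws = 64;   // local work size in this dimension

    size_t wrong = (N / lws) * lws;  // truncates: covers only 960 items
    size_t right = round_up(N, lws); // rounds up: covers 1024 items

    printf("wrong gws = %zu (misses %zu outputs)\n", wrong, N - wrong);
    printf("right gws = %zu (kernel bounds-checks the overshoot)\n", right);
    return 0;
}
```

The rounded value is what would be passed as `global_work_size` to `clEnqueueNDRangeKernel`, with the kernel returning early for work-items past the edge of the output.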
This PR adds a new `mul_mat_f16_f32_image` kernel that uses `image2d_t` to take advantage of the L1 texture cache. For now, it completely overrides my previous generic tiled kernel, but I think the tiled one may still be useful when targeting other vendors, as this approach is very specific to mobile GPUs.

I think we could gain an additional ~10% in performance by avoiding branches in the kernel and handling the case where K is divisible by 4 in a separate path. For now, we can keep it as is.
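A minimal sketch of the technique (not the PR's actual kernel; names and layouts are assumptions): matrix A is packed into a CL_RGBA/CL_HALF_FLOAT image so each texel carries four consecutive f16 values along K, and every read of A goes through the texture path and can hit the L1 texture cache.

```c
// Sketch only: assumes A is an M x (K/4) CL_RGBA/CL_HALF_FLOAT image and
// that K % 4 == 0 (a real kernel needs the tail path discussed above).
__kernel void mul_mat_f16_f32_image_sketch(
        __read_only image2d_t A,  // f16 matrix, 4 halves per texel
        __global const float *B,  // K x N, column-major in this sketch
        __global float       *C,  // M x N, column-major in this sketch
        const int M, const int N, const int K) {
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    const int m = (int)get_global_id(0);
    const int n = (int)get_global_id(1);
    if (m >= M || n >= N) return;

    float acc = 0.0f;
    for (int k4 = 0; k4 < K / 4; ++k4) {
        // read_imagef converts the four halves to float4 in flight.
        float4 a = read_imagef(A, smp, (int2)(k4, m));
        float4 b = vload4(0, B + (size_t)n * K + (size_t)k4 * 4);
        acc += dot(a, b);
    }
    C[(size_t)n * M + m] = acc;
}
```

The point is the access path rather than the tiling: routing reads of A through the image/texture pipe is what lets them hit the L1 texture cache that plain buffer loads may bypass, which is why this helps even though the arithmetic is unchanged.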
I will try to use a similar approach for `conv2d`, though it requires additional tricks. I think we're now in good shape for large matrix multiplies; next I will look for improvements in GEMV and in the cases this kernel does not address.

Performance on Adreno 830:
Master:
This PR: