OpenCL: add mul_mat_f16_f32_image kernel #14635
Conversation
The fact that benchmarking this single op shows a 3.4x speedup, while end-to-end pp performance only improved by 1.14x, indicates that the kernel spends more time waiting for data than it does on computation. This suggests that the pipeline is currently I/O-bound rather than compute-bound, and that we are approaching the architectural limits of what OpenCL can offer on this hardware. Once a kernel finishes, its …
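A back-of-the-envelope Amdahl's-law check (my arithmetic, not from the discussion) is consistent with this reading: if this op accounts for a fraction $f$ of end-to-end pp time, then

$$
S_{\text{e2e}} = \frac{1}{(1-f) + f/3.4} = 1.14 \quad\Rightarrow\quad f \approx 0.17
$$

i.e. only about 17% of end-to-end time was spent in this op to begin with, so even a 3.4x per-op win can only move pp throughput modestly.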
@lhez Well, that's really annoying; this happened on all models for long prompt sizes, which I had omitted in my testing. It was due to an incorrect global work size calculation that caused the kernel not to compute the full output. Now that it's fixed, performance is on par with the existing tiled kernel. The fact that the two converge to the same performance ceiling strongly suggests we're hitting the memory bandwidth bottleneck.

What do you think? Is there anything we can try, or do you think this new kernel is irrelevant and we should drop it and close the PR?
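For context, a hypothetical illustration of the bug class described above (illustrative names, not the PR's actual code): computing the global work size with truncating division leaves a tail of the output uncovered, while rounding up to a multiple of the local work size covers everything.

```c
#include <stdio.h>

// Round x up to the next multiple of `multiple` -- the usual fix for a
// global work size that must cover N outputs with a fixed local size.
static size_t round_up(size_t x, size_t multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

int main(void) {
    const size_t N   = 1000; // outputs to cover in this dimension
    const size_t lws = 64;   // local work size in this dimension

    size_t wrong = (N / lws) * lws;  // truncates: covers only 960 items
    size_t right = round_up(N, lws); // rounds up: covers 1024 items

    printf("wrong gws = %zu (misses %zu outputs)\n", wrong, N - wrong);
    printf("right gws = %zu (kernel bounds-checks the overshoot)\n", right);
    return 0;
}
```

The rounded value is what would be passed as `global_work_size` to `clEnqueueNDRangeKernel`, with the kernel returning early for work-items past the edge of the output.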
This PR adds a new `mul_mat_f16_f32_image` kernel that uses `image2d_t` to take advantage of the L1 texture cache. For now, it completely overrides my previous generic tiled kernel, but I think the tiled one may still be useful when targeting other vendors, as this approach is very specific to mobile GPUs.

I think we could gain an additional ~10% in performance by avoiding branches in the kernel and handling the case where K is divisible by 4 in a separate path. For now, we can keep it as is.
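A minimal sketch of the technique (not the PR's actual kernel; names and layouts are assumptions): matrix A is packed into a CL_RGBA/CL_HALF_FLOAT image so each texel carries four consecutive f16 values along K, and every read of A goes through the texture path and can hit the L1 texture cache.

```c
// Sketch only: assumes A is an M x (K/4) CL_RGBA/CL_HALF_FLOAT image and
// that K % 4 == 0 (a real kernel needs the tail path discussed above).
__kernel void mul_mat_f16_f32_image_sketch(
        __read_only image2d_t A,  // f16 matrix, 4 halves per texel
        __global const float *B,  // K x N, column-major in this sketch
        __global float       *C,  // M x N, column-major in this sketch
        const int M, const int N, const int K) {
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    const int m = (int)get_global_id(0);
    const int n = (int)get_global_id(1);
    if (m >= M || n >= N) return;

    float acc = 0.0f;
    for (int k4 = 0; k4 < K / 4; ++k4) {
        // read_imagef converts the four halves to float4 in flight.
        float4 a = read_imagef(A, smp, (int2)(k4, m));
        float4 b = vload4(0, B + (size_t)n * K + (size_t)k4 * 4);
        acc += dot(a, b);
    }
    C[(size_t)n * M + m] = acc;
}
```

The point is the access path rather than the tiling: routing reads of A through the image/texture pipe is what lets them hit the L1 texture cache that plain buffer loads may bypass, which is why this helps even though the arithmetic is unchanged.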
I will try to use a similar approach for `conv2d`, though it requires additional tricks. I think we're now in good shape for large matrix multiplies; next I will look for improvements in GEMV and in the cases this kernel does not address.

Performance on Adreno 830:
Master:
This PR: