low performance of a simple ggml_mul_mat between GGML_TYPE_F16 and GGML_TYPE_Q4_0 #964
Replies: 1 comment
-
output: {'ggml_type': 2, 'shape': [8192, 8192], 'bad_offset': 548601856, 'item_type': <class 'numpy.uint8'>, 'item_count': 37748736, 'np_dims': (8192, 4608), 'offset': 549368736}
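As a sanity check on the dump above: Q4_0 packs each block of 32 weights into 18 bytes (a 2-byte f16 scale followed by 16 bytes of 4-bit quants), which is consistent with both the `item_count` and the second `np_dims` axis for an 8192x8192 tensor. A minimal sketch of the arithmetic (the variable names here are mine, not from the dump):

```python
QK4_0 = 32        # weights per Q4_0 block
BLOCK_BYTES = 18  # 2-byte f16 scale + 16 bytes of packed 4-bit quants

n_rows, n_cols = 8192, 8192

# Total packed size of the tensor in bytes.
total_bytes = n_rows * n_cols // QK4_0 * BLOCK_BYTES

# Packed bytes per row, i.e. the second np_dims axis when the
# raw data is viewed as a 2-D uint8 array.
bytes_per_row = n_cols // QK4_0 * BLOCK_BYTES

print(total_bytes)    # → 37748736, matching item_count
print(bytes_per_row)  # → 4608, matching np_dims[1]
```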
-
I perform a simple ggml_mul_mat between a GGML_TYPE_F16 tensor and a GGML_TYPE_Q4_0 tensor.
Profiling the process, I found that the GGML_TYPE_Q4_0 tensor is dequantized first and then multiplied, which is very slow.
How can I use the optimized CUDA kernel?
ggml_mm.cpp
test.py
my_gguf.py
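For reference, the dequantize step the profiler is showing can be reproduced in numpy. This is a sketch based on my understanding of the Q4_0 block layout (18-byte blocks: a little-endian f16 scale, then 16 bytes where the low nibbles hold elements 0..15 and the high nibbles hold elements 16..31, each nibble offset by 8); the function name is mine, not part of ggml:

```python
import numpy as np

QK4_0 = 32        # weights per Q4_0 block
BLOCK_BYTES = 18  # 2-byte f16 scale + 16 bytes of packed nibbles

def dequantize_q4_0(raw: np.ndarray, n_rows: int, n_cols: int) -> np.ndarray:
    """Dequantize a flat uint8 Q4_0 buffer to a float32 (n_rows, n_cols) matrix."""
    blocks = raw.reshape(-1, BLOCK_BYTES)
    # Per-block f16 scale, stored in the first two bytes of each block.
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)
    qs = blocks[:, 2:]
    # Low nibbles are elements 0..15, high nibbles are elements 16..31,
    # each stored with a +8 offset.
    lo = (qs & 0x0F).astype(np.int8) - 8
    hi = (qs >> 4).astype(np.int8) - 8
    vals = np.concatenate([lo, hi], axis=1).astype(np.float32) * d
    return vals.reshape(n_rows, n_cols)
```

Dequantizing the full 8192x8192 tensor like this before the matmul is exactly the slow path; the fast CUDA path avoids materializing the float matrix at all.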