For matrix multiplication operations with quantized models, is the fp32 src1 first quantized to Q8_1? #11743
Unanswered · BishmoyPaul asked this question in Q&A · Replies: 1 comment, 1 reply
I was looking into `ggml_cuda_mul_mat` (ggml-cuda.cu L1844) to understand how it works for quantized models. It seems that when src1 (which is usually the input/hidden state) is in FP32, it is first converted to Q8_1 before the actual operation; for example, at L1899 `quantize_row_q8_1_cuda` is passed as an argument.

Am I correct in assuming src1 is indeed quantized to Q8_1? If so, why use Q8_1 for src1? For simpler quantizations like Q4_0, where the model weights are Q4_0-quantized, why convert src1 to Q8_1 rather than Q8_0?
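For reference, here is a simplified sketch of the two 8-bit block layouts being compared; field names and exact types are approximate and have changed across ggml versions (the real definitions live in ggml's headers):

```c
// Simplified sketch of the Q8_0 vs Q8_1 block layouts (approximate; see
// ggml's headers for the actual definitions).
#include <stdint.h>

typedef uint16_t fp16_t;   // placeholder for ggml's fp16 storage type

#define QK8_0 32
typedef struct {
    fp16_t d;              // per-block scale (delta)
    int8_t qs[QK8_0];      // 32 signed 8-bit quants
} block_q8_0;

#define QK8_1 32
typedef struct {
    fp16_t d;              // per-block scale (delta)
    fp16_t s;              // precomputed per-block sum term (not present in Q8_0;
                           // exact definition varies by version)
    int8_t qs[QK8_1];      // 32 signed 8-bit quants
} block_q8_1;
```

As far as I can tell, the only structural difference is that Q8_1 additionally carries a per-block sum alongside the scale, so part of what I'm asking is whether that extra term is the reason the activations are quantized to Q8_1 rather than Q8_0.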
Reply:

@BishmoyPaul Excuse me, I saw that you posted a successful reproduction result on the cap4video forum. Could you share the exact commands you ran? The results I reproduced were very poor, and I would really appreciate your help.