Looking at the ggml code:
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L1675

Why do all the quantized dot products encode working (non-model) tensors into q8 first? Isn't this more work and memory bandwidth than just using the F16 or F32 values directly?

Thanks!

Replies: 1 comment
These tensors are usually very small, and the cost of quantizing them is low compared to the overall cost of the matrix multiplication. For example, during generation with llama 7B, most matrix multiplications apply a large weight matrix to a single activation vector (on the order of 4096×4096 times 4096×1), so the tensor being quantized is tiny relative to the weights.
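To make the trade-off concrete, here is a minimal sketch of the pattern the question is asking about, under illustrative assumptions: a 32-element block layout, and simplified stand-in names (`block_q8`, `quantize_row_q8`, `vec_dot_q8`) rather than ggml's actual structs and functions. For simplicity both sides are q8 here; in ggml the weight side is typically a lower-bit format (e.g. q4) with a compatible block layout. The key point is that the quantization pass touches the activation row once, O(n), while the dot-product inner loop runs in the integer domain where cheap SIMD int8 instructions apply.

```c
// Sketch only -- not ggml's real implementation or API.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK 32  // assumed block size for this sketch

typedef struct {
    float  scale;     // per-block scale factor
    int8_t q[BLOCK];  // quantized values
} block_q8;

// One O(n) pass over the row: find each block's absmax, scale into int8.
static void quantize_row_q8(const float *x, block_q8 *out, int n) {
    for (int b = 0; b < n / BLOCK; b++) {
        float amax = 0.0f;
        for (int i = 0; i < BLOCK; i++) {
            float v = fabsf(x[b * BLOCK + i]);
            if (v > amax) amax = v;
        }
        float scale = amax / 127.0f;
        float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;
        for (int i = 0; i < BLOCK; i++) {
            out[b].q[i] = (int8_t)roundf(x[b * BLOCK + i] * inv);
        }
    }
}

// Dot product in the integer domain: int8*int8 accumulation per block,
// with one float multiply per block to apply the two scales.
static float vec_dot_q8(const block_q8 *a, const block_q8 *b, int n) {
    float sum = 0.0f;
    for (int blk = 0; blk < n / BLOCK; blk++) {
        int32_t acc = 0;
        for (int i = 0; i < BLOCK; i++) {
            acc += (int32_t)a[blk].q[i] * (int32_t)b[blk].q[i];
        }
        sum += (float)acc * a[blk].scale * b[blk].scale;
    }
    return sum;
}

int main(void) {
    enum { N = 64 };
    float a[N], b[N];
    for (int i = 0; i < N; i++) {
        a[i] = sinf(0.1f * i);
        b[i] = cosf(0.05f * i);
    }

    block_q8 qa[N / BLOCK], qb[N / BLOCK];
    quantize_row_q8(a, qa, N);  // done once per activation row
    quantize_row_q8(b, qb, N);

    float ref = 0.0f;
    for (int i = 0; i < N; i++) ref += a[i] * b[i];

    printf("f32 dot: %f  q8 dot: %f\n", ref, vec_dot_q8(qa, qb, N));
    return 0;
}
```

In a full matrix multiplication, `quantize_row_q8` runs once per activation row while `vec_dot_q8` runs once per weight row, so the O(n) quantization pass is amortized over O(n·m) dot-product work. That is why the cost is negligible in the single-vector generation case described above.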