Replies: 1 comment
-
This isn't a full answer, but hopefully it will still help you. The quantized data isn't just a list of 3 or 4 bit numbers. The items are divided into chunks, and each chunk contains some additional data, like the scale of the items in that chunk. Without the scales, representing those items with only 3, 4, etc. bits would be much less accurate. To put it a different way, keeping track of the range of numbers that exist in a chunk means those 3 or 4 bit values can be relative to the range of values in that chunk, rather than relative to the range of values it's possible to express with a 16 bit float. Generally speaking, this approach applies to all the quantizations, not just the 3/4 bit ones. You can look at https://github.com/ggerganov/llama.cpp/blob/master/k_quants.h to see the definitions for the blocks if you're able to read a little C.
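To make the chunk-plus-scale idea concrete, here's a minimal, self-contained sketch in the spirit of llama.cpp's 4 bit block quantization. It is not the actual llama.cpp code: the names (`block_q4`, `quantize_block`, `dequantize_block`, `BLOCK_SIZE`) are made up for illustration, and it stores the scale as a plain `float` rather than fp16 to keep the example simple; the real block definitions live in k_quants.h and ggml.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define BLOCK_SIZE 32  /* illustrative block size; not taken from the real headers */

/* One quantized block: a per-block scale plus 32 4-bit values packed
   two per byte. This is a simplified stand-in for the real structs. */
typedef struct {
    float   d;                    /* per-block scale */
    uint8_t qs[BLOCK_SIZE / 2];   /* 4-bit quants, two per byte */
} block_q4;

/* Quantize 32 floats into one block: find the element with the largest
   magnitude, derive a scale so that element maps to -8, then round every
   element to a 4-bit code in [0, 15] (stored with an offset of 8). */
static void quantize_block(const float *x, block_q4 *b) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    b->d = max / -8.0f;
    const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < BLOCK_SIZE / 2; i++) {
        int q0 = (int)(x[2*i + 0] * id + 8.5f);
        int q1 = (int)(x[2*i + 1] * id + 8.5f);
        if (q0 > 15) q0 = 15; if (q0 < 0) q0 = 0;
        if (q1 > 15) q1 = 15; if (q1 < 0) q1 = 0;
        b->qs[i] = (uint8_t)(q0 | (q1 << 4));
    }
}

/* Dequantize: each 4-bit code is shifted back by 8 and multiplied by the
   block's scale, so the result is relative to this block's own range. */
static void dequantize_block(const block_q4 *b, float *y) {
    for (int i = 0; i < BLOCK_SIZE / 2; i++) {
        y[2*i + 0] = ((int)(b->qs[i] & 0x0F) - 8) * b->d;
        y[2*i + 1] = ((int)(b->qs[i] >> 4)   - 8) * b->d;
    }
}

int main(void) {
    float x[BLOCK_SIZE], y[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) x[i] = sinf((float)i);

    block_q4 b;
    quantize_block(x, &b);
    dequantize_block(&b, y);

    for (int i = 0; i < 4; i++)
        printf("x[%d] = %+.4f  ->  %+.4f\n", i, x[i], y[i]);
    return 0;
}
```

The point to notice is that the 4 bit codes only need to cover the range of this one block (scaled to roughly [-8, 7]), not the full range of a 16 bit float, which is what makes such aggressive quantization usable in practice.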
-
Hi Team,
Thank you for your great contribution.
I have some questions about 3/4 bit quantized inference; it would be great if you could help me understand.
My questions are:
1.) Does inference for 3/4 bit models actually happen in 3/4 bit, or is the data cast to FP16 during inference on CPU? In other words, is there a CPU kernel that runs inference on 3/4 bit data?
2.) Which quantization technique is being used?
Thanks