More BitNet-like IQ1_S quantizations #8722
Replies: 2 comments
-
Coincidentally, the self-attention weights in the official Llama-3.1-405B FP8 model aren't quantized either. They kept the embedding and output tensors in BF16 precision as well, together with the entirety of the first and last layers. Excerpt below from the official technical report:
-
For the 3B BitNet 1.58b model, in descending order of space taken, the parameters are: Feed-Forward Weights (68%), Attention Weights (32%).
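That 68/32 split is roughly what the tensor shapes predict. A quick back-of-the-envelope check, assuming the usual LLaMA-style 3B shapes (hidden size 3200, FFN size 8640, 26 layers, full multi-head attention); these shapes are my assumption, not taken from the breakdown above:

```python
# Per-layer linear weight counts for a LLaMA-style 3B model.
# Assumed shapes: d_model=3200, d_ff=8640, 26 layers, SwiGLU FFN, no GQA.
d_model, d_ff, n_layers = 3200, 8640, 26

attn_per_layer = 4 * d_model * d_model   # wq, wk, wv, wo
ffn_per_layer  = 3 * d_model * d_ff      # gate, up, down projections

total_attn = n_layers * attn_per_layer
total_ffn  = n_layers * ffn_per_layer
total      = total_attn + total_ffn

print(f"Feed-forward weights: {total_ffn/1e9:.2f}B ({100*total_ffn/total:.0f}%)")
print(f"Attention weights:    {total_attn/1e9:.2f}B ({100*total_attn/total:.0f}%)")
# Prints roughly a 67%/33% split (embeddings and norms excluded),
# in line with the 68%/32% breakdown above.
```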
-
According to the original 1-bit BitNet paper, a fair number of model layers don't get quantized to low precision:
If I understand correctly, the following weights remain in 8-bit (or more):
Indeed, in the subsequent BitNet b1.58 paper, which uses ternary values, they only claim a 7.16x reduction in memory for Llama-2 70B (simulated results), i.e. just above 18GB, supporting the idea that much of the model still remains in relatively high precision.
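As a rough sanity check (my own arithmetic, assuming a nominal 70B parameter count and an FP16 baseline of 2 bytes per weight):

$$
\frac{70 \times 10^{9} \times 2\ \text{bytes}}{7.16} \approx 19.6\ \text{GB} \approx 18.2\ \text{GiB}
$$

which does land just above 18 GiB.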
In llama.cpp, the lowest-precision quantizations (IQ1_S) for Llama 3.1 70B use the precision scheme on the left in the image below (from HuggingFace), but what if they were quantized according to the one on the right? Files would be larger, but perhaps quality would be substantially better?
EDIT: I'm not entirely sure whether the `attn_output.weight` would be quantized to low precision as well in BitNet b1.58; if not, memory usage would of course increase further. I can calculate roughly a 7.1x memory reduction for Llama-2-70B with the above scheme, for what it's worth.
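Since the exact per-tensor bit assignment isn't spelled out above, here is a rough estimator sketch (the Llama-2-70B shapes are the published ones; the bit widths are my assumptions) that makes it easy to try different schemes:

```python
# Back-of-the-envelope memory estimate for Llama-2-70B under a BitNet-like scheme.
# Shapes are the published Llama-2-70B ones; the bit-width assignment below is an
# assumption (ternary weights packed at 2 bits, embeddings/output head at 8 bits).
d_model, d_ff, n_layers = 8192, 28672, 80
n_heads, n_kv_heads, vocab = 64, 8, 32000
kv_dim = n_kv_heads * (d_model // n_heads)

attn  = n_layers * (2 * d_model * d_model + 2 * d_model * kv_dim)  # wq, wo + wk, wv (GQA)
ffn   = n_layers * 3 * d_model * d_ff                              # gate, up, down
embed = 2 * vocab * d_model                                        # token embeddings + output head

fp16_bytes  = 2 * (attn + ffn + embed)
quant_bytes = (attn + ffn) * 2 / 8 + embed * 8 / 8  # 2-bit linears, 8-bit embeddings

print(f"FP16 baseline : {fp16_bytes / 2**30:6.1f} GiB")
print(f"BitNet-like   : {quant_bytes / 2**30:6.1f} GiB")
print(f"Reduction     : {fp16_bytes / quant_bytes:.1f}x")
# With these assumptions this lands around a 7-8x reduction. Using the theoretical
# 1.58 bits per ternary weight instead of 2 pushes it closer to 10x, while keeping
# attn_output.weight at 8 bits pulls it back down.
```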