More BitNet-like IQ1_S quantizations #8722
Replies: 2 comments
-
Coincidentally, the self-attention weights in the official Llama-3.1-405B FP8 model aren't quantized either. They kept the embedding and output tensors in BF16 precision as well, together with the entirety of the first and last layers. Excerpt below from the official technical report:
-
For the 3B BitNet 1.58b model, in descending order of space taken, the parameters are: Feed-Forward Weights (68%), Attention Weights (32%).
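That 68/32 split is roughly what the tensor shapes predict. A quick back-of-the-envelope check, assuming the usual LLaMA-style 3B shapes (hidden size 3200, FFN size 8640, 26 layers, full multi-head attention); these shapes are my assumption, not taken from the breakdown above:

```python
# Per-layer linear weight counts for a LLaMA-style 3B model.
# Assumed shapes: d_model=3200, d_ff=8640, 26 layers, SwiGLU FFN, no GQA.
d_model, d_ff, n_layers = 3200, 8640, 26

attn_per_layer = 4 * d_model * d_model   # wq, wk, wv, wo
ffn_per_layer  = 3 * d_model * d_ff      # gate, up, down projections

total_attn = n_layers * attn_per_layer
total_ffn  = n_layers * ffn_per_layer
total      = total_attn + total_ffn

print(f"Feed-forward weights: {total_ffn/1e9:.2f}B ({100*total_ffn/total:.0f}%)")
print(f"Attention weights:    {total_attn/1e9:.2f}B ({100*total_attn/total:.0f}%)")
# Prints roughly a 67%/33% split (embeddings and norms excluded),
# in line with the 68%/32% breakdown above.
```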
-
According to the original 1-bit BitNet paper, a fair number of model layers don't get quantized to low precision:
If I understand correctly, the following weights remain in 8-bit (or more):
Indeed, in the subsequent BitNet b1.58 paper, which uses ternary values, they only claim a 7.16x reduction in memory for Llama-2 70B (simulated results), i.e. just above 18GB, supporting the idea that much of the model still remains in relatively high precision.
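As a rough sanity check (my own arithmetic, assuming a nominal 70B parameter count and an FP16 baseline of 2 bytes per weight):

$$
\frac{70 \times 10^{9} \times 2\ \text{bytes}}{7.16} \approx 19.6\ \text{GB} \approx 18.2\ \text{GiB}
$$

which does land just above 18 GiB.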
In llama.cpp, the lowest-precision quantizations (IQ1_S) for Llama 3.1 70B use the precision scheme on the left in the image below (from HuggingFace), but what if they were quantized according to the one on the right? Files would be larger, but perhaps quality would be substantially better?
EDIT: I'm not entirely sure whether the `attn_output.weight` would be quantized to low precision as well in BitNet b1.58; if not, memory usage would of course increase further. I can calculate roughly a 7.1x memory reduction for Llama-2-70B with the above scheme, for what it's worth.
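Since the exact per-tensor bit assignment isn't spelled out above, here is a rough estimator sketch (the Llama-2-70B shapes are the published ones; the bit widths are my assumptions) that makes it easy to try different schemes:

```python
# Back-of-the-envelope memory estimate for Llama-2-70B under a BitNet-like scheme.
# Shapes are the published Llama-2-70B ones; the bit-width assignment below is an
# assumption (ternary weights packed at 2 bits, embeddings/output head at 8 bits).
d_model, d_ff, n_layers = 8192, 28672, 80
n_heads, n_kv_heads, vocab = 64, 8, 32000
kv_dim = n_kv_heads * (d_model // n_heads)

attn  = n_layers * (2 * d_model * d_model + 2 * d_model * kv_dim)  # wq, wo + wk, wv (GQA)
ffn   = n_layers * 3 * d_model * d_ff                              # gate, up, down
embed = 2 * vocab * d_model                                        # token embeddings + output head

fp16_bytes  = 2 * (attn + ffn + embed)
quant_bytes = (attn + ffn) * 2 / 8 + embed * 8 / 8  # 2-bit linears, 8-bit embeddings

print(f"FP16 baseline : {fp16_bytes / 2**30:6.1f} GiB")
print(f"BitNet-like   : {quant_bytes / 2**30:6.1f} GiB")
print(f"Reduction     : {fp16_bytes / quant_bytes:.1f}x")
# With these assumptions this lands around a 7-8x reduction. Using the theoretical
# 1.58 bits per ternary weight instead of 2 pushes it closer to 10x, while keeping
# attn_output.weight at 8 bits pulls it back down.
```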