LLaMA-3.2 quantization evaluation #63
ikawrakow started this conversation in Show and tell
Replies: 2 comments
-
Here are some performance numbers for the 1B model on a Ryzen-7950X CPU.
-
Here are some performance numbers for the 3B model on a Ryzen-7950X CPU.
LLaMA-3.2 is out. llama.cpp does not yet support the vision models, so this post focuses on the 1B and 3B text models, which could be very handy for local usage on low-end devices. The models are small enough even at full precision (bf16), but I think it is still interesting to look at quantization, as token generation is significantly faster with quantized models.

To reproduce the results reported here

Perplexity
Perplexity (PPL in what follows) is not the best measure to compare different models, but it is extremely useful when comparing a quantized version of a model to the same full precision model. In the graphs below I use the quantization error defined as

quantization error = PPL(Q)/PPL(bf16) - 1

where PPL(Q) is the perplexity of quantization Q and PPL(bf16) is the perplexity of the full model (the 3.2 models are released as bf16, so I use bf16 throughout, as bf16 support has been added here in PRs #39, #40, #41, #56).
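To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the original post); the function name and the perplexity values are hypothetical placeholders, not measured numbers.

```python
def quantization_error(ppl_q: float, ppl_bf16: float) -> float:
    """Relative increase in perplexity of a quantized model over the bf16 model."""
    return ppl_q / ppl_bf16 - 1.0

# Hypothetical placeholder values, not measured numbers:
ppl_bf16 = 10.00   # perplexity of the bf16 model
ppl_q = 10.35      # perplexity of some quantization Q
print(f"quantization error: {100.0 * quantization_error(ppl_q, ppl_bf16):.2f}%")  # 3.50%
```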
The following graph shows the quantization error of LLaMA-3.2-3B as a function of bits-per-weight (bpw) for (almost) all quantization types supported here. Note that this is the effective bpw that includes the token_embedding.weight tensor, which is quantized with more bits (typically Q6_K), and this has a significant impact on the overall bpw balance, as this tensor represents a significant fraction of the overall model size.
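As a rough illustration of this effect (my own sketch, not from the post): the parameter counts below are approximate figures for LLaMA-3.2-1B, 6.5625 bpw is the nominal size of Q6_K, and 4.25 bpw stands in for a generic 4-bit quantization of the remaining tensors; all of these numbers are assumptions used only for illustration.

```python
def effective_bpw(n_embedding: float, n_other: float,
                  bpw_embedding: float, bpw_other: float) -> float:
    """Size-weighted average bpw over all model weights."""
    total_bits = n_embedding * bpw_embedding + n_other * bpw_other
    return total_bits / (n_embedding + n_other)

# Approximate numbers for LLaMA-3.2-1B (assumptions for illustration only):
n_embedding = 128_256 * 2048      # token_embedding.weight (~0.26B parameters)
n_other = 1.24e9 - n_embedding    # everything else (~0.98B parameters)
print(f"{effective_bpw(n_embedding, n_other, 6.5625, 4.25):.2f} bpw")  # ~4.74 bpw
```

With these assumed numbers, keeping the token embedding at Q6_K adds roughly half a bit to the effective bpw of a nominally 4.25 bpw quantization of the 1B model.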
The y-axis is logarithmic, so differences can be quite large even if data points look relatively close. The cyan circles are for the new quants IQ2_K, IQ3_K, IQ4_K, IQ5_K and IQ6_K that are not available in mainline llama.cpp. The black symbols are for i-quants, the red symbols for k-quants, and the blue symbols for legacy quants (Q4_0, Q4_1, Q5_0, Q5_1).

The next graph shows results for LLaMA-3.2-3B-Instruct. The results are qualitatively very similar to the base model, with the quantization error being slightly lower.

My conclusion from these two graphs is that IQ4_K and IQ5_K are significantly better than k- or legacy quants in this bpw range.

The next graph is for the base LLaMA-3.2-1B model.
Here the quantization error is significantly larger, going below 2% only for 5+ bpw. At about 4.95 bpw, IQ4_K has a quantization error of 3%, Q4_K_S is at 4.3%, and Q4_0 is at 12.5% (!), nearly the same as IQ3_K at 3.68 bpw.

HellaSwag
The HellaSwag 0-shot score of 74.34 for the 3B base model is surprisingly high for a model of this size. But here we are more interested in looking at the impact of quantization, so I'll focus on that. The following graph shows the HellaSwag score as a function of bpw for LLaMA-3.2-3B.

As one could have expected from the perplexity results, sub-3-bpw quantization destroys the model's utility. Hence, it is more useful to focus on the 3+ bpw range, which is the purpose of the next graph.
We see that IQ4_K, IQ5_K, IQ6_K and Q6_K are basically indistinguishable from the bf16 model for the HellaSwag metric. But at less than 2 points below bf16, even IQ3_K and IQ3_S could be useful if HellaSwag is representative of the kind of tasks one intends to tackle.
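For reference, here is a minimal sketch of how a 0-shot HellaSwag score like the ones quoted above is obtained (my own illustration; llama.cpp's implementation differs in its details, and the numbers below are hypothetical): each task has four candidate endings, the model's pick is the ending with the highest log-likelihood given the context, and the score is the percentage of correct picks.

```python
from typing import Dict, List

def hellaswag_score(tasks: List[Dict]) -> float:
    """0-100 score: fraction of tasks where the highest-likelihood ending is the gold one."""
    correct = 0
    for task in tasks:
        logprobs = task["ending_logprobs"]   # log-likelihood of each candidate ending
        prediction = max(range(len(logprobs)), key=lambda i: logprobs[i])
        correct += int(prediction == task["label"])
    return 100.0 * correct / len(tasks)

# Two hypothetical tasks, one answered correctly -> score 50.0
tasks = [
    {"ending_logprobs": [-12.3, -9.8, -15.1, -11.0], "label": 1},
    {"ending_logprobs": [-8.2, -9.7, -9.5, -10.4], "label": 3},
]
print(hellaswag_score(tasks))
```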
MMLU

Here I show only results for the 3+ bpw range for LLaMA-3.2-3B in the following graph.

All quantizations above IQ3_K (3.6 bpw) are (nearly) indistinguishable from the full bf16 model according to this metric.