Better quantized models for Mistral-7B #4364
-
In my opinion, perplexity isn't a very accurate data point for comparing quantizations, in contrast to comparing the token probabilities directly across quants (via KL divergence). Interestingly, that approach also lets you measure the highest reported divergence for any particular token, and you can see that there are often outliers with strong differences at lower quantizations (these images are for the current quantization methods for Mistral in mainline). I think more effort should be put towards getting better data points for comparing quantization impact instead of relying on perplexity differences.
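For illustration, here is a minimal sketch of that kind of comparison, assuming you have already collected per-token logits from the fp16 and quantized models on the same text; the array shapes and names below are made up for the example and are not part of llama.cpp:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p_ref: np.ndarray, p_quant: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Per-token KL divergence D_KL(ref || quant) over the vocabulary axis.

    Both inputs are (n_tokens, n_vocab) probability arrays, e.g. the softmax of
    logits collected while running the same text through the fp16 model and
    through the quantized model.
    """
    p = np.clip(p_ref, eps, 1.0)
    q = np.clip(p_quant, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Random stand-ins for real model outputs, just to show the shapes involved.
rng = np.random.default_rng(0)
logits_fp16 = rng.normal(size=(256, 32000))
logits_quant = logits_fp16 + rng.normal(scale=0.05, size=logits_fp16.shape)

d = kl_divergence(softmax(logits_fp16), softmax(logits_quant))
print(f"mean KL: {d.mean():.6f}   max KL (worst token): {d.max():.6f}")
```

Reporting both the mean and the maximum per-token divergence is what surfaces the outlier tokens mentioned above, which a single perplexity number averages away.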
-
Thanks a lot for your efforts! You've really helped optimise performance in the past. Can't wait to see how your project progresses.
-
I have been working on improved quantization methods in a private clone of `llama.cpp`. While I'm not quite ready yet to publicly release the results, I'm curious to hear from others whether the perplexity improvements I observe translate into actual benefits in practical use. Hence, I have decided to publish the improved quantized models for Mistral-7B on Huggingface in this repository. The quantized models are fully compatible with the current `llama.cpp`, so they can be used out of the box.

The quantization approach for these models differs from what is available in `llama.cpp` through the use of an "importance matrix", which provides the weights for a weighted MSE minimization when preparing the quants. The importance matrix is obtained via a "calibration run" on a training dataset (I have used the training part of Wikitext).
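As a rough illustration of the general idea (not the actual llama.cpp code), here is a sketch of quantizing one block of weights by grid-searching a scale that minimizes an importance-weighted MSE; the block size, bit width, and search grid are arbitrary choices for the example:

```python
import numpy as np

def quantize_block_weighted(w: np.ndarray, importance: np.ndarray, nbits: int = 4):
    """Quantize one block of weights to signed integers with a single scale,
    choosing the scale that minimizes sum(importance * (w - scale * q)^2).

    `importance` would come from a calibration run (accumulated activation
    statistics); here it is simply an array of the same shape as `w`.
    """
    qmax = 2 ** (nbits - 1) - 1
    best_err, best_scale = np.inf, 0.0
    best_q = np.zeros_like(w, dtype=np.int8)
    base = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    # Grid search over candidate scales around the naive max-abs scale.
    for factor in np.linspace(0.7, 1.3, 61):
        scale = base * factor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.sum(importance * (w - scale * q) ** 2)
        if err < best_err:
            best_err, best_scale, best_q = err, scale, q.astype(np.int8)
    return best_q, best_scale

# Example: a block where a few positions matter much more than the rest.
rng = np.random.default_rng(1)
w = rng.normal(size=32).astype(np.float32)
importance = np.ones(32, dtype=np.float32)
importance[:4] = 50.0  # pretend calibration says these weights matter most
q, scale = quantize_block_weighted(w, importance)
print("weighted reconstruction error:", np.sum(importance * (w - scale * q) ** 2))
```

Compared with plain MSE, the weighting pushes the chosen scale towards reproducing the high-importance weights more faithfully at the expense of the rest.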
The following table shows a perplexity comparison between the improved quantized models and the current `llama.cpp` quants. The improvement in perplexity decreases with model size, so I have not added `Q5_K_M` and `Q6_K` models to the comparison. The values in the `Error` columns are defined as `(PPL(quantized model) - PPL(fp16)) / PPL(fp16)`. All perplexities are for a context size of 512.
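To spell out that definition, a tiny sketch of the `Error` metric (the PPL values below are placeholders, not entries from the table):

```python
def relative_ppl_error(ppl_quant: float, ppl_fp16: float) -> float:
    """Error as defined above: (PPL(quantized) - PPL(fp16)) / PPL(fp16)."""
    return (ppl_quant - ppl_fp16) / ppl_fp16

# Placeholder numbers, only to show the formula in action.
print(f"{relative_ppl_error(ppl_quant=5.90, ppl_fp16=5.70):.4%}")  # -> 3.5088%
```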
Given the interest in very small models, I have added an "extra small" 2-bit quantization with a model size of 2.47 GB (2.3 GiB). Except for `output.weight` (which uses `Q6_K`) and `attn_v.weight` (which uses `Q4_K`), all other tensors are quantized with `Q2_K` (so, 2.5625 bits per weight). The perplexity of this model for a context size of 512 is 6.7099, decreasing to 5.5744 for a context length of 4096.
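To make that tensor mix concrete, here is a small sketch of the per-tensor type selection described above; the function and its name-matching rules are mine, written against typical GGUF tensor names, not code from llama.cpp:

```python
def xs_2bit_quant_type(tensor_name: str) -> str:
    """Quant type per tensor for the 'extra small' 2-bit mix described above."""
    if tensor_name == "output.weight":
        return "Q6_K"
    if tensor_name.endswith("attn_v.weight"):
        return "Q4_K"
    return "Q2_K"  # everything else, at 2.5625 bits per weight

for name in ["output.weight", "blk.0.attn_v.weight", "blk.0.ffn_down.weight"]:
    print(name, "->", xs_2bit_quant_type(name))
```

Keeping `output.weight` and `attn_v.weight` at higher precision is what lifts the average bits per weight of the mix above the 2.5625 bpw of plain `Q2_K`.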