Better quantized models for Mistral-7B #4364
-
In my opinion, perplexity isn't a very accurate data point for comparing quantizations, in contrast to comparing the token probabilities directly across quants (via KL divergence). Interestingly, that approach also lets you measure the highest reported divergence for any particular token, and you can see that there are often outliers with strong differences at lower quantizations (these images are for the current quantization methods for Mistral in mainline). I think more effort should be put towards getting better data points for comparing quantization impact instead of relying on perplexity differences.
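For illustration, here is a minimal sketch of that kind of comparison, assuming you have already collected per-token logits from the fp16 and quantized models on the same text; the array shapes and names below are made up for the example and are not part of llama.cpp:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p_ref: np.ndarray, p_quant: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Per-token KL divergence D_KL(ref || quant) over the vocabulary axis.

    Both inputs are (n_tokens, n_vocab) probability arrays, e.g. the softmax of
    logits collected while running the same text through the fp16 model and
    through the quantized model.
    """
    p = np.clip(p_ref, eps, 1.0)
    q = np.clip(p_quant, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Random stand-ins for real model outputs, just to show the shapes involved.
rng = np.random.default_rng(0)
logits_fp16 = rng.normal(size=(256, 32000))
logits_quant = logits_fp16 + rng.normal(scale=0.05, size=logits_fp16.shape)

d = kl_divergence(softmax(logits_fp16), softmax(logits_quant))
print(f"mean KL: {d.mean():.6f}   max KL (worst token): {d.max():.6f}")
```

Reporting both the mean and the maximum per-token divergence is what surfaces the outlier tokens mentioned above, which a single perplexity number averages away.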
-
Thanks a lot for your efforts! You've really helped optimise performance in the past. Can't wait to see how your project progresses.
-
I have been working on improved quantization methods in a private clone of `llama.cpp`. While I'm not quite ready yet to publicly release the results, I'm curious to hear from others whether the perplexity improvements I observe translate into actual benefits in practical use. Hence, I have decided to publish the improved quantized models for Mistral-7B on Huggingface in this repository. The quantized models are fully compatible with the current `llama.cpp`, so they can be used out of the box.

The quantization approach for these models differs from what is available in `llama.cpp` through the use of an "importance matrix", which provides the weights for a weighted MSE minimization when preparing the quants. The importance matrix is obtained via a "calibration run" on a training dataset (I have used the training part of Wikitext).
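As a rough illustration of the general idea (not the actual llama.cpp code), here is a sketch of quantizing one block of weights by grid-searching a scale that minimizes an importance-weighted MSE; the block size, bit width, and search grid are arbitrary choices for the example:

```python
import numpy as np

def quantize_block_weighted(w: np.ndarray, importance: np.ndarray, nbits: int = 4):
    """Quantize one block of weights to signed integers with a single scale,
    choosing the scale that minimizes sum(importance * (w - scale * q)^2).

    `importance` would come from a calibration run (accumulated activation
    statistics); here it is simply an array of the same shape as `w`.
    """
    qmax = 2 ** (nbits - 1) - 1
    best_err, best_scale = np.inf, 0.0
    best_q = np.zeros_like(w, dtype=np.int8)
    base = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    # Grid search over candidate scales around the naive max-abs scale.
    for factor in np.linspace(0.7, 1.3, 61):
        scale = base * factor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.sum(importance * (w - scale * q) ** 2)
        if err < best_err:
            best_err, best_scale, best_q = err, scale, q.astype(np.int8)
    return best_q, best_scale

# Example: a block where a few positions matter much more than the rest.
rng = np.random.default_rng(1)
w = rng.normal(size=32).astype(np.float32)
importance = np.ones(32, dtype=np.float32)
importance[:4] = 50.0  # pretend calibration says these weights matter most
q, scale = quantize_block_weighted(w, importance)
print("weighted reconstruction error:", np.sum(importance * (w - scale * q) ** 2))
```

Compared with plain MSE, the weighting pushes the chosen scale towards reproducing the high-importance weights more faithfully at the expense of the rest.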
The following table shows a perplexity comparison between the improved quantized models and the current `llama.cpp` quants. The improvement in perplexity decreases with model size, so I have not added `Q5_K_M` and `Q6_K` models to the comparison. The values in the `Error` columns are defined as `(PPL(quantized model) - PPL(fp16)) / PPL(fp16)`. All perplexities are for a context size of 512.
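To spell out that definition, a tiny sketch of the `Error` metric (the PPL values below are placeholders, not entries from the table):

```python
def relative_ppl_error(ppl_quant: float, ppl_fp16: float) -> float:
    """Error as defined above: (PPL(quantized) - PPL(fp16)) / PPL(fp16)."""
    return (ppl_quant - ppl_fp16) / ppl_fp16

# Placeholder numbers, only to show the formula in action.
print(f"{relative_ppl_error(ppl_quant=5.90, ppl_fp16=5.70):.4%}")  # -> 3.5088%
```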
Given the interest in very small models, I have added an "extra small" 2-bit quantization with a model size of 2.47 GB (2.3 GiB). Except for `output.weight` (which uses `Q6_K`) and `attn_v.weight` (which uses `Q4_K`), all other tensors are quantized with `Q2_K` (so, 2.5625 bits per weight). The perplexity of this model for a context size of 512 is 6.7099, decreasing to 5.5744 for a context length of 4096.
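To make that tensor mix concrete, here is a small sketch of the per-tensor type selection described above; the function and its name-matching rules are mine, written against typical GGUF tensor names, not code from llama.cpp:

```python
def xs_2bit_quant_type(tensor_name: str) -> str:
    """Quant type per tensor for the 'extra small' 2-bit mix described above."""
    if tensor_name == "output.weight":
        return "Q6_K"
    if tensor_name.endswith("attn_v.weight"):
        return "Q4_K"
    return "Q2_K"  # everything else, at 2.5625 bits per weight

for name in ["output.weight", "blk.0.attn_v.weight", "blk.0.ffn_down.weight"]:
    print(name, "->", xs_2bit_quant_type(name))
```

Keeping `output.weight` and `attn_v.weight` at higher precision is what lifts the average bits per weight of the mix above the 2.5625 bpw of plain `Q2_K`.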