Quantization improvements (2) #302

ikawrakow · 2025-03-31T14:37:39Z

This PR is a follow up of #295. It applies the same approach to type-1 quants (Q2_K, Q4_K, Q5_K, Q4_1, Q5_1) and to IQ3_K. Quantization speed for IQ3_K is improved by a significant margin (up to 40%). Quantization speed for type-1 quants is also slightly improved ($\le 15$%). The changes do not result in PPL improvement for all tested models, but do improve PPL for the models that are more difficult to quantize (e.g., the LLaMA-3 series of models), and avoid a near catastrophic failure of IQ3_K on DeepSeek-Lite.

The following table shows PPL comparisons between the main branch and this PR for LLaMA-v1-7B¹(L1-7B in the table), LLaMA-v2-7B¹ (L2-7B), Mistral-7B¹ (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). Context is always 512 tokens. Also given are the quantization times (Q-time for short in the table) in seconds on a Ryzen-7950X CPU. Tested is "pure" quantization (i.e., using the --pure option of llama-quantize) with token embeddings and output tensor set to Q8_0. The quantization command line is

./bin/llama-quantize --imatrix $imatrix --token-embedding-type q8_0 --output-tensor-type q8_0 --pure $model $output $quant

Model	Quantization	PPL (main)	PPL (this PR)	Q-time (main)	Q-time (this PR)
L1-7B	Q4_1	5.9773	5.9760	N/A²	N/A²
L2-7B	Q4_1	5.8676	5.8691	33.6	29.9
M-7B	Q4_1	5.7452	5.7471	36.7	32.3
L3-8B	Q4_1	7.5309	7.5277	38.1	34.0
DSL	Q4_1	6.8639	6.8584	84.1	75.3
L1-7B	Q5_1	5.9183	5.9182	N/A²	N/A²
L2-7B	Q5_1	5.8164	5.8175	35.6	30.8
M-7B	Q5_1	5.7067	5.7074	37.6	33.6
L3-8B	Q5_1	7.3749	7.3759	38.7	34.7
DSL	Q5_1	6.7881	6.7875	86.4	76.5
L1-7B	Q2_K	7.3154	7.2989	N/A^2,3	N/A²
L2-7B	Q2_K	7.3044	7.2558	36.4	32.2
M-7B	Q2_K	6.9507	6.9273	38.4	35.0
L3-8B	Q2_K	11.546	11.458	40.1	36.5
DSL	Q2_K	8.3822	8.3346	89.6	83.4
L1-7B	Q4_K	5.9801	5.9779	N/A²	N/A²
L2-7B	Q4_K	5.8675	5.8673	34.1	30.7
M-7B	Q4_K	5.7449	5.7406	37.0	32.8
L3-8B	Q4_K	7.5192	7.5157	38.2	34.5
DSL	Q4_K	6.8607	6.8570	75.7	68.5
L1-7B	Q5_K	5.9314	5.9299	N/A²	N/A²
L2-7B	Q5_K	5.8144	5.8196	35.6	31.2
M-7B	Q5_K	5.7030	5.7064	37.3	34.1
L3-8B	Q5_K	7.3941	7.3812	38.9	34.6
DSL	Q5_K	6.7929	6.7903	76.5	69.5
L1-7B	IQ3_K	6.1393	6.1377	N/A²	N/A²
L2-7B	IQ3_K	6.0251	6.0227	44.7	36.9
M-7B	IQ3_K	5.8835	5.8855	54.6	39.5
L3-8B	IQ3_K	7.9148	7.9189	56.3	41.4
DSL	IQ3_K	7.3143	7.0409	116.4	92.5

¹ Why use such ancient models? The LLaMA-v1 models were the basis for k-quants development. I-quants were developed using LLaMA-v1, LLaMA-v2 and Mistral-7B. In my experience, if a quantization technique does well on all 3 of these, it is (almost) guaranteed to do well on any other model out there.

² I have this model on an old HDD. In this case quantization time is dominated by the time needed to read the data from the HDD. I could have copied the model to the SSD drive, but I think the timing for the other models gives enough indication of the relative performance.

Not much of a difference for most models, but this change avoids what it looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main).

saood06 · 2025-04-02T10:55:25Z

and avoid a near catastrophic failure of IQ3_K on DeepSeek-Lite.

Interestingly IQ3_K before this PR was actually worse than Q3_K before #295 for DSL.

Iwan Kawrakow added 2 commits March 29, 2025 09:12

iq3_k: slightly better quantization

5686031

Not much of a difference for most models, but this change avoids what it looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main).

Small improvement for type-1 quants

86513a6

ikawrakow merged commit 190e786 into main Apr 1, 2025

saood06 mentioned this pull request Apr 30, 2025

README language is vague wrt. "quantization improvements" #362

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Quantization improvements (2) #302

Quantization improvements (2) #302

Uh oh!

ikawrakow commented Mar 31, 2025

Uh oh!

saood06 commented Apr 2, 2025

Uh oh!

Uh oh!

Quantization improvements (2) #302

Quantization improvements (2) #302

Uh oh!

Conversation

ikawrakow commented Mar 31, 2025

Uh oh!

saood06 commented Apr 2, 2025

Uh oh!

Uh oh!