Quantization improvements (2) #302

Merged: ikawrakow merged 2 commits into main on Apr 1, 2025

Conversation

ikawrakow (Owner)

This PR is a follow-up of #295. It applies the same approach to type-1 quants (Q2_K, Q4_K, Q5_K, Q4_1, Q5_1) and to IQ3_K. Quantization speed for IQ3_K is improved by a significant margin (up to 40%), and quantization speed for the type-1 quants is also slightly improved ($\le 15$%). The changes do not improve PPL for every tested model, but they do improve PPL for the models that are harder to quantize (e.g., the LLaMA-3 series), and they avoid a near-catastrophic failure of IQ3_K on DeepSeek-Lite.
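For context on the terminology: in ggml, "type-1" quants store both a per-block scale and a per-block minimum, so a weight is reconstructed as x = d·q + m (sign conventions for the minimum differ slightly between formats), whereas type-0 quants use x = d·q only. Quantization then amounts to choosing d, m, and the integer indices q that minimize the importance-weighted reconstruction error. The sketch below is only a simplified illustration of that layout and the corresponding dequantization, not the actual ggml code; real blocks such as Q4_1 store d and m as fp16 and pack two 4-bit indices per byte, and the names used here are invented for the example.

```c
#include <stdint.h>

// Simplified "type-1" block in the spirit of Q4_1: BLOCK_SIZE weights share
// one scale d and one minimum m; each weight is stored as a small integer
// index q and reconstructed as x = d*q + m. Real ggml blocks use fp16 for
// d and m and pack two 4-bit indices per byte; this sketch trades that
// compactness for readability.
#define BLOCK_SIZE 32

typedef struct {
    float   d;              // per-block scale
    float   m;              // per-block minimum (offset)
    uint8_t qs[BLOCK_SIZE]; // quantized indices, one per byte here
} block_type1_demo;

// Dequantize one block. This reconstruction is what the quantization code
// tries to make as close as possible to the original weights when it
// searches for the best d and m for the block.
static void dequantize_type1_demo(const block_type1_demo *b, float *x) {
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        x[i] = b->d * (float)b->qs[i] + b->m;
    }
}
```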

The following table shows PPL comparisons between the main branch and this PR for LLaMA-v1-7B¹ (L1-7B in the table), LLaMA-v2-7B¹ (L2-7B), Mistral-7B¹ (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). Context is always 512 tokens. Also given are the quantization times (Q-time for short in the table) in seconds on a Ryzen-7950X CPU. Tested is "pure" quantization (i.e., using the --pure option of llama-quantize), with the token embeddings and output tensor set to Q8_0. The quantization command line is

./bin/llama-quantize --imatrix $imatrix --token-embedding-type q8_0 --output-tensor-type q8_0 --pure $model $output $quant
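The PPL values themselves are presumably obtained with the llama.cpp-style perplexity tool at a context of 512; the exact test corpus is not stated in this PR, so $test_text below is only a placeholder:

./bin/llama-perplexity -m $output -f $test_text -c 512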
| Model | Quantization | PPL (main) | PPL (this PR) | Q-time (main), s | Q-time (this PR), s |
|-------|--------------|------------|---------------|------------------|---------------------|
| L1-7B | Q4_1  | 5.9773 | 5.9760 | N/A²   | N/A² |
| L2-7B | Q4_1  | 5.8676 | 5.8691 | 33.6   | 29.9 |
| M-7B  | Q4_1  | 5.7452 | 5.7471 | 36.7   | 32.3 |
| L3-8B | Q4_1  | 7.5309 | 7.5277 | 38.1   | 34.0 |
| DSL   | Q4_1  | 6.8639 | 6.8584 | 84.1   | 75.3 |
| L1-7B | Q5_1  | 5.9183 | 5.9182 | N/A²   | N/A² |
| L2-7B | Q5_1  | 5.8164 | 5.8175 | 35.6   | 30.8 |
| M-7B  | Q5_1  | 5.7067 | 5.7074 | 37.6   | 33.6 |
| L3-8B | Q5_1  | 7.3749 | 7.3759 | 38.7   | 34.7 |
| DSL   | Q5_1  | 6.7881 | 6.7875 | 86.4   | 76.5 |
| L1-7B | Q2_K  | 7.3154 | 7.2989 | N/A²,³ | N/A² |
| L2-7B | Q2_K  | 7.3044 | 7.2558 | 36.4   | 32.2 |
| M-7B  | Q2_K  | 6.9507 | 6.9273 | 38.4   | 35.0 |
| L3-8B | Q2_K  | 11.546 | 11.458 | 40.1   | 36.5 |
| DSL   | Q2_K  | 8.3822 | 8.3346 | 89.6   | 83.4 |
| L1-7B | Q4_K  | 5.9801 | 5.9779 | N/A²   | N/A² |
| L2-7B | Q4_K  | 5.8675 | 5.8673 | 34.1   | 30.7 |
| M-7B  | Q4_K  | 5.7449 | 5.7406 | 37.0   | 32.8 |
| L3-8B | Q4_K  | 7.5192 | 7.5157 | 38.2   | 34.5 |
| DSL   | Q4_K  | 6.8607 | 6.8570 | 75.7   | 68.5 |
| L1-7B | Q5_K  | 5.9314 | 5.9299 | N/A²   | N/A² |
| L2-7B | Q5_K  | 5.8144 | 5.8196 | 35.6   | 31.2 |
| M-7B  | Q5_K  | 5.7030 | 5.7064 | 37.3   | 34.1 |
| L3-8B | Q5_K  | 7.3941 | 7.3812 | 38.9   | 34.6 |
| DSL   | Q5_K  | 6.7929 | 6.7903 | 76.5   | 69.5 |
| L1-7B | IQ3_K | 6.1393 | 6.1377 | N/A²   | N/A² |
| L2-7B | IQ3_K | 6.0251 | 6.0227 | 44.7   | 36.9 |
| M-7B  | IQ3_K | 5.8835 | 5.8855 | 54.6   | 39.5 |
| L3-8B | IQ3_K | 7.9148 | 7.9189 | 56.3   | 41.4 |
| DSL   | IQ3_K | 7.3143 | 7.0409 | 116.4  | 92.5 |

¹ Why use such ancient models? The LLaMA-v1 models were the basis for k-quants development. I-quants were developed using LLaMA-v1, LLaMA-v2, and Mistral-7B. In my experience, if a quantization technique does well on all three of these, it is (almost) guaranteed to do well on any other model out there.

² I have this model on an old HDD, so quantization time is dominated by the time needed to read the data from the HDD. I could have copied the model to the SSD, but the timings for the other models give enough of an indication of the relative performance.

Iwan Kawrakow added 2 commits on March 29, 2025 at 09:12. From the commit messages:

> Not much of a difference for most models, but this change avoids what looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main).
ikawrakow merged commit 190e786 into main on Apr 1, 2025.
saood06 (Collaborator) commented on Apr 2, 2025:

> and avoid a near-catastrophic failure of IQ3_K on DeepSeek-Lite.

Interestingly, IQ3_K before this PR was actually worse for DSL than Q3_K was before #295.
