-
" Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight." Code here: https://github.com/jerry-chee/QuIP |
-
Yeah, this looks very promising. Much better than the current q2 and q3 implemented in llama.cpp right now. Edit: ignore my ramblings. I have a tendency to go overboard with hype when I see new stuff, rather than actually evaluating the data at hand. Sorry for that.
-
@ikawrakow what do you think about this?
-
Sorry if I wasn't clear, but that's not what I was trying to say. Their claim is that it approaches 16-bit results, and also that as the parameter size of the model increases, so does the accuracy (an effect that doesn't seem to occur with GGML quantizations). Like I said at the end of my post, whether the claims are accurate isn't something I really have an opinion on. So my post was basically "If what they say checks out, then this looks good". If what I said gave the impression of disparaging k-quants then I apologize: that's not what I intended.
You might be missing something here. The stuff under "baseline processing" that the other person linked isn't using their method. The stuff under "incoherence processing" is, at least partially; i.e., it may be someone else's quantization approach combined with their rounding algorithm. I believe the "QuIP" row under "Incoherence Processing (ours)" is what the paper and its claims are mainly referring to.
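To make the "rounding algorithm" distinction concrete, here is a rough sketch of the general adaptive-rounding idea (not the paper's actual LDLQ algorithm; the calibration data, sizes, and unit-scale grid are made up): instead of rounding every weight independently to its nearest grid point, round one weight at a time and re-fit the remaining weights so the layer's output on calibration activations is preserved.

```python
# Rough sketch of adaptive rounding vs. plain round-to-nearest.
# This is NOT the paper's LDLQ; it only illustrates the error-feedback idea.
import numpy as np

def greedy_round(w, X, grid):
    """Round coordinates one at a time; after each, re-solve the remaining
    continuous coordinates by least squares so that X @ w is preserved."""
    n = len(w)
    q = np.zeros(n)
    target = X @ w
    w_rest = w.astype(float).copy()
    for i in range(n):
        q[i] = grid[np.argmin(np.abs(grid - w_rest[i]))]
        if i + 1 < n:
            resid = target - X[:, :i + 1] @ q[:i + 1]
            w_rest[i + 1:] = np.linalg.lstsq(X[:, i + 1:], resid, rcond=None)[0]
    return q

rng = np.random.default_rng(1)
X = rng.standard_normal((256, 8))          # toy "calibration" activations
w = rng.standard_normal(8)                 # toy weight vector
grid = np.array([-1.5, -0.5, 0.5, 1.5])    # unit-scale 2-bit grid

q_near = grid[np.abs(grid[None, :] - w[:, None]).argmin(axis=1)]
q_adap = greedy_round(w, X, grid)
print("output error, nearest rounding :", np.linalg.norm(X @ (w - q_near)))
print("output error, adaptive rounding:", np.linalg.norm(X @ (w - q_adap)))
```

So a table row can mix and match: another method's quantization pipeline, but with this kind of rounding swapped in.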
-
Someone is already working on porting it to LLama:
-
The 3-bit curves seem very interesting, because this could potentially be a successor to the K, K_M and K_S quantizations available now. If I am reading the paper correctly, 3-bit QuIP could result in better performance (lower perplexity) than the current Q4_K_M while also consuming less VRAM. But of course we have seen similar promises before with SqueezeLLM. I hope QuIP is not just a fantasy for LLama at this point. Curious to read what others think about it.
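As a rough sanity check on the VRAM side (weight storage only, ignoring KV cache and activations; the ~4.85 bits/weight figure for Q4_K_M and the parameter counts are approximate assumptions):

```python
# Back-of-the-envelope weight-only footprint at a given effective bits/weight.
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for label, bpw in [("Q4_K_M (~4.85 bpw)", 4.85), ("3-bit (3.0 bpw)", 3.0)]:
    for nb in (7, 13, 70):
        print(f"{label:>20} | {nb:>2}B params: ~{weight_gib(nb, bpw):.1f} GiB")
```

That works out to roughly 24 GiB of weights for a 70B model at 3 bpw versus around 40 GiB at Q4_K_M-like bit widths, if the perplexity claims really hold up.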
-
https://arxiv.org/abs/2307.13304
Just found this in r/localllama. Would this be useful?