-
" Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight." Code here: https://github.com/jerry-chee/QuIP |
-
Yeah, this looks very promising. Much better than the current q2 and q3 implemented in llama.cpp right now. Edit: ignore my ramblings. I have a tendency to go overboard with hype when I see new stuff, rather than actually evaluating the data at hand. Sorry for that.
-
@ikawrakow what do you think about this?
-
Sorry if I wasn't clear, but that's not what I was trying to say. Their claim is that it approaches 16-bit results, and also that as the parameter size of the model increases, so does the accuracy (an effect that doesn't seem to occur with GGML quantizations). Like I said at the end of my post, whether the claims are accurate isn't something I really have an opinion on. So my post was basically "If what they say checks out, then this looks good". If what I said gave the impression of disparaging k-quants then I apologize: that's not what I intended.
You might be missing something here. The stuff under "baseline processing" that the other person linked isn't using their method. The stuff under "incoherence processing" is, at least partially; i.e., it may be someone else's quantization approach combined with their rounding algorithm. I believe the "QuIP" row under "Incoherence Processing (ours)" is what the paper and its claims are mainly referring to.
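To make the "rounding algorithm" distinction concrete, here is a rough sketch of the general adaptive-rounding idea (not the paper's actual LDLQ algorithm; the calibration data, sizes, and unit-scale grid are made up): instead of rounding every weight independently to its nearest grid point, round one weight at a time and re-fit the remaining weights so the layer's output on calibration activations is preserved.

```python
# Rough sketch of adaptive rounding vs. plain round-to-nearest.
# This is NOT the paper's LDLQ; it only illustrates the error-feedback idea.
import numpy as np

def greedy_round(w, X, grid):
    """Round coordinates one at a time; after each, re-solve the remaining
    continuous coordinates by least squares so that X @ w is preserved."""
    n = len(w)
    q = np.zeros(n)
    target = X @ w
    w_rest = w.astype(float).copy()
    for i in range(n):
        q[i] = grid[np.argmin(np.abs(grid - w_rest[i]))]
        if i + 1 < n:
            resid = target - X[:, :i + 1] @ q[:i + 1]
            w_rest[i + 1:] = np.linalg.lstsq(X[:, i + 1:], resid, rcond=None)[0]
    return q

rng = np.random.default_rng(1)
X = rng.standard_normal((256, 8))          # toy "calibration" activations
w = rng.standard_normal(8)                 # toy weight vector
grid = np.array([-1.5, -0.5, 0.5, 1.5])    # unit-scale 2-bit grid

q_near = grid[np.abs(grid[None, :] - w[:, None]).argmin(axis=1)]
q_adap = greedy_round(w, X, grid)
print("output error, nearest rounding :", np.linalg.norm(X @ (w - q_near)))
print("output error, adaptive rounding:", np.linalg.norm(X @ (w - q_adap)))
```

So a table row can mix and match: another method's quantization pipeline, but with this kind of rounding swapped in.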
-
Someone is already working on porting it to LLama:
-
The 3-bit curves seem very interesting, because this could potentially be a successor to the K, K_M and K_S quantizations available now. If I am reading the paper correctly, 3-bit QuIP could result in better performance (lower perplexity) than the current Q4_K_M while also consuming less VRAM. But of course we have seen similar promises before with SqueezeLLM. I hope QuIP is not just a fantasy for LLama at this point. Curious to read what others think about it.
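As a rough sanity check on the VRAM side (weight storage only, ignoring KV cache and activations; the ~4.85 bits/weight figure for Q4_K_M and the parameter counts are approximate assumptions):

```python
# Back-of-the-envelope weight-only footprint at a given effective bits/weight.
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for label, bpw in [("Q4_K_M (~4.85 bpw)", 4.85), ("3-bit (3.0 bpw)", 3.0)]:
    for nb in (7, 13, 70):
        print(f"{label:>20} | {nb:>2}B params: ~{weight_gib(nb, bpw):.1f} GiB")
```

That works out to roughly 24 GiB of weights for a 70B model at 3 bpw versus around 40 GiB at Q4_K_M-like bit widths, if the perplexity claims really hold up.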
-
https://arxiv.org/abs/2307.13304
Just found this in r/localllama. Would this be useful?