General questions around quant methods and types #6561
younesbelkada started this conversation in General
Replies: 1 comment
- Check out #5063 for more detailed information.
Hi everyone,
First of all, thanks for this great project and for all the very useful information available through the discussions and PRs.
I would like to start understanding how the underlying quantization methods in llama.cpp work. I might be missing important details, so please correct me at any point!
I started learning about the internals of GGUF quants here: #1684. My questions are mostly about the 3/4-bit quantization schemes, not the recent addition of 1-bit quants.
My understanding is that the core building block of GGUF quants is group-wise (block-wise) quantization, is this correct? In that case, are the activations always kept in half / full precision?
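To make sure I understand the group-wise idea, here is my own toy sketch of what I imagine a 4-bit block quant looks like (one scale per block of 32 weights, quants packed two per byte). This is just my illustration of the concept, not llama.cpp's actual kernels, so please correct anything that is off:

```cpp
// Toy sketch of block-wise (group-wise) quantization, loosely in the spirit
// of a 4-bit GGUF type. Not the actual llama.cpp implementation.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int BLOCK_SIZE = 32;  // one scale shared by 32 weights (my assumption)

struct BlockQ4 {
    float   scale;                   // the real formats store this more compactly (e.g. fp16)
    uint8_t quants[BLOCK_SIZE / 2];  // two 4-bit values packed per byte
};

BlockQ4 quantize_block(const float *x) {
    // symmetric toy scheme: pick the scale so the largest magnitude maps to +/-7
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) amax = std::max(amax, std::fabs(x[i]));

    BlockQ4 b{};
    b.scale = amax / 7.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;

    for (int i = 0; i < BLOCK_SIZE; i += 2) {
        auto q = [&](float v) {
            // round to the nearest 4-bit level and store with an offset of 8
            return (uint8_t)(std::clamp((int)std::lround(v * inv), -8, 7) + 8);
        };
        b.quants[i / 2] = q(x[i]) | (q(x[i + 1]) << 4);
    }
    return b;
}

// Dequantize back to float; the activations themselves would stay fp16/fp32.
void dequantize_block(const BlockQ4 &b, float *out) {
    for (int i = 0; i < BLOCK_SIZE; i += 2) {
        out[i]     = ((int)(b.quants[i / 2] & 0x0F) - 8) * b.scale;
        out[i + 1] = ((int)(b.quants[i / 2] >> 4)   - 8) * b.scale;
    }
}
```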
Some quant methods seem to quantize the scales as well - do you have a rough idea of the overhead / trade-off this introduces compared to keeping the scales unquantized?
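To make the question concrete, here is my back-of-the-envelope arithmetic for why I assume the scales are quantized at all (the sub-block / super-block sizes below are my own assumptions for illustration, not a statement about any specific GGUF type). What I am less sure about is the cost side, i.e. how much accuracy or extra decode work this trades away:

```cpp
// Back-of-the-envelope comparison of the extra bits per weight spent on scales,
// with and without quantizing the sub-block scales. Sizes are assumed, not
// taken from any particular GGUF type.
#include <cstdio>

int main() {
    const int super_block = 256;                      // weights per super-block (assumed)
    const int sub_block   = 16;                       // weights per sub-block, one scale each (assumed)
    const int n_scales    = super_block / sub_block;  // 16 sub-block scales

    // Option A: every sub-block scale stored directly as fp16
    const double bits_a = n_scales * 16.0;

    // Option B: sub-block scales quantized to 6 bits, plus one fp16 super-block scale
    const double bits_b = n_scales * 6.0 + 16.0;

    printf("fp16 scales     : %.3f extra bits per weight\n", bits_a / super_block);  // 1.000
    printf("quantized scales: %.3f extra bits per weight\n", bits_b / super_block);  // 0.438
    return 0;
}
```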
If I understood correctly, different quant schemes are usually combined - for example, the query layer could be quantized with Q3_K while the key layer uses Q5_K. I first thought this was architecture-specific, meaning each arch has its own combination, but that does not seem to be the case - so I was wondering how the combination of quant types per tensor is determined? (I tried to phrase my mental model as a small sketch below, after the screenshots.)
E.g., below are two screenshots from two different models derived from the same base model (mistral-7b); as you can see, the combinations of quant types look different.
Below is for TheBloke/CapybaraHermes-2.5-Mistral-7B
Below is for NousResearch/Hermes-2-Pro-Mistral-7B
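To phrase the question more precisely, this is roughly the mental model I have of how the per-tensor choice could work: some function of the requested overall quant type plus the tensor's name/role. The rule and names below are entirely hypothetical, I have not checked them against llama.cpp's actual selection logic, so I would love to know how it really works:

```cpp
// Hypothetical sketch of per-tensor quant type selection, only to illustrate
// my question. Not taken from llama.cpp's code.
#include <string>

enum class QuantType { Q3_K, Q4_K, Q5_K, Q6_K };

QuantType pick_tensor_type(const std::string &tensor_name, QuantType requested) {
    // Hypothetical rule: "sensitive" tensors get bumped to a higher-bit type,
    // everything else keeps the requested type.
    const bool is_output = tensor_name.find("output.weight") != std::string::npos;
    const bool is_attn_v = tensor_name.find("attn_v")        != std::string::npos;

    if (is_output) return QuantType::Q6_K;
    if (is_attn_v && requested == QuantType::Q3_K) return QuantType::Q5_K;
    return requested;
}
```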
cc @ikawrakow @ggerganov
Thanks so much, and let me know if there is another discussion / issue I might have overlooked!