Explanation of the choice behind selective quantization + IQ quant grid #14091
`src/llama-quant.cpp` is where the quantization layer logic lives. The file is full of empirical heuristics that modify the quantization of individual tensors as a function of layer number and the quant type selected for the file. If you want to change the default heuristics, this is where the mods go, or you can try the regex-based override that was recently added (I don't use it; I found it much easier to modify `llama-quant.cpp` directly for my needs). I have been investigating hybrid layer quants (using fewer bits at deep layers and more bits at cortex layers) for some of the SOTA models and finding quite good results: the hybrid quant models can be made significantly smaller than homogeneous layer quants while performing just as well in my tests. I started with Llama Scout and am moving on to Qwen 3, Mistral Small, and Gemma 3. I summarize the layer quants I use for each model on my model page at https://huggingface.co/steampunque. Original Scout experiment discussion: #13040.
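To make the hybrid layer idea concrete, here is a minimal C++ sketch of the kind of layer-indexed heuristic you could wire into `llama-quant.cpp`. It is not the actual code: the function name, enum, and thresholds are invented for illustration, and it assumes "deep" means the later layers of the network.

```cpp
// Hypothetical sketch of a hybrid layer quant heuristic; not the
// actual llama-quant.cpp logic. Assumes "deep" = later layers.
enum example_type { EX_Q3_K, EX_Q4_K, EX_Q5_K };

// Spend more bits on early layers and fewer bits on deep layers.
static example_type pick_layer_type(int i_layer, int n_layer) {
    if (i_layer < n_layer/4)   return EX_Q5_K; // early layers: more bits
    if (i_layer < 3*n_layer/4) return EX_Q4_K; // middle layers
    return EX_Q3_K;                            // deep layers: fewer bits
}
```

In the actual file the choice also depends on the tensor's role (attention vs. FFN, embeddings, output), so a real modification would typically branch on the tensor name as well as the layer index.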
A few questions:

1. For any quantization type (K-quant or IQ quant), different layers of a given model are quantized to different bit widths. For example, in a Llama 70B model the output tensor is Q6_K in a Q2_K quantization and Q5_K in an IQ1_M quantization. How was this choice made? Was it empirical? And where in the codebase is it defined that quantizing to Q2_K means these particular tensors get these particular types (it's not in `quantize.cpp`)? How can I make my own choice of per-tensor quants (using `--tensor-type`)?
2. For the IQ quants (in `ggml-quants.c`), `kgrid_2bit_256` and other grid arrays are defined. How were these grids created? (See the sketch after this list for how such a grid can be used.)
3. I find mixed answers on whether K-quants employ K-means clustering or not. I don't see any clustering implementation in the code, so what is the `K` supposed to mean here?
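Regarding question 2, the following is a minimal C++ sketch of how a quant grid can serve as a codebook, so that quantizing a group of weights reduces to a nearest-grid-point search. It is illustrative only: the two-entry `example_grid` and the `nearest_grid_point` function are made up here and do not reflect the actual packing, grid contents, or search code in `ggml-quants.c`.

```cpp
// Illustrative codebook quantization; not the actual ggml-quants.c code.
#include <cstdint>

// A tiny made-up 2-entry "grid" of 8-value points (real grids such as
// kgrid_2bit_256 hold many more entries, packed differently).
static const int8_t example_grid[2][8] = {
    { 1, 1, 1, 1, 1, 1, 1, 1 },
    { 3, 1, 1, 1, 1, 1, 1, 1 },
};

// Return the index of the grid point with the smallest squared
// distance to the scaled input group x[0..7].
static int nearest_grid_point(const float * x, float scale) {
    int   best   = 0;
    float best_d = 1e30f;
    for (int g = 0; g < 2; ++g) {
        float d = 0.0f;
        for (int j = 0; j < 8; ++j) {
            const float diff = x[j] - scale*example_grid[g][j];
            d += diff*diff;
        }
        if (d < best_d) { best_d = d; best = g; }
    }
    return best;
}
```

The real grids are much larger and their points are bit-packed, but the basic idea of encoding a group of weights as the index of its nearest codebook point is the same.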