- I think that before we put in the effort to implement mixed precision within a tensor we should wait for more data on how well the new ExLlama format actually performs. Unless I am misunderstanding something, the format is effectively doing an optimization with a very high number of parameters (one parameter being each of the precisions within a tensor), and as such there is a substantial risk of overfitting. There will also be issues with combining mixed precision within a tensor with the current implementation of mul_mat_q, because the tile sizes (and the corresponding shared memory use) differ between the quantization formats. For mul_mat_q in its current form to work properly, the quantization format within a row and across blocks of 128 rows would need to be the same. Unfortunately this would greatly limit the granularity with which you can set a precision. If you were to instead resort to dequantizing the weight matrix to f16/f32 and using cuBLAS GEMM, then the savings in VRAM from the mixed-precision weights may be less than the VRAM needed for the temporary buffer that holds the dequantized weights.
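  To make that last trade-off concrete, here is a rough back-of-the-envelope sketch. All numbers are assumptions for illustration only (a 70B parameter count, a hypothetical 0.10 bpw saving from finer mixed precision, and a 32000x8192 output matrix as the largest tensor), not measurements from llama.cpp or ExLlamaV2; the outcome of the comparison depends entirely on these assumptions.

  ```c
  // Hedged back-of-the-envelope sketch of the trade-off described above.
  // All numbers are assumed example values, not measurements.
  #include <stdio.h>

  int main(void) {
      const double    n_params   = 70e9;   // assumed total weight count
      const double    bpw_saving = 0.10;   // assumed bpw saved by finer mixed precision
      const long long out_rows   = 32000;  // assumed vocab size (largest matrix)
      const long long out_cols   = 8192;   // assumed hidden size

      // VRAM saved across the whole model by the finer mixed-precision scheme.
      double saving_bytes = n_params * bpw_saving / 8.0;

      // Temporary buffer needed to hold the largest matrix dequantized for cuBLAS GEMM.
      double tmp_f16 = (double)out_rows * (double)out_cols * 2.0; // f16
      double tmp_f32 = (double)out_rows * (double)out_cols * 4.0; // f32

      printf("model-wide saving at %.2f bpw: %8.1f MiB\n", bpw_saving, saving_bytes / (1024.0 * 1024.0));
      printf("dequant buffer (f16)         : %8.1f MiB\n", tmp_f16 / (1024.0 * 1024.0));
      printf("dequant buffer (f32)         : %8.1f MiB\n", tmp_f32 / (1024.0 * 1024.0));
      return 0;
  }
  ```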
- Thank you for this clear explanation. I wonder how much we could still gain with a "Q2_K_S/M/L" system and the additional granularity allowed within the constraints of the MMQ implementation, and whether it would be worth it in terms of perplexity growth compared to the 33/34B models. Shallow note: I like the ExLlamaV2 notation, though, because the llama.cpp Q2_K is actually closer to a "Q3_XS".
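  On that naming point, a quick sketch of the effective bits per weight of a pure Q2_K tensor, derived from the block_q2_K layout in ggml (per 256-weight super-block: 16 bytes of packed 4-bit scales and mins, 64 bytes of 2-bit quants, and two f16 super-block scales). The layout is quoted from memory, so treat the result as approximate; a whole-model "Q2_K" quantization also keeps some tensors at higher precision, which pushes the model-level bpw higher still.

  ```c
  // Hedged sketch: effective bpw of a pure Q2_K tensor from the block layout.
  #include <stdio.h>

  int main(void) {
      const int QK_K        = 256;        // weights per super-block
      const int scales_b    = QK_K / 16;  // 16 bytes: 4-bit scale + 4-bit min per 16 weights
      const int qs_b        = QK_K / 4;   // 64 bytes: 2-bit quants
      const int d_dmin_b    = 2 + 2;      // two f16 super-block scales
      const int block_bytes = scales_b + qs_b + d_dmin_b;

      printf("Q2_K block: %d bytes per %d weights -> %.4f bpw\n",
             block_bytes, QK_K, 8.0 * block_bytes / QK_K);
      // Note: the whole-model Q2_K preset mixes in higher-bit tensors
      // (e.g. for attention V), so the file-level bpw is noticeably higher.
      return 0;
  }
  ```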
- ExLlamaV2 can now load 70B models on a single RTX 3090/4090. Tested successfully on my side in Ooba with a "Q_2.55bpw_K" quant and 2048 ctx.
  I'd love to see such a thing in llama.cpp, especially considering the experience already gained with the current K-quants about the relative importance of each weight for the perplexity gained or lost through its quantization.
  Also, in combination with the KV-Q8_0 work of Johannes Gaessler, such a llama.cpp would be quite the milestone for casual users!
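  For a rough sense of what quantizing the KV cache to Q8_0 would buy at that context length, here is a hedged estimate. The attention dimensions are assumptions (a Llama-2-70B-like setup with 80 layers and 8 KV heads of dimension 128 via GQA), and Q8_0 is taken as 34 bytes per 32 values (32 int8 quants plus one f16 scale); these are not measurements from the work mentioned above.

  ```c
  // Hedged estimate of KV-cache VRAM at f16 vs Q8_0, assumed 70B-like dims, 2048 ctx.
  #include <stdio.h>

  int main(void) {
      const long long n_layer  = 80;        // assumed layer count
      const long long n_ctx    = 2048;      // context length
      const long long n_emb_kv = 8 * 128;   // assumed KV heads * head dim (GQA)

      const long long n_elems  = 2 * n_layer * n_ctx * n_emb_kv; // K and V

      double f16_bytes  = (double)n_elems * 2.0;
      double q8_0_bytes = (double)n_elems * 34.0 / 32.0; // 8.5 bits per value

      printf("KV cache f16 : %7.1f MiB\n", f16_bytes  / (1024.0 * 1024.0));
      printf("KV cache Q8_0: %7.1f MiB\n", q8_0_bytes / (1024.0 * 1024.0));
      return 0;
  }
  ```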