- I think that before we put in the effort to implement mixed precision within a tensor we should wait for more data on how well the new ExLlama format actually performs. Unless I am misunderstanding something, the format is effectively doing an optimization with a very high number of parameters (one parameter being each of the precisions within a tensor), and as such there is a substantial risk of overfitting. There will also be issues with combining mixed precision within a tensor with the current implementation of mul_mat_q, because the tile sizes (and the corresponding shared memory use) differ between the quantization formats. For mul_mat_q in its current form to work properly, the quantization format within a row and across blocks of 128 rows would need to be the same. Unfortunately this would greatly limit the granularity with which you can set a precision. If you were to instead resort to dequantizing the weight matrix to f16/f32 and using cuBLAS GEMM, then the savings in VRAM from the mixed-precision weights may be less than the VRAM needed for the temporary buffer that holds the dequantized weights.
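  To make that last trade-off concrete, here is a rough back-of-the-envelope sketch. All numbers are assumptions for illustration only (a 70B parameter count, a hypothetical 0.10 bpw saving from finer mixed precision, and a 32000x8192 output matrix as the largest tensor), not measurements from llama.cpp or ExLlamaV2; the outcome of the comparison depends entirely on these assumptions.

  ```c
  // Hedged back-of-the-envelope sketch of the trade-off described above.
  // All numbers are assumed example values, not measurements.
  #include <stdio.h>

  int main(void) {
      const double    n_params   = 70e9;   // assumed total weight count
      const double    bpw_saving = 0.10;   // assumed bpw saved by finer mixed precision
      const long long out_rows   = 32000;  // assumed vocab size (largest matrix)
      const long long out_cols   = 8192;   // assumed hidden size

      // VRAM saved across the whole model by the finer mixed-precision scheme.
      double saving_bytes = n_params * bpw_saving / 8.0;

      // Temporary buffer needed to hold the largest matrix dequantized for cuBLAS GEMM.
      double tmp_f16 = (double)out_rows * (double)out_cols * 2.0; // f16
      double tmp_f32 = (double)out_rows * (double)out_cols * 4.0; // f32

      printf("model-wide saving at %.2f bpw: %8.1f MiB\n", bpw_saving, saving_bytes / (1024.0 * 1024.0));
      printf("dequant buffer (f16)         : %8.1f MiB\n", tmp_f16 / (1024.0 * 1024.0));
      printf("dequant buffer (f32)         : %8.1f MiB\n", tmp_f32 / (1024.0 * 1024.0));
      return 0;
  }
  ```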
- Thank you for this clear explanation. I wonder how much we could still gain with a "Q2_K_S/M/L" system and the additional granularity allowed within the constraints of the MMQ implementation, and whether it would be worth it in terms of perplexity growth compared to the 33/34B models. Shallow note: I like the ExLlamaV2 notation, though, because the llama.cpp Q2_K is actually closer to a "Q3_XS".
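  On that naming point, a quick sketch of the effective bits per weight of a pure Q2_K tensor, derived from the block_q2_K layout in ggml (per 256-weight super-block: 16 bytes of packed 4-bit scales and mins, 64 bytes of 2-bit quants, and two f16 super-block scales). The layout is quoted from memory, so treat the result as approximate; a whole-model "Q2_K" quantization also keeps some tensors at higher precision, which pushes the model-level bpw higher still.

  ```c
  // Hedged sketch: effective bpw of a pure Q2_K tensor from the block layout.
  #include <stdio.h>

  int main(void) {
      const int QK_K        = 256;        // weights per super-block
      const int scales_b    = QK_K / 16;  // 16 bytes: 4-bit scale + 4-bit min per 16 weights
      const int qs_b        = QK_K / 4;   // 64 bytes: 2-bit quants
      const int d_dmin_b    = 2 + 2;      // two f16 super-block scales
      const int block_bytes = scales_b + qs_b + d_dmin_b;

      printf("Q2_K block: %d bytes per %d weights -> %.4f bpw\n",
             block_bytes, QK_K, 8.0 * block_bytes / QK_K);
      // Note: the whole-model Q2_K preset mixes in higher-bit tensors
      // (e.g. for attention V), so the file-level bpw is noticeably higher.
      return 0;
  }
  ```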
- ExLlamaV2 can now load 70B models on a single RTX 3090/4090. Tested successfully on my side in Ooba with a "Q_2.55bpw_K" quant and 2048 ctx.
  I'd love to see such a thing in llama.cpp, especially considering the experience already gained with the current K-quants about the relative importance of each weight for the perplexity gained or lost through its quantization.
  Also, in combination with the KV-Q8_0 work of Johannes Gaessler, such a llama.cpp would be quite the milestone for casual users!
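  For a rough sense of what quantizing the KV cache to Q8_0 would buy at that context length, here is a hedged estimate. The attention dimensions are assumptions (a Llama-2-70B-like setup with 80 layers and 8 KV heads of dimension 128 via GQA), and Q8_0 is taken as 34 bytes per 32 values (32 int8 quants plus one f16 scale); these are not measurements from the work mentioned above.

  ```c
  // Hedged estimate of KV-cache VRAM at f16 vs Q8_0, assumed 70B-like dims, 2048 ctx.
  #include <stdio.h>

  int main(void) {
      const long long n_layer  = 80;        // assumed layer count
      const long long n_ctx    = 2048;      // context length
      const long long n_emb_kv = 8 * 128;   // assumed KV heads * head dim (GQA)

      const long long n_elems  = 2 * n_layer * n_ctx * n_emb_kv; // K and V

      double f16_bytes  = (double)n_elems * 2.0;
      double q8_0_bytes = (double)n_elems * 34.0 / 32.0; // 8.5 bits per value

      printf("KV cache f16 : %7.1f MiB\n", f16_bytes  / (1024.0 * 1024.0));
      printf("KV cache Q8_0: %7.1f MiB\n", q8_0_bytes / (1024.0 * 1024.0));
      return 0;
  }
  ```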