- I have also opened a PR about how Mistral quantization is potentially lacking an optimization for GQA.
- Layer skipping was an interesting experiment, but the results were considered inconclusive. Having a more consistent overall datapoint on how the model changes in response to layer skipping would help in resuming it. In fact, it might be interesting to see which token probabilities change the most versus the least under layer skipping, measured by KL divergence across a wide range of texts; that could be a step towards interpretability for the hidden layers. A rough sketch of that comparison is below.
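A minimal sketch of what that per-token KL comparison could look like, assuming the full-precision and layer-skipped logits over the same text have already been dumped somehow (the `.npy` file names here are placeholders, not anything llama.cpp produces):

```python
import numpy as np

def log_softmax(logits):
    """Convert raw logits to log-probabilities, row by row."""
    x = logits - logits.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def per_token_kl(baseline_logits, skipped_logits):
    """KL(baseline || skipped) at every token position."""
    log_p = log_softmax(baseline_logits)   # reference run (all layers)
    log_q = log_softmax(skipped_logits)    # run with a layer skipped
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Placeholder dumps: one [n_tokens, n_vocab] logit matrix per run over the same text.
baseline = np.load("logits_full.npy")
skipped = np.load("logits_skip_layer.npy")

kl = per_token_kl(baseline, skipped)
order = np.argsort(kl)
print("least affected positions:", order[:5], kl[order[:5]])
print("most affected positions:", order[-5:], kl[order[-5:]])
```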
- In your scaled divergence chart, if you used the Q2_K model as the reference, wouldn't the full-quality model show up with the highest divergence? So it only tells you that there is a difference, not whether the difference is good or bad. Perplexity, on the other hand, tells you how accurately the model predicted some standard text (sketched below). By the way, did you already see #2875?
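For reference, a minimal sketch of the perplexity calculation itself, assuming the per-token log-probabilities over a reference text are already available (the file name is a placeholder):

```python
import numpy as np

# Placeholder dump: the log-probability the model assigned to each actual
# next token of a reference text (one value per token position).
token_logprobs = np.load("logprobs_wikitext.npy")

# Perplexity is the exponential of the average negative log-likelihood;
# lower means the model predicted the text more accurately.
perplexity = np.exp(-token_logprobs.mean())
print(f"perplexity: {perplexity:.3f}")
```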
- Perplexity is a very rough measurement of how much quantization actually changes the final output of the model.
  I propose a metric that compares the changes in the output token probabilities, since the similarity there seems to correlate directly with perceived quantization loss.
  This could also be a useful metric for tuning k-quant configurations. In any case, it seems much more reliable to take something like the top 5 tokens and compare their probabilities; a rough sketch of that comparison is below.
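A minimal sketch of the proposed top-5 comparison, assuming logits from a reference run (e.g. fp16) and a quantized run over identical input have already been dumped; the file names and the simple mean-absolute-difference summary are placeholders for illustration:

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to probabilities, row by row."""
    x = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return x / x.sum(axis=-1, keepdims=True)

# Placeholder dumps: [n_tokens, n_vocab] logits over the same input.
ref_probs = softmax(np.load("logits_fp16.npy"))
quant_probs = softmax(np.load("logits_q4_k.npy"))

k = 5
# Indices of the reference model's top-k tokens at every position.
topk = np.argsort(ref_probs, axis=-1)[:, -k:]

rows = np.arange(ref_probs.shape[0])[:, None]
p_ref = ref_probs[rows, topk]       # reference probabilities of those tokens
p_quant = quant_probs[rows, topk]   # what the quantized model assigns them

# Mean absolute probability shift on the top-k tokens, averaged over all positions;
# 0 would mean the quantized model matches the reference exactly on the top-k.
top_k_shift = np.abs(p_ref - p_quant).mean()
print(f"mean top-{k} probability shift: {top_k_shift:.4f}")
```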