The Llama 3.1 paper notes that naive FP8 quantization performs worse on reward-score evaluations than their BF16 implementation, and describes some specific approaches they developed for their FP8 quantization (of the 405B model).
I'm wondering whether it's worth testing these to see if they would enhance GGUF's Q8_0 or even the Q5/Q6 quants. They would of course entail a corresponding increase in file size/VRAM usage, but it would be interesting to see whether a "Q8_enhanced" is worth the increase. I didn't see another discussion of this, so I figure it's worth enumerating the things that might be worth trying.
The methods called out in the paper are:
Leave the first and last layers unquantized in BF16 (or perhaps FP16 or Q8 if BF16 is not supported in GGUF)
Upper bound the dynamic scaling factors to 1200 (not super sure how that would translate for us)
Use row-wise quantization, computing scaling factors across rows of the parameter and activation matrices (I need to catch up on how we're creating blocks for the Q quants ... this may not make sense for us). A rough sketch combining these ideas follows the list.
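For concreteness, here is a rough NumPy sketch (not llama.cpp code, and not GGUF's actual block layout, which uses a scale per block of 32 values) of what the first and third ideas might look like together: one clamped per-row scale instead of per-block scales, with the first and last transformer blocks left unquantized. The int8 target, the tensor-name check, and the helper names are my assumptions for illustration; the paper does this in FP8.

```python
import numpy as np

SCALE_UPPER_BOUND = 1200.0  # the paper's cap on dynamic scaling factors

def quantize_rowwise_int8(weights: np.ndarray):
    """Quantize a 2-D weight matrix with one clamped scale per row."""
    absmax = np.abs(weights).max(axis=1, keepdims=True)        # per-row max magnitude
    scale = np.minimum(127.0 / np.maximum(absmax, 1e-12),      # map each row into int8 range
                       SCALE_UPPER_BOUND)                      # upper-bound the scale
    q = np.clip(np.round(weights * scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_rowwise_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) / scale

def quantize_model(tensors: dict[str, np.ndarray], n_blocks: int) -> dict:
    """Leave the first and last transformer blocks unquantized; quantize the rest row-wise."""
    out = {}
    for name, w in tensors.items():
        if name.startswith("blk.0.") or name.startswith(f"blk.{n_blocks - 1}."):
            out[name] = ("bf16", w)                   # first/last layers kept unquantized
        elif w.ndim == 2:
            out[name] = ("q8_row", quantize_rowwise_int8(w))
        else:
            out[name] = ("f32", w)                    # 1-D tensors (norms, biases) left alone
    return out
```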
I assume we'd see improvements (lower KL divergence against the BF16 reference) with these methods, and keeping the first and last layers in BF16 can't be too big a size hit for 3.1-8B. I also wonder whether keeping the first and last layers in Q8 would significantly improve the Q4/Q5 quants at not too big a cost.
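To make "improve" measurable, here is a minimal sketch of the comparison (just the math, assuming you've dumped logits for the same token stream from both the BF16 reference and the candidate quant; I believe llama.cpp's perplexity tool can also report KL divergence directly on real runs):

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    # numerically stable log-softmax over the vocab dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(P_ref || P_quant) per position; both arrays are (n_tokens, n_vocab). Lower is better."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())
```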