The Llama 3.1 paper notes that naive FP8 quantization performs worse on reward-score evaluations than their BF16 implementation, and describes some specific approaches they developed for their FP8 quantization (of the 405B model).
I'm wondering whether it's worth testing these to see if they would enhance GGUF's Q8_0 or even the Q5/Q6 quants. They would of course entail a corresponding increase in file size/VRAM usage, but it would be interesting to see whether a "Q8_enhanced" is worth the increase. I didn't see another discussion of this, so I figure it's worth enumerating the things that might be worth trying.
The methods called out in the paper are:
Leave the first and last layers unquantized in BF16 (or perhaps FP16 or Q8 if BF16 is not supported in GGUF)
Upper bound the dynamic scaling factors to 1200 (not super sure how that would translate for us)
Use row-wise quantization, computing scaling factors across rows of the parameter and activation matrices (I need to catch up on how we're creating blocks for the Q quants ... this may not make sense for us). A rough sketch combining these ideas follows the list.
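For concreteness, here is a rough NumPy sketch (not llama.cpp code, and not GGUF's actual block layout, which uses a scale per block of 32 values) of what the first and third ideas might look like together: one clamped per-row scale instead of per-block scales, with the first and last transformer blocks left unquantized. The int8 target, the tensor-name check, and the helper names are my assumptions for illustration; the paper does this in FP8.

```python
import numpy as np

SCALE_UPPER_BOUND = 1200.0  # the paper's cap on dynamic scaling factors

def quantize_rowwise_int8(weights: np.ndarray):
    """Quantize a 2-D weight matrix with one clamped scale per row."""
    absmax = np.abs(weights).max(axis=1, keepdims=True)        # per-row max magnitude
    scale = np.minimum(127.0 / np.maximum(absmax, 1e-12),      # map each row into int8 range
                       SCALE_UPPER_BOUND)                      # upper-bound the scale
    q = np.clip(np.round(weights * scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_rowwise_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) / scale

def quantize_model(tensors: dict[str, np.ndarray], n_blocks: int) -> dict:
    """Leave the first and last transformer blocks unquantized; quantize the rest row-wise."""
    out = {}
    for name, w in tensors.items():
        if name.startswith("blk.0.") or name.startswith(f"blk.{n_blocks - 1}."):
            out[name] = ("bf16", w)                   # first/last layers kept unquantized
        elif w.ndim == 2:
            out[name] = ("q8_row", quantize_rowwise_int8(w))
        else:
            out[name] = ("f32", w)                    # 1-D tensors (norms, biases) left alone
    return out
```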
I assume we'd see improvements (lower KL divergence against the BF16 reference) with these methods, and keeping the first and last layers in BF16 can't be too big a size hit for 3.1-8B. I also wonder whether keeping the first and last layers in Q8 would significantly improve the Q4/Q5 quants at not too big a cost.
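To make "improve" measurable, here is a minimal sketch of the comparison (just the math, assuming you've dumped logits for the same token stream from both the BF16 reference and the candidate quant; I believe llama.cpp's perplexity tool can also report KL divergence directly on real runs):

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    # numerically stable log-softmax over the vocab dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(P_ref || P_quant) per position; both arrays are (n_tokens, n_vocab). Lower is better."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())
```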