I've observed that llama.cpp has a subtle difference between its q4 and q8 quantization implementations.
quantize_row_q4_0_reference (q5_0 behaves similarly):
https://github.com/ggerganov/llama.cpp/blob/9f6ede19f3cfa50d4a51a5babb056c3f8a450b80/ggml.c#L912-L922
And quantize_row_q8_0_reference
https://github.com/ggerganov/llama.cpp/blob/9f6ede19f3cfa50d4a51a5babb056c3f8a450b80/ggml.c#L1084-L1092
Firstly, they compute d differently:
const float d = max / -8;
const float d = amax / ((1 << 7) - 1);
① The former gives max and d opposite signs, while the latter ensures d is non-negative. ② Furthermore, the latter includes a -1, which makes the divisor 127 rather than 128.
Secondly, q4 has:
const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
③ which does not use roundf() as q8 does:
y[i].qs[j] = roundf(x0);
Specifically, I have a guess about ②. I think the absence of -1 in quantize_row_q4_0_reference maps the values into [-8, 8], a wider range than [-7, 7], which may help accuracy.
But for ① and ③, I have no idea. I would be grateful for some guidance! : )