Different behaviors between ‘quantize_row_q4_0_reference’ and ‘quantize_row_q8_0_reference’? #3577

rikoras · 2023-10-11T07:44:25Z

I've noticed some subtle differences between llama.cpp's q4 and q8 quantization implementations.

quantize_row_q4_0_reference (q5_0 behaves similarly):
https://github.com/ggerganov/llama.cpp/blob/9f6ede19f3cfa50d4a51a5babb056c3f8a450b80/ggml.c#L912-L922

And quantize_row_q8_0_reference:
https://github.com/ggerganov/llama.cpp/blob/9f6ede19f3cfa50d4a51a5babb056c3f8a450b80/ggml.c#L1084-L1092

Firstly, they have different calculations for d:
const float d = max / -8;
const float d = amax / ((1 << 7) - 1);

① The former gives max and d opposite signs, while the latter ensures d is non-negative. ② Furthermore, the latter includes a -1, which makes the divisor 127 rather than 128.
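To make the two scale computations concrete, here is a minimal sketch modeled on the linked lines. The helper names q4_0_scale and q8_0_scale are hypothetical (they are not functions in ggml.c), and the block length is passed as a parameter instead of being a fixed constant.

```c
#include <math.h>

/* Hypothetical helper, not part of ggml.c: q4_0-style scale for one block.
 * Tracks the element with the largest magnitude but keeps its sign,
 * then divides by -8, so d has the opposite sign of that element. */
static float q4_0_scale(const float *x, int n) {
    float amax = 0.0f; /* largest absolute value seen so far */
    float max  = 0.0f; /* the signed value that produced amax */
    for (int j = 0; j < n; j++) {
        const float v = x[j];
        if (amax < fabsf(v)) {
            amax = fabsf(v);
            max  = v;
        }
    }
    return max / -8;
}

/* Hypothetical helper, not part of ggml.c: q8_0-style scale for one block.
 * Only the magnitude is used, so d is never negative; the divisor is 127. */
static float q8_0_scale(const float *x, int n) {
    float amax = 0.0f; /* largest absolute value in the block */
    for (int j = 0; j < n; j++) {
        const float v = fabsf(x[j]);
        if (amax < v) {
            amax = v;
        }
    }
    return amax / ((1 << 7) - 1);
}
```

For the block {1, -4, 2}, q4_0_scale returns 0.5 (max is -4, so d comes out positive), while for {1, 4, 2} it returns -0.5. q8_0_scale returns 4/127 for both blocks.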

Secondly, q4_0 has:
const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
③ This does not use roundf(), as q8_0 does:
y[i].qs[j] = roundf(x0);

Specifically, I have a guess about ②. I think the absence of the -1 in quantize_row_q4_0_reference lets the values map into [-8, 8], a wider range than [-7, 7], which may help accuracy.

But for ① and ③, I have no idea. I would be grateful for some guidance! :)
