Q4_K Quantization Scheme adaptation #6760

wilderfield · 2024-04-19T07:49:06Z

wilderfield
Apr 19, 2024

So I see that to dequantize a weight in Q4_K format I have to do:

y = s * q - m

y is the dequantized weight (float)
s is the scale (float)
q is the quantized weight (int4)
m is the zero point offset (float)

Now the challenge is that I have hardware that expects the following scheme:

y = s * (q - z)

The difference from above is that z is the zero point (int4)

This is also the scheme that pytorch uses: https://pytorch.org/blog/quantization-in-practice/

I want to use the parameters from llama.cpp with my hardware, so I try to do some math...

Set the two equations equal to each other, and solve for z.

I get z = round(m/s)

When I simulate this adaptation, I get catastrophic accuracy loss, even without hardware involved.

Is there something fundamentally wrong with this math?

Is it not possible to reconcile these two quantization schemes?

wilderfield · 2024-04-19T15:49:12Z

wilderfield
Apr 19, 2024
Author

@ikawrakow I’m wondering if you might be able to help me with this math/quantization question. If you have more important things to do I completely understand.

3 replies

ikawrakow Apr 19, 2024

If you want to use y = s * (q - z) where q and z are both int4, you are basically looking at something similar to a Q4_0 quantization (being exactly Q4_0 if z = 8). The whole point of Q4_K is that the offset from zero being used has a better precision. If you want to still try with Q4_K, you need to scale the quants up (hopefully your hardware can operate efficiently on int8_t's). I.e.,

y = s * q - m = s * (q - m/s) = s/8 * (8*q - 8*m/s)
=> use float scale s' = s/8 
=> compute y = s' ((q << 3) - z), where z = round(8*m/s)

This will work most of the time, but you need to be careful with overflow of 8*m/s (the q's are in 0...15, so 8*q is in the allowed range of a signed 8-bit integer, so no need to worry about that). In the normal situation where model weights are distributed (nearly) symmetrically around zero, we will have s ~ 2*m/15, so 8*m/s ~ 60. But occasionally blocks of model weights are highly asymmetric, and you need to be careful in those cases. E.g., consider weights for the block being in -1...-0.5. In that case m = 1, s = 0.5/15, so 8*m/s = 8/(0.5/15) = 240, so int8_t no longer works.

wilderfield Apr 19, 2024
Author

Thank you this is very helpful feedback. I'll start investigating Q4_0.

wilderfield Jul 5, 2024
Author

FYI, this is very late, but I was able to get Q4_0 working on my hardware. Thank you for the help above!!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Q4_K Quantization Scheme adaptation #6760

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Q4_K Quantization Scheme adaptation #6760

Uh oh!

Uh oh!

wilderfield Apr 19, 2024

Replies: 1 comment · 3 replies

Uh oh!

wilderfield Apr 19, 2024 Author

Uh oh!

ikawrakow Apr 19, 2024

Uh oh!

wilderfield Apr 19, 2024 Author

Uh oh!

wilderfield Jul 5, 2024 Author

wilderfield
Apr 19, 2024

Replies: 1 comment 3 replies

wilderfield
Apr 19, 2024
Author

wilderfield Apr 19, 2024
Author

wilderfield Jul 5, 2024
Author