Replies: 1 comment
-
This isn't a full answer, but hopefully it will still help you. The quantized data isn't just a list of 3 or 4 bit numbers. The items are divided into chunks, and each chunk contains some additional data, like the scale of the items in that chunk. Without the scales, representing those items with only 3, 4, etc. bits would be much less accurate. To put it a different way, keeping track of the range of numbers that exist in a chunk means those 3 or 4 bit values can be relative to the range of values in that chunk, rather than relative to the range of values it's possible to express with a 16 bit float. Generally speaking, this approach applies to all the quantizations, not just the 3/4 bit ones. You can look at https://github.com/ggerganov/llama.cpp/blob/master/k_quants.h to see the definitions for the blocks if you're able to read a little C.
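To make the chunk-plus-scale idea concrete, here's a minimal, self-contained sketch in the spirit of llama.cpp's 4 bit block quantization. It is not the actual llama.cpp code: the names (`block_q4`, `quantize_block`, `dequantize_block`, `BLOCK_SIZE`) are made up for illustration, and it stores the scale as a plain `float` rather than fp16 to keep the example simple; the real block definitions live in k_quants.h and ggml.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define BLOCK_SIZE 32  /* illustrative block size; not taken from the real headers */

/* One quantized block: a per-block scale plus 32 4-bit values packed
   two per byte. This is a simplified stand-in for the real structs. */
typedef struct {
    float   d;                    /* per-block scale */
    uint8_t qs[BLOCK_SIZE / 2];   /* 4-bit quants, two per byte */
} block_q4;

/* Quantize 32 floats into one block: find the element with the largest
   magnitude, derive a scale so that element maps to -8, then round every
   element to a 4-bit code in [0, 15] (stored with an offset of 8). */
static void quantize_block(const float *x, block_q4 *b) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    b->d = max / -8.0f;
    const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < BLOCK_SIZE / 2; i++) {
        int q0 = (int)(x[2*i + 0] * id + 8.5f);
        int q1 = (int)(x[2*i + 1] * id + 8.5f);
        if (q0 > 15) q0 = 15; if (q0 < 0) q0 = 0;
        if (q1 > 15) q1 = 15; if (q1 < 0) q1 = 0;
        b->qs[i] = (uint8_t)(q0 | (q1 << 4));
    }
}

/* Dequantize: each 4-bit code is shifted back by 8 and multiplied by the
   block's scale, so the result is relative to this block's own range. */
static void dequantize_block(const block_q4 *b, float *y) {
    for (int i = 0; i < BLOCK_SIZE / 2; i++) {
        y[2*i + 0] = ((int)(b->qs[i] & 0x0F) - 8) * b->d;
        y[2*i + 1] = ((int)(b->qs[i] >> 4)   - 8) * b->d;
    }
}

int main(void) {
    float x[BLOCK_SIZE], y[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) x[i] = sinf((float)i);

    block_q4 b;
    quantize_block(x, &b);
    dequantize_block(&b, y);

    for (int i = 0; i < 4; i++)
        printf("x[%d] = %+.4f  ->  %+.4f\n", i, x[i], y[i]);
    return 0;
}
```

The point to notice is that the 4 bit codes only need to cover the range of this one block (scaled to roughly [-8, 7]), not the full range of a 16 bit float, which is what makes such aggressive quantization usable in practice.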
-
Hi Team,
Thank you for your great contribution.
I have some questions about 3/4 bit quantized inference; it would be great if you could help me understand.
My questions are:
1.) Does inference for 3/4 bit models actually happen in 3/4 bit, or is the data cast to FP16 during inference on CPU? In other words, is there a CPU kernel that runs inference on 3/4 bit data?
2.) Which quantization technique is being used?
Thanks