What are the types of quantization scheme used in llama.cpp #9547

mrtpk · 2024-09-19T06:43:00Z

Hello Team,

I have been learning about transformer quantization and am particularly interested in full-integer 8-bit quantization. Which quantization schemes (such as GPTQ, AWQ, or SmoothQuant) does llama.cpp support for full-integer quantization? For context, TFLite uses symmetric int8 quantization for calibration and inference.

Or does llama.cpp support any quantization type as long as the model is stored in the GGUF or GGML format? I really appreciate any help or pointers. Thanks.
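For readers unfamiliar with the term, the symmetric int8 quantization mentioned in the question can be sketched as follows. This is a generic per-tensor illustration, not TFLite's or llama.cpp's actual implementation; the function names and the choice of the range [-127, 127] are assumptions made for the example.

```python
def quantize_symmetric_int8(values):
    """Map floats to int8 codes using one scale and a zero-point fixed at 0."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero input so we never divide by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_symmetric_int8(weights)
restored = dequantize(codes, scale)
```

Because the zero-point is fixed at 0, the float value 0.0 always maps to the integer 0, and the round-trip error of each element is at most half of `scale`.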

BarfingLemurs · 2024-09-19T21:57:45Z

https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization

A breakdown of K-quantization is in the guide above; more information about the K-quants is in the pull requests.

mrtpk Sep 26, 2024
Author

Thank you, @BarfingLemurs, for taking the time to respond. It took me a couple of days to digest all of this.

Please correct me if I'm wrong. My understanding is that llama.cpp supports K-bit quantization, and that the quantization is done by calculating an importance matrix (which can be a Hessian matrix if the method is GPTQ). We can apply any quantization scheme, such as AWQ, and use the result in llama.cpp as long as we respect the container format - GGUF or GGML.

One question that I am still trying to figure out is the following:

When the matrix multiplication is done, is the quantized value (which can be bfloat16 or int8) converted to float32 for the calculation? Or does llama.cpp follow a different calculation mechanism?

From my digging, the quantized value is converted into float32 for calculations. See ggml_vec_dot_q8_0_q8_0.

I'm looking forward to your response.

Thank you.
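To make the question about int8 versus float32 arithmetic concrete, here is a minimal sketch of a block-wise int8 dot product in the style of Q8_0. It is an illustration under stated assumptions, not the ggml source: the helper names are hypothetical, the block size of 32 and the `amax / 127` scale follow the scalar reference code as far as I understand it, the scale is kept as a Python float rather than fp16, and the SIMD paths in ggml are organized differently. In this sketch the element products are accumulated as integers inside each block, and only the per-block sum is multiplied by the two float scales.

```python
QK = 32  # block size (assumed to match Q8_0's block length)


def quantize_blocks(values):
    """Split values into blocks of QK and store (scale, int8 codes) per block."""
    assert len(values) % QK == 0, "length must be a multiple of the block size"
    blocks = []
    for start in range(0, len(values), QK):
        chunk = values[start:start + QK]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                 # per-block scale
        inv = 1.0 / d if d else 0.0      # avoid division by zero for all-zero blocks
        codes = [round(v * inv) for v in chunk]
        blocks.append((d, codes))
    return blocks


def dot_blocks(x_blocks, y_blocks):
    """Dot product of two block-quantized vectors."""
    total = 0.0
    for (dx, qx), (dy, qy) in zip(x_blocks, y_blocks):
        # Integer-only multiply-accumulate within the block.
        int_sum = sum(a * b for a, b in zip(qx, qy))
        # A single float multiply per block applies both scales.
        total += int_sum * (dx * dy)
    return total


x = [((i % 7) - 3) / 4.0 for i in range(64)]
y = [((i % 5) - 2) / 8.0 for i in range(64)]
approx = dot_blocks(quantize_blocks(x), quantize_blocks(y))
exact = sum(a * b for a, b in zip(x, y))
```

Comparing `approx` with `exact` shows how close the quantized dot product stays to the float result for this toy input. The design point is that the floating-point work is one multiply per block rather than one per element.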
