Looking at the ggml code:
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L1675

Why do all the quantized dot products encode working (non-model) tensors into q8 first? Isn't this more work and memory bandwidth than just using the F16 or F32 values directly?

Thanks!

Replies: 1 comment
These tensors are usually very small, and the cost of quantizing them is low compared to the overall cost of the matrix multiplication. For example, during generation with llama 7B, most matrix multiplications apply a large weight matrix to a single activation vector (on the order of 4096×4096 times 4096×1), so the tensor being quantized is tiny relative to the weights.
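To make the trade-off concrete, here is a minimal sketch of the pattern the question is asking about, under illustrative assumptions: a 32-element block layout, and simplified stand-in names (`block_q8`, `quantize_row_q8`, `vec_dot_q8`) rather than ggml's actual structs and functions. For simplicity both sides are q8 here; in ggml the weight side is typically a lower-bit format (e.g. q4) with a compatible block layout. The key point is that the quantization pass touches the activation row once, O(n), while the dot-product inner loop runs in the integer domain where cheap SIMD int8 instructions apply.

```c
// Sketch only -- not ggml's real implementation or API.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK 32  // assumed block size for this sketch

typedef struct {
    float  scale;     // per-block scale factor
    int8_t q[BLOCK];  // quantized values
} block_q8;

// One O(n) pass over the row: find each block's absmax, scale into int8.
static void quantize_row_q8(const float *x, block_q8 *out, int n) {
    for (int b = 0; b < n / BLOCK; b++) {
        float amax = 0.0f;
        for (int i = 0; i < BLOCK; i++) {
            float v = fabsf(x[b * BLOCK + i]);
            if (v > amax) amax = v;
        }
        float scale = amax / 127.0f;
        float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;
        for (int i = 0; i < BLOCK; i++) {
            out[b].q[i] = (int8_t)roundf(x[b * BLOCK + i] * inv);
        }
    }
}

// Dot product in the integer domain: int8*int8 accumulation per block,
// with one float multiply per block to apply the two scales.
static float vec_dot_q8(const block_q8 *a, const block_q8 *b, int n) {
    float sum = 0.0f;
    for (int blk = 0; blk < n / BLOCK; blk++) {
        int32_t acc = 0;
        for (int i = 0; i < BLOCK; i++) {
            acc += (int32_t)a[blk].q[i] * (int32_t)b[blk].q[i];
        }
        sum += (float)acc * a[blk].scale * b[blk].scale;
    }
    return sum;
}

int main(void) {
    enum { N = 64 };
    float a[N], b[N];
    for (int i = 0; i < N; i++) {
        a[i] = sinf(0.1f * i);
        b[i] = cosf(0.05f * i);
    }

    block_q8 qa[N / BLOCK], qb[N / BLOCK];
    quantize_row_q8(a, qa, N);  // done once per activation row
    quantize_row_q8(b, qb, N);

    float ref = 0.0f;
    for (int i = 0; i < N; i++) ref += a[i] * b[i];

    printf("f32 dot: %f  q8 dot: %f\n", ref, vec_dot_q8(qa, qb, N));
    return 0;
}
```

In a full matrix multiplication, `quantize_row_q8` runs once per activation row while `vec_dot_q8` runs once per weight row, so the O(n) quantization pass is amortized over O(n·m) dot-product work. That is why the cost is negligible in the single-vector generation case described above.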