No quantization for activations? #3349

Answered by ggerganov
Z-KN asked this question in Q&A
Sep 27, 2023 · 4 comments · 4 replies

@KerfuffleV2
The term "activations" refers to the intermediate results obtained during the evaluation of the transformer. It does not mean the 1D tensors in the model. This is terminology that I also learned only recently.

The activations in ggml are generally quantized when:

- running on the CPU
- running with CUDA

They are not quantized yet with Metal.

On the CPU, even though src1 has type F32, it is still quantized internally inside the matrix multiplication call:

https://github.com/ggerganov/llama.cpp/blob/a40f2b656fab364ce0aff98dbefe9bd9c3721cc9/ggml.c#L11333-L11349

The activations are always quantized to 8 bits (see .vec_dot_type):

https://github.com/ggerganov/llama.cpp/blob/a40f2b656fab364…
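
To make the mechanism concrete, here is a minimal standalone sketch of the idea, not the actual ggml code: `block_q8`, `quantize_row_q8`, and `vec_dot_q8` are simplified stand-ins for ggml's Q8_0 block type, its row-quantization routine, and the `.vec_dot` kernels, assuming a Q8_0-like format (blocks of 32 int8 values with one scale each).

```c
// Standalone illustration of how an F32 activation row can be quantized
// to an 8-bit block format (analogous to ggml's Q8_0: blocks of 32
// values, one scale per block) before the dot product with the weights.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK8 32 /* values per block, matching ggml's QK8_0 */

typedef struct {
    float  d;        // per-block scale
    int8_t qs[QK8];  // quantized values in [-127, 127]
} block_q8;

// Quantize k floats (k must be a multiple of QK8) into 8-bit blocks.
static void quantize_row_q8(const float *x, block_q8 *y, int k) {
    for (int i = 0; i < k / QK8; i++) {
        float amax = 0.0f; // largest magnitude in this block
        for (int j = 0; j < QK8; j++) {
            const float v = fabsf(x[i * QK8 + j]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK8; j++) {
            y[i].qs[j] = (int8_t) roundf(x[i * QK8 + j] * id);
        }
    }
}

// Dot product of two quantized rows: integer multiply-accumulate inside
// each block, then rescale by the product of the two block scales.
static float vec_dot_q8(const block_q8 *x, const block_q8 *y, int k) {
    float sum = 0.0f;
    for (int i = 0; i < k / QK8; i++) {
        int32_t isum = 0;
        for (int j = 0; j < QK8; j++) {
            isum += (int32_t) x[i].qs[j] * (int32_t) y[i].qs[j];
        }
        sum += x[i].d * y[i].d * (float) isum;
    }
    return sum;
}

int main(void) {
    float a[QK8], b[QK8], exact = 0.0f;
    for (int j = 0; j < QK8; j++) {
        a[j] = 0.1f * (float)(j - 16); // stand-in activation row
        b[j] = 0.05f * (float) j;      // stand-in weight row
        exact += a[j] * b[j];
    }
    block_q8 qa, qb;
    quantize_row_q8(a, &qa, QK8);
    quantize_row_q8(b, &qb, QK8);
    printf("f32 dot = %f, quantized dot = %f\n",
           (double) exact, (double) vec_dot_q8(&qa, &qb, QK8));
    return 0;
}
```

Roughly, the benefit of quantizing src1 like this is that the inner loop can run on int8 multiply-accumulates against the already-quantized weights, instead of dequantizing the weights back to F32 for every element.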

Answer selected by Z-KN