-
Quantization is mostly used to reduce the size of the weights that need to be multiplied (matrix multiplication).
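Roughly, each block of weights is stored as a shared scale plus small integers, and the matmul kernel consumes those small integers directly instead of full fp32 values. Below is a minimal standalone sketch of that idea (32 values per block plus one scale, loosely following ggml's Q8_0 layout); it is illustrative only, not the actual ggml structs or kernels.

```c
// Minimal standalone sketch of block quantization for matmul weights.
// Loosely inspired by ggml's Q8_0 idea (32 values per block + one scale);
// NOT the actual ggml structs or kernels.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK 32  // values per block

typedef struct {
    float  d;      // per-block scale (ggml stores this as fp16, so a real block is even smaller)
    int8_t q[QK];  // quantized values: roughly x / d, rounded
} block_q8;

// Quantize one block of 32 floats: 128 bytes of fp32 -> 36 bytes here.
static void quantize_block(const float *x, block_q8 *b) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    b->d = amax / 127.0f;
    float id = b->d ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < QK; i++) {
        b->q[i] = (int8_t) roundf(x[i] * id);
    }
}

// Dot product of a quantized weight block with a float activation block:
// the weights stay in int8, only the final sum is scaled back to float.
static float vec_dot_block(const block_q8 *w, const float *act) {
    float sum = 0.0f;
    for (int i = 0; i < QK; i++) {
        sum += (float) w->q[i] * act[i];
    }
    return sum * w->d;
}

int main(void) {
    float w[QK], act[QK];
    for (int i = 0; i < QK; i++) { w[i] = 0.01f * i; act[i] = 1.0f; }

    block_q8 b;
    quantize_block(w, &b);
    printf("quantized dot = %f\n", vec_dot_block(&b, act));
    return 0;
}
```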
-
In the basic GGUF format, are all the weights and activations processed as fp16, or are there separate operators for int8 and other formats? I mean operators like convolution, fully connected layers, and the FFN.
What I understood from ggml is that quantization is for storage purposes only, and that during inference the qX weights are converted back to fp16 or fp32, roughly like the sketch below. Is my understanding wrong?
Also, if I want to add int8-specific operators, how would I go about that? Can anyone help me?
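To make the question concrete, here is roughly what I think the current path looks like versus what I would like to add; the names are hypothetical, not real ggml functions:

```c
// Rough sketch of what I currently think happens at inference time:
// the quantized block is expanded back to fp32 and then a plain float
// kernel runs. Hypothetical names, not the actual ggml API.
#include <stdint.h>

#define QK 32

typedef struct {
    float  d;      // per-block scale
    int8_t q[QK];  // stored int8 weights
} block_q8;

// Step 1: dequantize the stored int8 block back to fp32.
static void dequantize_block(const block_q8 *b, float *out) {
    for (int i = 0; i < QK; i++) out[i] = b->d * (float) b->q[i];
}

// Step 2: ordinary fp32 dot product on the dequantized weights.
static float dot_f32(const float *w, const float *x) {
    float s = 0.0f;
    for (int i = 0; i < QK; i++) s += w[i] * x[i];
    return s;
}

// What I would like instead is an operator that keeps the weights (and
// ideally the activations) in int8 end to end, only rescaling the result.
```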