-
Quantization is mostly used to reduce the size of the weights that need to be multiplied (matrix multiplication).
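Roughly, each block of weights is stored as a shared scale plus small integers, and the matmul kernel consumes those small integers directly instead of full fp32 values. Below is a minimal standalone sketch of that idea (32 values per block plus one scale, loosely following ggml's Q8_0 layout); it is illustrative only, not the actual ggml structs or kernels.

```c
// Minimal standalone sketch of block quantization for matmul weights.
// Loosely inspired by ggml's Q8_0 idea (32 values per block + one scale);
// NOT the actual ggml structs or kernels.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK 32  // values per block

typedef struct {
    float  d;      // per-block scale (ggml stores this as fp16, so a real block is even smaller)
    int8_t q[QK];  // quantized values: roughly x / d, rounded
} block_q8;

// Quantize one block of 32 floats: 128 bytes of fp32 -> 36 bytes here.
static void quantize_block(const float *x, block_q8 *b) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    b->d = amax / 127.0f;
    float id = b->d ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < QK; i++) {
        b->q[i] = (int8_t) roundf(x[i] * id);
    }
}

// Dot product of a quantized weight block with a float activation block:
// the weights stay in int8, only the final sum is scaled back to float.
static float vec_dot_block(const block_q8 *w, const float *act) {
    float sum = 0.0f;
    for (int i = 0; i < QK; i++) {
        sum += (float) w->q[i] * act[i];
    }
    return sum * w->d;
}

int main(void) {
    float w[QK], act[QK];
    for (int i = 0; i < QK; i++) { w[i] = 0.01f * i; act[i] = 1.0f; }

    block_q8 b;
    quantize_block(w, &b);
    printf("quantized dot = %f\n", vec_dot_block(&b, act));
    return 0;
}
```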
-
In the basic GGUF format, are all the weights and activations processed as fp16, or are there separate operators for int8 and other formats? I mean operators like convolution, fully connected layers, and the FFN.
What I understood from ggml is that quantization is for storage purposes only, and that during inference the qX weights are converted back to fp16 or fp32, roughly like the sketch below. Is my understanding wrong?
Also, if I want to add int8-specific operators, how would I go about that? Can anyone help me?
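To make the question concrete, here is roughly what I think the current path looks like versus what I would like to add; the names are hypothetical, not real ggml functions:

```c
// Rough sketch of what I currently think happens at inference time:
// the quantized block is expanded back to fp32 and then a plain float
// kernel runs. Hypothetical names, not the actual ggml API.
#include <stdint.h>

#define QK 32

typedef struct {
    float  d;      // per-block scale
    int8_t q[QK];  // stored int8 weights
} block_q8;

// Step 1: dequantize the stored int8 block back to fp32.
static void dequantize_block(const block_q8 *b, float *out) {
    for (int i = 0; i < QK; i++) out[i] = b->d * (float) b->q[i];
}

// Step 2: ordinary fp32 dot product on the dequantized weights.
static float dot_f32(const float *w, const float *x) {
    float s = 0.0f;
    for (int i = 0; i < QK; i++) s += w[i] * x[i];
    return s;
}

// What I would like instead is an operator that keeps the weights (and
// ideally the activations) in int8 end to end, only rescaling the result.
```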