Implicit Q8_1 quantization for matrix multiplications? #1288
-
The CUDA backend implicitly converts src1 to q8_1 if src0 is quantized. Why?
Answered by JohannesGaessler on Jul 1, 2025
-
Because in CUDA there is an instruction __dp4a for per-byte dot products, as well as tensor core instructions for int8 matrix multiplications.
Answer selected by tlemo
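For illustration, here is a minimal, self-contained sketch of the __dp4a intrinsic the answer refers to: it multiplies four packed signed 8-bit values from each operand and accumulates the products into a 32-bit integer. The kernel name `dot_i8`, the launch configuration, and the test values are assumptions made up for this demo; this is not the actual backend code.

```cuda
// Demo: dot product of two int8 vectors packed 4 values per int32 word, using __dp4a.
// Requires compute capability >= 6.1, e.g. build with: nvcc -arch=sm_61 dp4a_demo.cu
#include <cstdio>
#include <cuda_runtime.h>

// Each __dp4a performs 4 signed 8-bit multiply-adds in a single instruction.
__global__ void dot_i8(const int *a, const int *b, int n_packed, int *out) {
    int sum = 0;
    for (int i = threadIdx.x; i < n_packed; i += blockDim.x) {
        sum = __dp4a(a[i], b[i], sum);
    }
    atomicAdd(out, sum);  // crude block-wide reduction, sufficient for the demo
}

int main() {
    const int n_packed = 256;            // 1024 int8 values per vector
    int ha[n_packed], hb[n_packed];
    for (int i = 0; i < n_packed; ++i) {
        ha[i] = 0x01010101;              // four bytes, each equal to 1
        hb[i] = 0x02020202;              // four bytes, each equal to 2
    }

    int *da, *db, *dout;
    cudaMalloc(&da, sizeof(ha));
    cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);
    cudaMemset(dout, 0, sizeof(int));

    dot_i8<<<1, 128>>>(da, db, n_packed, dout);

    int result = 0;
    cudaMemcpy(&result, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dot = %d (expected %d)\n", result, 1 * 2 * 4 * n_packed);

    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```

With src1 quantized to q8_1 (8-bit values plus per-block scale data), the inner dot products of the matrix multiplication can map onto this kind of instruction, or onto int8 tensor core MMAs on newer GPUs, instead of first dequantizing src0 to floating point.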