Implicit Q8_1 quantization for matrix multiplications? #1288
-
The CUDA backend implicitly converts src1 to q8_1 if src0 is quantized. Why?
Answered by JohannesGaessler on Jul 1, 2025
-
Because in CUDA there is an instruction __dp4a for per-byte dot products, as well as tensor core instructions for int8 matrix multiplications.
Answer selected by tlemo
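For illustration, here is a minimal, self-contained sketch of the __dp4a intrinsic the answer refers to: it multiplies four packed signed 8-bit values from each operand and accumulates the products into a 32-bit integer. The kernel name `dot_i8`, the launch configuration, and the test values are assumptions made up for this demo; this is not the actual backend code.

```cuda
// Demo: dot product of two int8 vectors packed 4 values per int32 word, using __dp4a.
// Requires compute capability >= 6.1, e.g. build with: nvcc -arch=sm_61 dp4a_demo.cu
#include <cstdio>
#include <cuda_runtime.h>

// Each __dp4a performs 4 signed 8-bit multiply-adds in a single instruction.
__global__ void dot_i8(const int *a, const int *b, int n_packed, int *out) {
    int sum = 0;
    for (int i = threadIdx.x; i < n_packed; i += blockDim.x) {
        sum = __dp4a(a[i], b[i], sum);
    }
    atomicAdd(out, sum);  // crude block-wide reduction, sufficient for the demo
}

int main() {
    const int n_packed = 256;            // 1024 int8 values per vector
    int ha[n_packed], hb[n_packed];
    for (int i = 0; i < n_packed; ++i) {
        ha[i] = 0x01010101;              // four bytes, each equal to 1
        hb[i] = 0x02020202;              // four bytes, each equal to 2
    }

    int *da, *db, *dout;
    cudaMalloc(&da, sizeof(ha));
    cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);
    cudaMemset(dout, 0, sizeof(int));

    dot_i8<<<1, 128>>>(da, db, n_packed, dout);

    int result = 0;
    cudaMemcpy(&result, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dot = %d (expected %d)\n", result, 1 * 2 * 4 * n_packed);

    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```

With src1 quantized to q8_1 (8-bit values plus per-block scale data), the inner dot products of the matrix multiplication can map onto this kind of instruction, or onto int8 tensor core MMAs on newer GPUs, instead of first dequantizing src0 to floating point.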