For matrix multiplication operations with quantized models, is the fp32 src1 first quantized to Q8_1? #11743
Unanswered · BishmoyPaul asked this question in Q&A · Replies: 1 comment, 1 reply
I was looking into `ggml_cuda_mul_mat` (ggml-cuda.cu L1844) to understand how it works for quantized models. It seems that when src1 (which is usually the input/hidden state) is in FP32, it is first converted to Q8_1 before the actual operation; for example, at L1899 `quantize_row_q8_1_cuda` is passed as an argument.

Am I correct in assuming src1 is indeed quantized to Q8_1? If so, why use Q8_1 for src1? For simpler quantizations like Q4_0, where the model weights are Q4_0-quantized, why convert src1 to Q8_1 rather than Q8_0?
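For reference, here is a simplified sketch of the two 8-bit block layouts being compared; field names and exact types are approximate and have changed across ggml versions (the real definitions live in ggml's headers):

```c
// Simplified sketch of the Q8_0 vs Q8_1 block layouts (approximate; see
// ggml's headers for the actual definitions).
#include <stdint.h>

typedef uint16_t fp16_t;   // placeholder for ggml's fp16 storage type

#define QK8_0 32
typedef struct {
    fp16_t d;              // per-block scale (delta)
    int8_t qs[QK8_0];      // 32 signed 8-bit quants
} block_q8_0;

#define QK8_1 32
typedef struct {
    fp16_t d;              // per-block scale (delta)
    fp16_t s;              // precomputed per-block sum term (not present in Q8_0;
                           // exact definition varies by version)
    int8_t qs[QK8_1];      // 32 signed 8-bit quants
} block_q8_1;
```

As far as I can tell, the only structural difference is that Q8_1 additionally carries a per-block sum alongside the scale, so part of what I'm asking is whether that extra term is the reason the activations are quantized to Q8_1 rather than Q8_0.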
Reply:

@BishmoyPaul Excuse me, I saw that you posted a successful reproduction result on the cap4video forum. Could you share the exact commands you ran? The results I reproduced were very poor, and I would really appreciate your help.