-
Hi, I'm a bit lost when it comes to using llama.cpp's conversion and quantization tools. Let's say I want to quantize a model whose original weights are stored in BF16: the first step is converting the checkpoint to an intermediate GGUF file, which then gets quantized.
So my question is: if I want that first step to be completely lossless, would it make more sense to convert it to bf16 at this step? Or alternatively, would it actually make more sense to convert to f32 for it to be completely lossless? The part that confuses me is the summary printed when I convert it to bf16 and then run the quantization step.
This is the part that doesn't make sense to me: I don't understand why there would be 169 tensors left unquantized. Thank you for your time!
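To show what I mean by "completely lossless", here is a small sketch (PyTorch, purely illustrative values, not the actual conversion code): widening BF16 to F32 round-trips exactly, whereas F16 would not be a safe intermediate.

```python
# Minimal sketch (PyTorch, illustrative values) of why BF16 -> F32 is
# lossless: F32 keeps all of BF16's exponent and mantissa bits.
import torch

w_bf16 = torch.randn(4096, dtype=torch.bfloat16)      # stand-in for model weights
w_f32 = w_bf16.to(torch.float32)                       # pure widening, no rounding
print(torch.equal(w_bf16, w_f32.to(torch.bfloat16)))   # True: exact round-trip

# F16 is not a safe intermediate: its exponent range is smaller than BF16's,
# so large BF16 values overflow when cast down to F16.
big = torch.tensor([3.0e38], dtype=torch.bfloat16)
print(big.to(torch.float16))                           # tensor([inf], dtype=torch.float16)
```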
-
For archival purposes, for sure.
Depends, a minor issue with storing it in BF16 is that this format is not directly supported (yet) by most hardware acceleration in llama.cpp. If you want to do inference from the original unquantized model, F32 is probably the way to go right now (but again, this may change in the future).
Certain weights will never be quantized (usually because doing so would practically destroy the model), this output just recounts how many were left in original quality.
This is not about the tensors in the original BF16, but rather the tensors in your new quantized GGUF. :)
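If you want to check this yourself, here is a rough sketch assuming the `gguf` Python package that ships with llama.cpp (`pip install gguf`) and its `GGUFReader`; the file name is just a placeholder. It counts how many tensors of each type ended up in your quantized GGUF:

```python
# Rough sketch, assuming the `gguf` Python package and its GGUFReader API;
# the file name below is a placeholder for your quantized output.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("model-Q4_K_M.gguf")

# Count tensors per storage type; the ones still reported as F32 are typically
# small 1-D tensors (norms, biases) that the quantizer intentionally skips.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{n:5d} tensors of type {qtype}")
```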
-
@CISC Thank you for your very helpful answer, this answers all my questions!