Replies: 1 comment
I should have asked a model first, my apologies. In case anyone else looks it up: the output weights are quantized, but the math that produces the logits is still done in an fp32 context, so the logits are fp32.
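To make that concrete, here is a toy sketch of the idea in plain NumPy, with fp16 standing in for the quantized/bf16 storage type. This is not llama.cpp's actual kernel code, just the shape of what happens:

```python
import numpy as np

# Toy illustration only: fp16 stands in for the low-precision storage format.
hidden = np.random.randn(8).astype(np.float32)       # final hidden state (fp32 activations)
w_out  = np.random.randn(16, 8).astype(np.float16)   # output-projection weights, stored low-precision

# The weights are up-converted and the matmul runs in an fp32 context,
# so the logits that come out are fp32 regardless of the storage dtype.
logits = w_out.astype(np.float32) @ hidden
print(logits.dtype)  # float32
```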
I noticed that llama_cpp_python's model.scores was always returning np arrays with dtype fp32, and I filed a bug, assuming they had bound the underlying llama_get_logits() function incorrectly. But then I came here and I see:
That's hardcoded, right? Just fp16 up-converted to fp32, whether the model is fp16, bf16, or K_L-quantized (8-bit float)?
Am I missing something?
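A minimal way to see the dtype (assuming llama-cpp-python and any local GGUF file; the path and prompt here are placeholders):

```python
import numpy as np
from llama_cpp import Llama

# Placeholder path/prompt; logits_all=True keeps the per-token logits around.
llm = Llama(model_path="./model.gguf", n_ctx=512, logits_all=True, verbose=False)
llm("The quick brown fox", max_tokens=1)

scores = np.asarray(llm.scores)   # shape (n_ctx, n_vocab)
print(scores.dtype)               # float32, regardless of the weight quantization
```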
I have a bf16 model with an 8k context and a 256,000-token vocabulary. That means the fp32 logits buffer is ~8 GB, yet the model's bf16 weights are only ~4 GB (right?). That's a lot of unnecessary RAM, if I'm not mistaken.
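The arithmetic behind the ~8 GB figure (just multiplication, assuming one row of logits is kept per context position; a 2-byte format would halve the buffer):

```python
# Full logits buffer when every context position keeps a row of logits.
n_ctx, n_vocab = 8192, 256_000
fp32_buffer = n_ctx * n_vocab * 4   # 4 bytes per fp32 value
bf16_buffer = n_ctx * n_vocab * 2   # hypothetical 2-byte storage

print(f"fp32: {fp32_buffer / 2**30:.1f} GiB")  # ~7.8 GiB
print(f"bf16: {bf16_buffer / 2**30:.1f} GiB")  # ~3.9 GiB
```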