I was converting llama2-13b from pth to ggml, and I noticed that the F16 vectors in the input are converted to F32 in the output. Ex: https://github.com/ggerganov/llama.cpp/blob/08a0c0206075556e82aca0feafad530dcc5f1426/convert.py#L142

I went through the history, and the forced conversion to F32 was added when convert.py was first created in early 2023, when it replaced convert-pth-to-ggml.py. I don't see anything in convert-pth-to-ggml.py that did the forced conversion.

I mostly just want to update the comments in convert.py to explain why it's doing that conversion. I'm wondering if it was done because GPU support for F32 was better at the time than F16 support, so it made sense to convert everything to F32, since all of the dequantize_mul_mat_vec_* operations take in an F32 array? But all of this is about what the backends want. Part of me thinks it'd be more logical to store it as F16 and convert it to F32 if it's loaded by the CUDA backend?

Realistically, maybe it just needs a note that says: "All the llama code is optimized assuming vectors are F32. Because of that, we store all vectors on disk as F32. They could possibly be stored as F16, but then during loading they would have to be converted to F32, and since they are all small memory-wise, this is the better choice."?

Back story: I want to contribute to the CUDA code, so I'm trying to familiarize myself with the code base and PR process. I noticed this and figured it'd be a simple-ish item to review/document. I'll do a PR for either a better comment or removing the forced conversion. I have no problem doing a larger fix, but figured I'd start with something logically simple.
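For reference, here is a minimal sketch (in Python/numpy, with made-up names, not the actual convert.py code) of the behavior I'm describing: 1-D tensors are always upcast to F32, while 2-D weight matrices keep their source precision.

```python
import numpy as np

def pick_output_dtype(tensor: np.ndarray) -> np.dtype:
    # Hypothetical illustration of the forced conversion:
    # 1-D tensors (norm weights, etc.) are always written as F32,
    # everything else keeps its source precision (or is quantized later).
    if tensor.ndim == 1:
        return np.dtype(np.float32)
    return tensor.dtype

# An F16 norm vector from the .pth checkpoint ends up as F32 in the ggml file
norm_weight = np.ones(5120, dtype=np.float16)
print(pick_output_dtype(norm_weight))  # float32
```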
Replies: 1 comment 2 replies
The reason to keep and/or cast the 1D tensors (i.e. vectors) in F32 format is that they are very small compared to all other 2D tensors in the models. The performance difference between having F16 vs F32 1D tensors will be negligible (except for some very small models probably). Therefore it is easier to have a single F32 implementation of the respective operators (ggml_scale, etc.) and keep the data with the highest precision. In the future this can be extended to support F16 vectors, but a first big step before that is adding support for F16 output - currently (almost) all ggml operators produce the result in F32 format.
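To put the 'very small' point in numbers, here is a rough back-of-the-envelope sketch (Python, using llama2-13b-like shapes assumed from memory rather than read from a checkpoint): the 1-D norm vectors are a tiny fraction of the parameter count, so keeping them in F32 instead of F16 costs well under a megabyte on disk.

```python
# Rough estimate for llama2-13b-like shapes (n_embd=5120, n_layer=40); the
# numbers are illustrative assumptions, not read from an actual checkpoint.
n_embd, n_layer = 5120, 40

# 1-D tensors: attn_norm + ffn_norm per layer, plus the final output norm
n_vector_params = n_embd * (2 * n_layer + 1)   # ~0.4M parameters

total_params = 13e9                            # ~13B parameters overall
extra_bytes = n_vector_params * 2              # F32 costs 2 extra bytes per value vs F16

print(f"1-D share of params : {n_vector_params / total_params:.5%}")  # ~0.003%
print(f"extra size from F32 : {extra_bytes / 1024**2:.2f} MiB")       # ~0.79 MiB
```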