I was converting llama2-13b from pth to ggml, and I noticed that the F16 vectors in the input are converted to F32 in the output. Ex: https://github.com/ggerganov/llama.cpp/blob/08a0c0206075556e82aca0feafad530dcc5f1426/convert.py#L142

I went through the history, and the forced conversion to F32 was added when convert.py was first created in early 2023, when it replaced convert-pth-to-ggml.py. I don't see anything in convert-pth-to-ggml.py that did the forced conversion.

I mostly just want to update the comments in convert.py to explain why it's doing that conversion. I'm wondering if it was done because GPU support for F32 was better at the time than F16 support, so it made sense to convert everything to F32, since all of the dequantize_mul_mat_vec_* operations take in an F32 array? But all of this is about what the backends want. Part of me thinks it'd be more logical to store it as F16 and convert it to F32 if it's loaded by the CUDA backend?

Realistically, maybe it just needs a note that says: "All the llama code is optimized assuming vectors are F32. Because of that, we store all vectors on disk as F32. They could possibly be stored as F16, but then during loading they would have to be converted to F32, and since they are all small memory-wise, this is the better choice."?

Back story: I want to contribute to the CUDA code, so I'm trying to familiarize myself with the code base and PR process. I noticed this and figured it'd be a simple-ish item to review/document. I'll do a PR for either a better comment or removing the forced conversion. I have no problem doing a larger fix, but figured I'd start with something logically simple.
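For reference, here is a minimal sketch (in Python/numpy, with made-up names, not the actual convert.py code) of the behavior I'm describing: 1-D tensors are always upcast to F32, while 2-D weight matrices keep their source precision.

```python
import numpy as np

def pick_output_dtype(tensor: np.ndarray) -> np.dtype:
    # Hypothetical illustration of the forced conversion:
    # 1-D tensors (norm weights, etc.) are always written as F32,
    # everything else keeps its source precision (or is quantized later).
    if tensor.ndim == 1:
        return np.dtype(np.float32)
    return tensor.dtype

# An F16 norm vector from the .pth checkpoint ends up as F32 in the ggml file
norm_weight = np.ones(5120, dtype=np.float16)
print(pick_output_dtype(norm_weight))  # float32
```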
Replies: 1 comment 2 replies
The reason to keep and/or cast the 1D tensors (i.e. vectors) in F32 format is that they are very small compared to all other 2D tensors in the models. The performance difference between having F16 vs F32 1D tensors will be negligible (except for some very small models probably). Therefore it is easier to have a single F32 implementation of the respective operators (ggml_scale, etc.) and keep the data with the highest precision. In the future this can be extended to support F16 vectors, but a first big step before that is adding support for F16 output - currently (almost) all ggml operators produce the result in F32 format.
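To put the 'very small' point in numbers, here is a rough back-of-the-envelope sketch (Python, using llama2-13b-like shapes assumed from memory rather than read from a checkpoint): the 1-D norm vectors are a tiny fraction of the parameter count, so keeping them in F32 instead of F16 costs well under a megabyte on disk.

```python
# Rough estimate for llama2-13b-like shapes (n_embd=5120, n_layer=40); the
# numbers are illustrative assumptions, not read from an actual checkpoint.
n_embd, n_layer = 5120, 40

# 1-D tensors: attn_norm + ffn_norm per layer, plus the final output norm
n_vector_params = n_embd * (2 * n_layer + 1)   # ~0.4M parameters

total_params = 13e9                            # ~13B parameters overall
extra_bytes = n_vector_params * 2              # F32 costs 2 extra bytes per value vs F16

print(f"1-D share of params : {n_vector_params / total_params:.5%}")  # ~0.003%
print(f"extra size from F32 : {extra_bytes / 1024**2:.2f} MiB")       # ~0.79 MiB
```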