-
Hi, I'm a bit lost when it comes to using llama.cpp's conversion and quantization tools. Let's say I want to quantize a model whose original weights are stored in BF16: the first step is converting the checkpoint to an intermediate GGUF file, which then gets quantized.
So my question is: if I want that first step to be completely lossless, would it make more sense to convert it to bf16 at this step? Or alternatively, would it actually make more sense to convert to f32 for it to be completely lossless? The part that confuses me is the summary printed when I convert it to bf16 and then run the quantization step.
This is the part that doesn't make sense to me: I don't understand why there would be 169 tensors left unquantized. Thank you for your time!
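To show what I mean by "completely lossless", here is a small sketch (PyTorch, purely illustrative values, not the actual conversion code): widening BF16 to F32 round-trips exactly, whereas F16 would not be a safe intermediate.

```python
# Minimal sketch (PyTorch, illustrative values) of why BF16 -> F32 is
# lossless: F32 keeps all of BF16's exponent and mantissa bits.
import torch

w_bf16 = torch.randn(4096, dtype=torch.bfloat16)      # stand-in for model weights
w_f32 = w_bf16.to(torch.float32)                       # pure widening, no rounding
print(torch.equal(w_bf16, w_f32.to(torch.bfloat16)))   # True: exact round-trip

# F16 is not a safe intermediate: its exponent range is smaller than BF16's,
# so large BF16 values overflow when cast down to F16.
big = torch.tensor([3.0e38], dtype=torch.bfloat16)
print(big.to(torch.float16))                           # tensor([inf], dtype=torch.float16)
```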
-
For archival purposes, for sure.
Depends, a minor issue with storing it in BF16 is that this format is not directly supported (yet) by most hardware acceleration in llama.cpp. If you want to do inference from the original unquantized model, F32 is probably the way to go right now (but again, this may change in the future).
Certain weights will never be quantized (usually because doing so would practically destroy the model), this output just recounts how many were left in original quality.
This is not about the tensors in the original BF16, but rather the tensors in your new quantized GGUF. :)
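If you want to check this yourself, here is a rough sketch assuming the `gguf` Python package that ships with llama.cpp (`pip install gguf`) and its `GGUFReader`; the file name is just a placeholder. It counts how many tensors of each type ended up in your quantized GGUF:

```python
# Rough sketch, assuming the `gguf` Python package and its GGUFReader API;
# the file name below is a placeholder for your quantized output.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("model-Q4_K_M.gguf")

# Count tensors per storage type; the ones still reported as F32 are typically
# small 1-D tensors (norms, biases) that the quantizer intentionally skips.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{n:5d} tensors of type {qtype}")
```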
-
@CISC Thank you for your very helpful answer, this answers all my questions!