Weird error message when loading a model #9299

wDomin · 2024-09-03T14:59:00Z

wDomin
Sep 3, 2024

Hello everyone,
I'm pretty new to llama.cpp and really have been using it only through other tools so my question might be stupid.
So, I'm trying to load a model using a tool called gradio (https://www.gradio.app/) in order to build quick apps that I can share with team members.
Gradio uses Llama.cpp to load GGUF models for inference.

I'm currently looking for models that allows me to do some special NER tasks.
Being not quite happy with what I have, I decided to use a tool called Ludwig to fine-tune models in order to specialise model to a specific extraction task..

So I took several models, fine-tuned them and created Lora adapters files ( .safetensors files), converted those files to ggml files and then exported those files to GGUF.

Then I use gradio to host those models and validate the quality of those models.

It did work for most of the models I tested, Llama2, Llama3, Mixtral-8 and Phi-3.
Most of the time I use quantizationed versions of those models because I'm quite GPU limited

But when I tested with an open-GPT4 model (https://huggingface.co/TheBloke/Open_Gpt4_8x7B_v0.2-GGUF) with quantizationton 4 bits open_gpt4_8x7b_v0.2.Q4_K_M.gguf I had a very weird issue:

2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: loaded meta data with 25 key-value pairs and 995 tensors from /workspace/.cache/huggingface/hub/models--XXXXX--Open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter-GGUF/snapshots/56e5dfed05fdd997713eec5e12ea5fbb28dd337d/./open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter.gguf (version GGUF V3 (latest))
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   0:                       general.architecture str              = llama
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   1:                               general.name str              = rombodawg_open_gpt4_8x7b_v0.2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   4:                          llama.block_count u32              = 32
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  10:                         llama.expert_count u32              = 8
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  11:                    llama.expert_used_count u32              = 2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  13:                          general.file_type u32              = 15
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv  24:               general.quantization_version u32              = 2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type  f32:   65 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type  f16:   32 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type q8_0:   64 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type q4_K:  705 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type q6_K:  129 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_vocab: special tokens cache size = 259
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_vocab: token to piece cache size = 0.1637 MB
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: format           = GGUF V3 (latest)
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: arch             = llama
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: vocab type       = SPM
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_vocab          = 32000
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_merges         = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_ctx_train      = 32768
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd           = 4096
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_head           = 32
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_head_kv        = 8
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_layer          = 32
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_rot            = 128
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_swa            = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_head_k    = 128
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_head_v    = 128
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_gqa            = 4
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_k_gqa     = 1024
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_v_gqa     = 1024
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_norm_eps       = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_clamp_kqv      = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_logit_scale    = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_ff             = 14336
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_expert         = 8
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_expert_used    = 2
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: causal attn      = 1
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: pooling type     = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: rope type        = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: rope scaling     = linear
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: freq_base_train  = 1000000.0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: freq_scale_train = 1
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_ctx_orig_yarn  = 32768
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: rope_finetuned   = unknown
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_d_conv       = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_d_inner      = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_d_state      = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_dt_rank      = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model type       = 8x7B
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model ftype      = Q4_K - Medium
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model params     = 46.70 B
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model size       = 26.43 GiB (4.86 BPW) 
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: general.name     = rombodawg_open_gpt4_8x7b_v0.2
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: BOS token        = 1 '<s>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: EOS token        = 2 '</s>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: UNK token        = 0 '<unk>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: PAD token        = 0 '<unk>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: LF token         = 13 '<0x0A>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: max token length = 48
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_tensors: ggml ctx size =    0.38 MiB
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_load: error loading model: create_tensor_as_view: tensor 'blk.4.ffn_down.2.weight' has wrong type; expected q4_K, got q6_K
2024-09-03T13:47:33Z [app] [l4jdt] llama_load_model_from_file: failed to load model
2024-09-03T13:47:33Z [app] [l4jdt] Traceback (most recent call last):
2024-09-03T13:47:33Z [app] [l4jdt]   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-09-03T13:47:33Z [app] [l4jdt]     return _run_code(code, main_globals, None,
2024-09-03T13:47:33Z [app] [l4jdt]   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-09-03T13:47:33Z [app] [l4jdt]     exec(code, run_globals)
2024-09-03T13:47:33Z [app] [l4jdt]   File "/workspace/pdf2stix_demo/app.py", line 42, in <module>
2024-09-03T13:47:33Z [app] [l4jdt]     generator = generator_cls(
2024-09-03T13:47:33Z [app] [l4jdt]   File "<string>", line 9, in __init__
2024-09-03T13:47:33Z [app] [l4jdt]   File "/workspace/pdf2stix_demo/json_generator.py", line 129, in __post_init__
2024-09-03T13:47:33Z [app] [l4jdt]     self.llama = Llama.from_pretrained(
2024-09-03T13:47:33Z [app] [l4jdt]   File "/workspace/.venv/lib/python3.10/site-packages/llama_cpp/llama.py", line 2091, in from_pretrained
2024-09-03T13:47:33Z [app] [l4jdt]     return cls(
2024-09-03T13:47:33Z [app] [l4jdt]   File "/workspace/.venv/lib/python3.10/site-packages/llama_cpp/llama.py", line 358, in __init__
2024-09-03T13:47:33Z [app] [l4jdt]     self._model = self._stack.enter_context(contextlib.closing(_LlamaModel(
2024-09-03T13:47:33Z [app] [l4jdt]   File "/workspace/.venv/lib/python3.10/site-packages/llama_cpp/_internals.py", line 54, in __init__
2024-09-03T13:47:33Z [app] [l4jdt]     raise ValueError(f"Failed to load model from file: {path_model}")
2024-09-03T13:47:33Z [app] [l4jdt] ValueError: Failed to load model from file: /workspace/.cache/huggingface/hub/models--XXXXX--Open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter-GGUF/snapshots/56e5dfed05fdd997713eec5e12ea5fbb28dd337d/./open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter.gguf
2024-09-03T13:47:33Z [app] [l4jdt] Running with Namespace(model_id='XXXXX/Open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter-GGUF', gguf_filename='open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter.gguf', llm_backend=<LLMBackend.llamacpp: 'llamacpp'>, no_gpu=False, context_length=0, gradio_port=8080, few_shots=PosixPath('data/few-shots-sparsify-relationwise.jsonl'), n_shots=2, random_seed=42)

I haven't found this error anywhere so I'm quite not sure what is happening here, if anyone had an idea that would be great.

Thx in advance

slaren · 2024-09-03T15:04:38Z

slaren
Sep 3, 2024
Maintainer

The problem with this model is that some of the experts use a different quantization type than the rest. This is not supported in the current version of llama.cpp, all the experts must have the same type. If you want to use this model, you would need to convert it to gguf and quantize it yourself with a recent version of llama.cpp.

3 replies

wDomin Sep 3, 2024
Author

I think that it is what I did (maybe?)
I fine-tuned and quantized to 4 bits with Ludwig, and then I grabe the Q4_K_M version of the GGUF version of the model to export my safetensors to a gguf, I just don't understand the error.
I would agree with you if it was a random gguf model I used, but I used the one provided by TheBloke and with all the other models I tried, I also used TheBloke GGUF files and it worked fine.
I'm gonna try to quantize and export to gguf the base model rombodawg/Open_Gpt4_8x7B_v0.2 and I'll let you know

wDomin Sep 5, 2024
Author

Ok, my bad @slaren , I just understood what you meant by "This is not supported". But I also found out that, on the HF page of the base GGUF model, the author said it was supported by the latest llama.cpp version

Compatibility
These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d
-- TheBloke/Open_Gpt4_8x7B_v0.2-GGUF

So I'm a bit lost.
I'm gonn try to re export the safetensors to GGUF with the updated llama.cpp and see if it changes things...

slaren Sep 5, 2024
Maintainer

That's August 27th 2023. It's a very old model and there has been a lot of development since then.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Weird error message when loading a model #9299

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Weird error message when loading a model #9299

Uh oh!

wDomin Sep 3, 2024

Replies: 1 comment · 3 replies

Uh oh!

slaren Sep 3, 2024 Maintainer

Uh oh!

wDomin Sep 3, 2024 Author

Uh oh!

wDomin Sep 5, 2024 Author

Uh oh!

slaren Sep 5, 2024 Maintainer

wDomin
Sep 3, 2024

Replies: 1 comment 3 replies

slaren
Sep 3, 2024
Maintainer

wDomin Sep 3, 2024
Author

wDomin Sep 5, 2024
Author

slaren Sep 5, 2024
Maintainer