-
The problem with this model is that some of the experts use a different quantization type than the rest. This is not supported in the current version of llama.cpp; all the experts must have the same type. If you want to use this model, you would need to convert it to GGUF and quantize it yourself with a recent version of llama.cpp.
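For reference, the re-conversion boils down to two steps. Here is a minimal sketch (wrapped in Python; the paths, the location of the original HF weights and the exact script/binary names are assumptions you would adapt to your own llama.cpp checkout):

```python
# Minimal sketch: re-convert and re-quantize with a recent llama.cpp checkout.
# Assumes the original (unquantized) HF weights are on disk and llama.cpp has
# been built locally -- adjust the paths and binary names to your setup.
import subprocess

hf_model_dir = "path/to/Open_Gpt4_8x7B_v0.2"       # original safetensors weights
f16_gguf     = "open_gpt4_8x7b_v0.2.f16.gguf"      # intermediate full-precision GGUF
quant_gguf   = "open_gpt4_8x7b_v0.2.Q4_K_M.gguf"   # final quantized GGUF

# 1) Convert the HF model to an f16 GGUF (the conversion script ships with llama.cpp)
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", hf_model_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# 2) Quantize it, so that every expert ends up with the same quantization type
subprocess.run(
    ["llama.cpp/llama-quantize", f16_gguf, quant_gguf, "Q4_K_M"],
    check=True,
)
```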
-
Hello everyone,
I'm pretty new to llama.cpp and have really only been using it through other tools, so my question might be a stupid one.
So, I'm trying to load a model using a tool called gradio (https://www.gradio.app/) in order to build quick apps that I can share with team members.
Gradio uses Llama.cpp to load GGUF models for inference.
I'm currently looking for models that allow me to do some special NER tasks.
Not being quite happy with what I have, I decided to use a tool called Ludwig to fine-tune models in order to specialise them for a specific extraction task.
So I took several models, fine-tuned them, created LoRA adapter files (.safetensors), converted those files to ggml files and then exported them to GGUF.
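For context, with a recent llama.cpp the adapter conversion step looks roughly like this (a simplified sketch; the paths are placeholders and the exact script name and flags depend on the llama.cpp version you have):

```python
# Simplified sketch of the LoRA adapter conversion (paths are placeholders).
# Recent llama.cpp ships convert_lora_to_gguf.py, which turns a PEFT-style
# .safetensors adapter into a GGUF adapter in one step.
import subprocess

base_model_dir = "path/to/base-model-hf"        # the HF model the LoRA was trained on
lora_dir       = "path/to/ludwig-lora-output"   # contains the adapter .safetensors + config
out_gguf       = "my-ner-adapter.gguf"

subprocess.run(
    ["python", "llama.cpp/convert_lora_to_gguf.py", lora_dir,
     "--base", base_model_dir, "--outfile", out_gguf],
    check=True,
)
```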
Then I use gradio to host those models and validate their quality.
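My hosting wrapper is roughly the following (a simplified sketch using llama-cpp-python; the model path, prompt and parameters are placeholders, not my exact setup):

```python
# Simplified sketch of hosting a GGUF model in gradio via llama-cpp-python.
# The model path, prompt and parameters below are placeholders.
import gradio as gr
from llama_cpp import Llama

llm = Llama(
    model_path="open_gpt4_8x7b_v0.2.Q4_K_M.gguf",  # the quantized GGUF to validate
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload as many layers as fit on the GPU
)

def extract_entities(text: str) -> str:
    # Very simple prompt for the NER-style extraction task
    prompt = f"Extract the named entities from the following text:\n{text}\nEntities:"
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]

demo = gr.Interface(fn=extract_entities, inputs="text", outputs="text")
demo.launch(share=True)  # share=True gives team members a link they can open
```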
It did work for most of the models I tested: Llama2, Llama3, Mixtral-8 and Phi-3.
Most of the time I use quantized versions of those models because I'm quite GPU-limited.
But when I tested an open-GPT4 model (https://huggingface.co/TheBloke/Open_Gpt4_8x7B_v0.2-GGUF) with 4-bit quantization (open_gpt4_8x7b_v0.2.Q4_K_M.gguf), I ran into a very weird issue:
I haven't found this error anywhere, so I'm really not sure what is happening here; if anyone has an idea, that would be great.
Thx in advance