Replies: 1 comment
-
hmm, i take it back. it's working fine at q4, q4_1, q5_k. Just not q5_1. q6_k works great!
-
We use llama.cpp for inference (via the Python bindings).
Because fine-tuning in llama.cpp doesn't support the fast/GPU path the way inference does, we do the fine-tuning in Python.
We then use model.merge_and_unload() to get a single merged model,
and then we use convert.py to convert it to a GGUF.
The final result winds up f32.
We then call quantize. I would like to quantize to q5_1, but we only ever get q8 (8-bit),
and the output says we are "re-quantizing".
(The resulting GGUF works perfectly fine and retains the fine-tune nicely - but it won't go lower than q8.)
A sketch of the full pipeline is below. Am I doing this horribly wrong?
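For reference, here is a stripped-down sketch of what we run end to end. The model name, directories, and llama.cpp paths are placeholders for illustration, and the convert.py / quantize invocations are from our local checkout, so treat this as an outline rather than the exact commands:

```python
# Sketch of the pipeline described above.
# BASE_MODEL, ADAPTER_DIR, MERGED_DIR, and the llama.cpp paths are placeholders.
import subprocess

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder base model
ADAPTER_DIR = "./lora-adapter"            # placeholder LoRA adapter directory
MERGED_DIR = "./merged-model"             # where the merged HF model gets saved

# 1. Merge the LoRA adapter into the base model and save a plain HF model.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
merged = model.merge_and_unload()
merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(MERGED_DIR)

# 2. Convert the merged HF model to GGUF with llama.cpp's convert.py.
#    We end up with an f32 file at this stage.
subprocess.run(
    ["python", "llama.cpp/convert.py", MERGED_DIR,
     "--outtype", "f32", "--outfile", "merged-f32.gguf"],
    check=True,
)

# 3. Quantize the GGUF down to Q5_1 with the quantize binary.
subprocess.run(
    ["llama.cpp/quantize", "merged-f32.gguf", "merged-q5_1.gguf", "Q5_1"],
    check=True,
)
```

It's that last quantize step that only ever produces q8.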