Replies: 1 comment
-
hmm, i take it back. it's working fine at q4, q4_1, q5_k. Just not q5_1. q6_k works great!
-
We use llama.cpp for inference (via the Python bindings).
Because fine-tuning in llama.cpp doesn't support the fast/GPU path the way inference does, we do the fine-tuning in Python.
We then use model.merge_and_unload() to get a single merged model,
and then we use convert.py to convert it to a GGUF.
The final result winds up f32.
We then call quantize. I would like to quantize to q5_1, but we only ever get q8 (8-bit),
and the output says we are "re-quantizing".
(The resulting GGUF works perfectly fine and retains the fine-tune nicely - but it won't go lower than q8.)
A sketch of the full pipeline is below. Am I doing this horribly wrong?
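For reference, here is a stripped-down sketch of what we run end to end. The model name, directories, and llama.cpp paths are placeholders for illustration, and the convert.py / quantize invocations are from our local checkout, so treat this as an outline rather than the exact commands:

```python
# Sketch of the pipeline described above.
# BASE_MODEL, ADAPTER_DIR, MERGED_DIR, and the llama.cpp paths are placeholders.
import subprocess

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder base model
ADAPTER_DIR = "./lora-adapter"            # placeholder LoRA adapter directory
MERGED_DIR = "./merged-model"             # where the merged HF model gets saved

# 1. Merge the LoRA adapter into the base model and save a plain HF model.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
merged = model.merge_and_unload()
merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(MERGED_DIR)

# 2. Convert the merged HF model to GGUF with llama.cpp's convert.py.
#    We end up with an f32 file at this stage.
subprocess.run(
    ["python", "llama.cpp/convert.py", MERGED_DIR,
     "--outtype", "f32", "--outfile", "merged-f32.gguf"],
    check=True,
)

# 3. Quantize the GGUF down to Q5_1 with the quantize binary.
subprocess.run(
    ["llama.cpp/quantize", "merged-f32.gguf", "merged-q5_1.gguf", "Q5_1"],
    check=True,
)
```

It's that last quantize step that only ever produces q8.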