Hi, I wanted to convert a QLoRA fine-tuned model to GGUF for CPU inference, but unfortunately I observe a big degradation in model performance (accuracy). Here is what I tested:
QLoRA -> GOOD
QLoRA -> merge adapter after dequantization -> GOOD (the merge step is sketched below)
QLoRA -> merge adapter after dequantization -> convert to GGUF with llama.cpp -> BAD
QLoRA -> merge adapter after dequantization -> quantize again with the same bnb config -> BAD
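For context, the dequantize-then-merge step looks roughly like this; the paths and bnb values are placeholders for my actual setup, and model.dequantize() needs a recent transformers release:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Same bitsandbytes config that was used for QLoRA training
# (placeholder values; use whatever the adapter was trained with).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model quantized exactly as during training ...
base = AutoModelForCausalLM.from_pretrained(
    "<BASE_MODEL_PATH>",
    quantization_config=bnb_config,
    device_map="auto",
)

# ... then dequantize back to half precision, so the weights the adapter
# is merged into match the (quantized) weights it was trained against.
base = base.dequantize()

# Attach the LoRA adapter and fold it into the dequantized weights.
model = PeftModel.from_pretrained(base, "<ADAPTER_PATH>")
model = model.merge_and_unload()
model.save_pretrained("<MERGED_MODEL_PATH>")
```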
From what I've read here (https://kaitchup.substack.com/p/training-loading-and-merging-qdora), it should work with AWQ quantization, but AWQ is not supported here. Maybe it is worth adding; I saw there was an attempt a few months ago. Or maybe I should use different quantization parameters. What I ran was:

python convert-hf-to-gguf.py <MERGED_MODEL_PATH> --outtype q8_0 --outfile <OUTPUT_MODEL_NAME.gguf>
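One thing I'm considering, in case quantizing during conversion is part of the problem, is splitting conversion and quantization into two steps so each stage can be checked on its own (untested on my side; the file names are placeholders, and depending on the llama.cpp build the tool is called quantize or llama-quantize):

```sh
# Convert the merged HF model to GGUF at f16 first ...
python convert-hf-to-gguf.py <MERGED_MODEL_PATH> --outtype f16 --outfile model-f16.gguf

# ... then quantize to q8_0 as a separate llama.cpp step.
./llama-quantize model-f16.gguf model-q8_0.gguf q8_0
```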
Any idea what could be a solution here?
Thanks
Tomek