Recipe for quantizing + LoRA-ing models #15959
davisyoshida asked this question in Show and tell
I just put up a notebook showing how to combine GPT-Q and LoRA for finetuning JAX models. I've already posted my LoRA implementation here, so the new part is the GPT-Q script, which makes it easy to apply quantization to arbitrary functions.
The PyTorch implementations of GPT-Q mostly seem to rely on hand-writing a script for each new architecture (there's a repo dedicated just to applying it to LLaMA). In JAX, I was able to write a single implementation that works on everything from a lone matmul to HuggingFace's GPT-2 and T5 models.
Combining GPT-Q and LoRA for something like LLaMA lets you store the bulk of the model parameters in 4 bits (about 6.5 GB for the 13B LLaMA) and then finetune only a small (<10M) set of adapter parameters. This recipe has let me finetune the 7B LLaMA successfully on a single 24 GB GPU.
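As a quick sanity check on the storage number (assuming 4 bits per parameter and ignoring quantization metadata like scales and zero points):

```python
# Rough storage math for 4-bit quantized weights (ignores scales/zero-points).
params_13b = 13e9
bytes_per_param = 4 / 8  # 4 bits per parameter
print(f"{params_13b * bytes_per_param / 1e9:.1f} GB")  # -> 6.5 GB
```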
Basically all you need to do is quantize the model's weights with the GPT-Q script and then apply the LoRA transform on top, training only the adapter parameters.
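For reference, here is a minimal, self-contained sketch of the pattern the recipe relies on. This is not the notebook's actual code, and the shapes, optimizer, and variable names are made up for illustration: the base weight stays frozen (standing in for the GPT-Q 4-bit parameters), while only the small low-rank LoRA factors receive gradients.

```python
# Sketch: frozen "quantized" base weight + trainable low-rank LoRA factors.
import jax
import jax.numpy as jnp
import optax  # assumed available for the optimizer

d_in, d_out, rank = 512, 512, 8

key = jax.random.PRNGKey(0)
k_w, k_a, k_x = jax.random.split(key, 3)

# Frozen base weight; a real setup would store this as GPT-Q packed 4-bit ints
# and dequantize (or use a quantized matmul) inside the forward pass.
frozen_w = jax.random.normal(k_w, (d_in, d_out)) / jnp.sqrt(d_in)

# Trainable LoRA parameters: B starts at zero so the adapted model initially
# matches the base model exactly.
lora_params = {
    "a": jax.random.normal(k_a, (d_in, rank)) / jnp.sqrt(d_in),
    "b": jnp.zeros((rank, d_out)),
}

def forward(lora_params, frozen_w, x):
    # Base (quantized) projection plus the low-rank correction x @ A @ B.
    return x @ frozen_w + (x @ lora_params["a"]) @ lora_params["b"]

def loss_fn(lora_params, frozen_w, x, y):
    pred = forward(lora_params, frozen_w, x)
    return jnp.mean((pred - y) ** 2)

# Only the LoRA parameters go through the optimizer; the quantized base
# weights never receive gradients.
opt = optax.adam(1e-3)
opt_state = opt.init(lora_params)

@jax.jit
def train_step(lora_params, opt_state, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(lora_params, frozen_w, x, y)
    updates, opt_state = opt.update(grads, opt_state)
    lora_params = optax.apply_updates(lora_params, updates)
    return lora_params, opt_state, loss

x = jax.random.normal(k_x, (16, d_in))
y = jnp.zeros((16, d_out))
lora_params, opt_state, loss = train_step(lora_params, opt_state, x, y)
```

In the actual recipe the frozen matrix would be the packed 4-bit GPT-Q representation, and the LoRA wrapping is applied to the whole model by the function transformation rather than written out by hand for each layer.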
Edit: I should probably mention that the GPT-Q code is pretty rough right now, but I'm hoping to clean it up in the next couple of weeks 😅. It's generally good enough for what I've been using it for, but it might not work for all use cases.