Can we fine-tune LLaVa models in llama.cpp? #4266
Replies: 2 comments 6 replies
-
llava is a multimodal scenario. The current finetune code can only fine-tune the llama model. However, the projection model (the glue between the vit/clip embedding and the llama token embedding) can be, and was, pretrained with the vit/clip and llama models frozen. Using a non-finetuned llama model with the mmproj seems to work okay; it's just not as good as with the additional llava llama finetune.
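To make the "frozen backbones" point concrete, here is a minimal PyTorch-style sketch. The layer shapes and names are stand-ins chosen for illustration, not the actual LLaVA or llama.cpp code: the idea is simply that only the projector receives gradient updates while the vision encoder and the llama model stay frozen.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components; dimensions are illustrative only.
vision_tower = nn.Linear(768, 1024)         # pretend CLIP/ViT image encoder
llm_embeddings = nn.Embedding(32000, 4096)  # pretend llama token-embedding table
projector = nn.Linear(1024, 4096)           # the mmproj "glue"

# Pretraining stage: freeze the vision encoder and the LLM,
# train only the projector.
for p in vision_tower.parameters():
    p.requires_grad_(False)
for p in llm_embeddings.parameters():
    p.requires_grad_(False)

# Only the projector's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```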
In llama, each token gets fed into the model, but how can the model know a token's meaning? Each token has an embedding that is specific to the llama model, so you are really feeding in a series of token embeddings. How do we get an image in here? We insert it as an image embedding: since it's just a vector pointing into the llama token-embedding space, we can technically insert any embedding with any meaning, not limited to the tokens the model was trained on. This is where the projector comes in. We take an existing image-to-embedding model, take the embedding it generates, and map it from one embedding space to the other. That is all it does. It's like glue between two independent models.
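A rough sketch of that mapping (the dimensions and the small-MLP shape are assumptions loosely based on LLaVA-1.5, not taken from this thread): the projector turns each image-patch feature into a vector that lives in llama's embedding space, and those vectors are spliced into the input sequence next to the ordinary token embeddings.

```python
import torch
import torch.nn as nn

# Illustrative sizes: ~1024-d CLIP image features, 4096-d llama embeddings.
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))

image_features = torch.randn(576, 1024)   # one feature vector per image patch (from CLIP)
text_embeddings = torch.randn(8, 4096)    # llama token embeddings for the text prompt

# Project image features into llama's embedding space and splice them into
# the sequence; the LLM consumes them like any other input embeddings.
image_embeddings = projector(image_features)                       # (576, 4096)
llm_input = torch.cat([image_embeddings, text_embeddings], dim=0)  # (584, 4096)
print(llm_input.shape)
```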
-
My guess is that it should be possible with relatively little effort. For clarification: I'm quite sure you can do a lot of customization by just fine-tuning Step 3; even new concepts (image features it has not seen before) might be properly fine-tunable that way. It would be interesting to know how much impact a pure LLM fine-tune has compared to adding …
-
I understand there is inference support for LLaVA models in llama.cpp now, but is it possible to fine-tune them, too? Are the image embeddings fundamentally incompatible with llama.cpp's finetune program, or could fine-tuning of LLaVA be done in the same way as for regular text-only LLMs?
In any case, could someone please explain what the mmproj/llava projector is, and why it doesn't fit in the GGUF file? Maybe that will help answer the question, too.