Can we fine-tune LLaVa models in llama.cpp? #4266
Replies: 2 comments 6 replies
-
llava is a multimodal scenario. The current finetune code can only fine-tune the llama model. However, the projection model (the glue between the vit/clip embedding and the llama token embedding) can be, and was, pretrained with the vit/clip and llama models frozen. Using a non-finetuned llama model with the mmproj seems to work okay; it's just not as good as with the additional llava llama finetune.
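To make the "frozen backbones" point concrete, here is a minimal PyTorch-style sketch. The layer shapes and names are stand-ins chosen for illustration, not the actual LLaVA or llama.cpp code: the idea is simply that only the projector receives gradient updates while the vision encoder and the llama model stay frozen.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components; dimensions are illustrative only.
vision_tower = nn.Linear(768, 1024)         # pretend CLIP/ViT image encoder
llm_embeddings = nn.Embedding(32000, 4096)  # pretend llama token-embedding table
projector = nn.Linear(1024, 4096)           # the mmproj "glue"

# Pretraining stage: freeze the vision encoder and the LLM,
# train only the projector.
for p in vision_tower.parameters():
    p.requires_grad_(False)
for p in llm_embeddings.parameters():
    p.requires_grad_(False)

# Only the projector's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```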
In llama, each token gets fed into the model, but how can the model know a token's meaning? Each token has an embedding that is specific to the llama model, so you are really feeding in a series of token embeddings. How do we get an image in here? We insert it as an image embedding: since it's just a vector pointing into the llama token-embedding space, we can technically insert any embedding with any meaning, not limited to the tokens the model was trained on. This is where the projector comes in. We take an existing image-to-embedding model, take the embedding it generates, and map it from one embedding space to the other. That is all it does. It's like glue between two independent models.
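A rough sketch of that mapping (the dimensions and the small-MLP shape are assumptions loosely based on LLaVA-1.5, not taken from this thread): the projector turns each image-patch feature into a vector that lives in llama's embedding space, and those vectors are spliced into the input sequence next to the ordinary token embeddings.

```python
import torch
import torch.nn as nn

# Illustrative sizes: ~1024-d CLIP image features, 4096-d llama embeddings.
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))

image_features = torch.randn(576, 1024)   # one feature vector per image patch (from CLIP)
text_embeddings = torch.randn(8, 4096)    # llama token embeddings for the text prompt

# Project image features into llama's embedding space and splice them into
# the sequence; the LLM consumes them like any other input embeddings.
image_embeddings = projector(image_features)                       # (576, 4096)
llm_input = torch.cat([image_embeddings, text_embeddings], dim=0)  # (584, 4096)
print(llm_input.shape)
```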
-
My guess is that it should be possible with relatively little effort. For clarification: I'm quite sure you can do a lot of customization by just fine-tuning Step 3; even new concepts (image features it has not seen before) might be properly fine-tunable that way. It would be interesting to know how much impact a pure LLM fine-tune has compared to adding …
-
I understand there is inference support for LLaVA models in llama.cpp now, but is it possible to fine-tune them, too? Are the image embeddings fundamentally incompatible with llama.cpp's finetune program, or could fine-tuning of LLaVA be done in the same way as for regular text-only LLMs?
In any case, could someone please explain what the mmproj/llava projector is, and why it doesn't fit in the GGUF file? Maybe that will help answer the question, too.