
Use chat templates for vision models #173


Merged
6 commits merged into ml-explore:main from vision-chat-templates on Feb 12, 2025

Conversation

DePasqualeOrg
Contributor

This is a test of my PR to Swift Jinja, which should make it possible to use chat templates with vision language models that provide one. I've started to set things up, but I need some pointers on how to integrate the image into the messages.

@DePasqualeOrg force-pushed the vision-chat-templates branch from 5c8ccfa to 3ac6296 on January 9, 2025 at 18:53
@DePasqualeOrg
Contributor Author

@davidkoski, I made some changes, and it seems to work in VLMEval. Do you have any thoughts on this?

@DePasqualeOrg force-pushed the vision-chat-templates branch from 3ac6296 to 4547cf1 on January 9, 2025 at 21:43
@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 14, 2025

I think UserInput will need to be changed to include messages that look like this:

{
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'What is in this image?'},
        {'type': 'image', 'image_url': 'example.jpg'}
    ]
}
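For illustration, a minimal sketch of what such a message could look like on the Swift side, assuming messages become [String: Any] dictionaries (as in the swift-transformers change mentioned below); the keys simply mirror the structure above:

let message: [String: Any] = [
    "role": "user",
    "content": [
        ["type": "text", "text": "What is in this image?"],
        ["type": "image", "image_url": "example.jpg"],
    ],
]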

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 14, 2025

The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

However, in order for the model to work, it looks like we need to replace the single <|image_pad|> with repeated padding like this for each image:

let mergeLength = config.mergeSize * config.mergeSize
let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()
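Purely as an illustration, a sketch of how that substitution might be applied to the templated prompt, continuing the fragment above (prompt and frames are assumed names for the templated string and the [THW] array, not necessarily what the commit uses); a placeholder token avoids re-matching padding that was just inserted:

var promptText = prompt
for thw in frames {
    // Expand the next <|image_pad|> into one placeholder per merged patch for this image
    let repeatedPadding = Array(repeating: "<|placeholder|>", count: thw.product / mergeLength).joined()
    if let range = promptText.range(of: "<|image_pad|>") {
        promptText.replaceSubrange(range, with: repeatedPadding)
    }
}
// Turn the placeholders back into image pad tokens
promptText = promptText.replacingOccurrences(of: "<|placeholder|>", with: "<|image_pad|>")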

@DePasqualeOrg
Contributor Author

I now have something that works, although it still needs to take into account the case where multiple images are included.

@DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 0f68fd2 to ed03ae5 on January 15, 2025 at 09:19
@DePasqualeOrg
Contributor Author

@davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type [THW] frames?

@davidkoski
Collaborator

@davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type [THW] frames?

It sounds OK to me, though they aren't the frames themselves but the positions of the frames in one of the arrays (maybe not in the final array). I'd try frames or framePositions and see how it goes.

@davidkoski
Collaborator

The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

However, in order for the model to work, it looks like we need to replace the single <|image_pad|> with repeated padding like this for each image:

let mergeLength = config.mergeSize * config.mergeSize
let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()

Right, that is this part:

I think the sequence from the python side is roughly:

  1. add image tokens (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/prompt_utils.py)
  2. transformers / processing (expand image tokens)
  3. tokenize

One issue we have on the Swift side is that steps 1 and 3 occur in the same function in swift-transformers, and we don't have a hook for step 2.
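For illustration only, a rough sketch of the shape such a hook could take (the function names here are hypothetical, not the actual swift-transformers API):

// 1. Render the chat template to a prompt string (one image token per image).
let prompt = try renderChatTemplate(messages: messages)
// 2. Model-specific hook: expand each image token according to its image grid.
let expanded = expandImageTokens(prompt, frames: frames, mergeSize: config.mergeSize)
// 3. Tokenize the fully expanded prompt.
let tokens = tokenizer.encode(text: expanded)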

@DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from e9c7a02 to 8cb233b on January 19, 2025 at 14:18
@DePasqualeOrg force-pushed the vision-chat-templates branch 4 times, most recently from 8959c45 to 3e50263 on January 27, 2025 at 19:32
@davidkoski
Collaborator

@DePasqualeOrg it looks like the swift-transformers side (which includes Jinja) is ready to go and would solve some issues with text models.

Do you want to prepare a PR for picking that up (since it is mostly your work)? If you are busy I can get that ready.

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 28, 2025

I think #185 accomplishes that. Xcode is showing the latest patch versions of the packages when I open mlx-swift-examples. Or is there something I'm missing?

huggingface/swift-transformers#151 still needs to be merged before this PR, since it expands the type of a message from [String: String] to [String: Any].

@DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 031e47f to db97052 on January 28, 2025 at 09:10
@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 28, 2025

I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.

@davidkoski
Collaborator

I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.

Yeah, I noticed that too. At least the responses seemed correct per Google Translate :-)

@DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 0b746f4 to db883ff on January 30, 2025 at 21:16
@DePasqualeOrg marked this pull request as ready for review on January 30, 2025 at 21:17
@DePasqualeOrg
Contributor Author

This is now ready for review.

@DePasqualeOrg
Contributor Author

I now need to make some significant changes because the video pull request was merged before this one.

@DePasqualeOrg marked this pull request as draft on February 5, 2025 at 09:02
@davidkoski
Collaborator

@DePasqualeOrg the build issue was a duplicate entry in mlx-swift-examples.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved -- I removed that and pushed it to your branch. The conflicts remain, but it is back to building.

@DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 7163c6a to f1208fd on February 8, 2025 at 08:03
@DePasqualeOrg
Contributor Author

This now works for images and videos and is ready for review.

@DePasqualeOrg marked this pull request as ready for review on February 8, 2025 at 08:05
@davidkoski
Collaborator

I will check this out tomorrow morning (18 hours from now)!

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Feb 12, 2025

I tested this in Local Chat, since it has a UI that allows me to add images and videos in the same prompt, and I noticed that my previous solution was causing the app to crash due to an error like this: MLX error: Shapes (1200,1536) and (1,192,1536) cannot be broadcast. I made some changes that seem to support a mix of images and videos.

It's worth noting that the model seems to "see" multiple images/videos as a single image, so in my experience it doesn't differentiate between images in its responses, but treats them as a combined image.

@davidkoski
Collaborator

It's worth noting that the model seems to "see" multiple images/videos as a single image, so in my experience it doesn't differentiate between images in its responses, but treats them as a combined image.

I saw the same in Python: I gave it several images of my dog, and it described them as several dogs. I wonder if this is a problem in the model? As far as I can tell we are constructing the THW correctly, so it should work.

Well, I guess it is at least consistent :-)

}
+ videos.map { _ in
    ["type": "video"]
},
@davidkoski
Collaborator

I wonder if this should happen inside Qwen2VLProcessor? As it stands, llm-tool doesn't work because it doesn't have this augmentation.

In the python code this is specific to the model, but handled outside the model/processing code. I think it belongs with the UserInputProcessor as that is where all of these would come together.

@DePasqualeOrg
Contributor Author

Leaving the construction of the messages to the app would make this more flexible. I can imagine that in the future there might be different message formats. I'm not certain about this, but this approach is working well for me in my app at the moment.

@davidkoski
Collaborator

As it is, it means the llm-tool doesn't work for Qwen models -- it gets an error because it is missing the tokens.

On the python side this looks like:

    # Model to format mapping
    model_to_format = {
        # Models using message_list_with_image format
        "idefics2": "message_list_with_image",
        "idefics3": "message_list_with_image",
        "llava": "message_list_with_image",
        "llava_next": "message_list_with_image",
        "mllama": "message_list_with_image",
        # Models that can handle both image and video formats
        "qwen2_vl": (
            "message_video_with_text"
            if kwargs.get("video")
            else "message_list_with_image"
        ),
        "qwen2_5_vl": (
            "message_video_with_text"
            if kwargs.get("video")
            else "message_list_with_image"
        ),

and this matches the swift code:

    # Message format handlers
    def handle_list_with_image():
        content = [create_text_message(prompt)]
        if role == "user" and not skip_image_token:
            image_tokens = [{"type": "image"}] * num_images
            content = (
                image_tokens + content
                if model_name in ["pixtral", "idefics3"]
                else content + image_tokens
            )
        return {"role": role, "content": content}
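For reference, a sketch of the corresponding Swift-side augmentation from the diff above (variable names are illustrative, not the exact code in this PR):

    var content: [[String: Any]] = [["type": "text", "text": prompt]]
    // Append one placeholder entry per attached image and video,
    // mirroring handle_list_with_image above
    content += images.map { _ -> [String: Any] in ["type": "image"] }
    content += videos.map { _ -> [String: Any] in ["type": "video"] }
    let message: [String: Any] = ["role": role, "content": content]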

So the problems of leaving it to the app are two-fold:

  • each application/tool has to have a copy of this code
  • the code varies per model type and the app needs to have a table mapping to the right message structure

Perhaps we could have a way to mark the messages as already being processed (or the UserInputProcessor could inspect the messages and detect that), leaving it up to the app, but I am not sure what the app would do different than the generic processing required by the model type.
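A sketch, purely hypothetical, of what that detection could look like in a UserInputProcessor:

// Hypothetical check: if the app already supplied structured image/video
// entries in the message content, skip the automatic augmentation.
func alreadyAugmented(_ messages: [[String: Any]]) -> Bool {
    messages.contains { message in
        guard let content = message["content"] as? [[String: Any]] else { return false }
        return content.contains { entry in
            let type = entry["type"] as? String
            return type == "image" || type == "video"
        }
    }
}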

@davidkoski
Collaborator

I can imagine that in the future there might be different message formats.

Yes, for sure there are, but per the python code they vary by model type and somewhat by the presence of video vs images.

@DePasqualeOrg
Contributor Author

I think the issue might be multi-turn conversations. I don't think UserInput has a way to designate to which turn an image or video belongs, so perhaps this is best left to the app?

@davidkoski
Collaborator

Yes, I think that works ok -- we can get experience with it and take our time on the longer term approach. I suggest:

  • add this code to llm-tool so that it can still work with Qwen
  • add a comment to the code indicating that it may be model dependent (in case somebody adds a model where this doesn't work)

Then we can consider the approach at our leisure. Sound good?

@DePasqualeOrg
Contributor Author

As it is, apps already have to handle how the system message is treated in the construction of messages, since some models support a system role and others don't.

@davidkoski
Collaborator

I think there is a lot of value getting the chat template code merged sooner rather than perfecting everything around it :-)

@DePasqualeOrg
Contributor Author

Sounds great! I'll make the changes to llm-tool.

@DePasqualeOrg
Contributor Author

llm-tool now works.

@@ -585,6 +585,10 @@ private enum Vision {
/// This is meant to be used with ``Qwen2VL`` and is typically created by ``VLMModelFactory``.
public class Qwen2VLProcessor: UserInputProcessor {

    enum Qwen2VLProcessorError: Error {
        case framesIsNil
    }
@davidkoski
Collaborator

This is fine but perhaps should live here:

public enum VLMError: Error {
    case imageRequired
    case maskRequired

as it may be common to many models.

@DePasqualeOrg
Contributor Author

I checked, and in fact it's not being used, so I removed it. It must be left over from a previous iteration.

@davidkoski left a comment
Collaborator

Awesome! Thank you so much for your effort on this here and in swift-transformers and jinja, I know this took a lot of time and work to see it through!

@davidkoski merged commit c743526 into ml-explore:main on Feb 12, 2025
3 checks passed
@DePasqualeOrg
Contributor Author

Thanks for your help too!
