Use chat templates for vision models #173
Conversation
Force-pushed from 5c8ccfa to 3ac6296
@davidkoski, I made some changes, and it seems to work in VLMEval. Do you have any thoughts on this?
Force-pushed from 3ac6296 to 4547cf1
I think
The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:
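For reference, the rendered prompt for Qwen2-VL looks roughly like this (a sketch; the exact system text and placeholders depend on the template version and the message content):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>
<|im_start|>assistant
```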
However, in order for the model to work, it looks like we need to replace the single `<|image_pad|>` token with the full run of padding tokens:

```swift
let mergeLength = config.mergeSize * config.mergeSize
let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()
```
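For example, assuming `config.mergeSize == 2` (so `mergeLength == 4`), an image with patch grid `THW(1, 28, 28)` has `thw.product == 784`, and the single placeholder expands to `784 / 4 == 196` `<|image_pad|>` tokens.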
Force-pushed from 347ccc5 to 806c7f2
I now have something that works, although it still needs to take into account the case where multiple images are included.
Force-pushed from 0f68fd2 to ed03ae5
@davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type `THW` `frames`?
It sounds OK to me, though they aren't the frames themselves but the positions of the frames in one of the arrays (maybe not in the final array). I think try
Right, that is this part. I think the sequence from the Python side is roughly:

1. apply the chat template to render the prompt, with a single placeholder token per image
2. expand each single placeholder into the full run of `<|image_pad|>` tokens for that image
3. tokenize the expanded prompt

One issue we have on the Swift side is that step 1 and step 3 occur in the same function in swift-transformers, and we don't have a hook for step 2.
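A hook for step 2 could look roughly like this (a sketch, not the PR's actual API; `THW` here is a minimal stand-in for the PR's grid type):

```swift
import Foundation

/// Minimal stand-in for the PR's THW grid type (temporal, height, width).
struct THW {
    let t: Int, h: Int, w: Int
    var product: Int { t * h * w }
}

/// Sketch of the missing "step 2": expand each single `<|image_pad|>`
/// placeholder in the rendered prompt into the run of padding tokens
/// required by the corresponding image grid.
func expandImagePads(in prompt: String, thws: [THW], mergeSize: Int) -> String {
    let mergeLength = mergeSize * mergeSize
    let pad = "<|image_pad|>"
    // One segment boundary per placeholder occurrence.
    let segments = prompt.components(separatedBy: pad)
    var result = segments[0]
    for (i, segment) in segments.dropFirst().enumerated() {
        // Fall back to a single token if there are more placeholders than grids.
        let count = i < thws.count ? thws[i].product / mergeLength : 1
        result += String(repeating: pad, count: count) + segment
    }
    return result
}
```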
Force-pushed from e9c7a02 to 8cb233b
Force-pushed from 8959c45 to 3e50263
@DePasqualeOrg it looks like the swift-transformers side (which includes Jinja) is ready to go and would solve some issues with text models. Do you want to prepare a PR for picking that up (since it is mostly your work)? If you are busy I can get that ready.
I think #185 accomplishes that. Xcode is showing the latest patch versions of the packages when I open mlx-swift-examples. Or is there something I'm missing? huggingface/swift-transformers#151 still needs to be merged before this PR, since it expands the type of a message from `[String: String]` to `[String: Any]`.
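For illustration, the widening matters because vision messages carry structured content rather than a plain string (a sketch, not the exact swift-transformers types):

```swift
// Text-only messages fit the old type:
let textMessage: [String: String] = ["role": "user", "content": "Hello"]

// Vision messages need an array of content parts, which requires Any:
let visionMessage: [String: Any] = [
    "role": "user",
    "content": [
        ["type": "image"],
        ["type": "text", "text": "Describe this image"],
    ],
]
```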
Force-pushed from 031e47f to db97052
I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.
Yeah, I noticed that too. At least the responses seemed correct per Google Translate :-)
Force-pushed from db97052 to 332551a
Force-pushed from 0b746f4 to db883ff
This is now ready for review.
Force-pushed from db883ff to 408e7a8
I now need to make some significant changes because the video pull request was merged before this one.
@DePasqualeOrg the build issue was a duplicate entry in mlx-swift-examples.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved. I removed that and pushed it to your branch. The conflicts remain, but it is back to building.
Force-pushed from 7163c6a to f1208fd
This now works for images and videos and is ready for review.
Force-pushed from f1208fd to c8d8d0a
Force-pushed from c8d8d0a to 1cd252b
I will check this out tomorrow morning (18 hours from now)!
I tested this in Local Chat, since it has a UI that allows me to add images and videos in the same prompt, and I noticed that my previous solution was causing the app to crash with an error. It's worth noting that the model seems to "see" multiple images/videos as a single image, so in my experience it doesn't differentiate between images in its responses, but treats them as a combined image.
I saw the same in Python, where I gave it several images of my dog and it described them as several dogs. I wonder if this is a problem in the model? As far as I can tell we are constructing the THW correctly, so it should work. Well, I guess it is at least consistent :-)
```swift
}
+ videos.map { _ in
    ["type": "video"]
},
```
I wonder if this should happen inside `Qwen2VLProcessor`? As it stands, `llm-tool` doesn't work because it doesn't have this augmentation.

In the Python code this is specific to the model, but handled outside the model/processing code. I think it belongs with the `UserInputProcessor`, as that is where all of these would come together.
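For reference, the augmentation in the diff above amounts to something like this; `images`, `videos`, and `prompt` are hypothetical stand-ins for the values in the surrounding code:

```swift
// Hypothetical attachments for illustration.
let images = ["dog1.png", "dog2.png"]
let videos: [String] = []
let prompt = "Compare these images."

// Prepend one placeholder entry per attached image and video so the chat
// template emits the model's media tokens for each.
let mediaContent: [[String: Any]] =
    images.map { _ -> [String: Any] in ["type": "image"] }
        + videos.map { _ -> [String: Any] in ["type": "video"] }

let message: [String: Any] = [
    "role": "user",
    "content": mediaContent + [["type": "text", "text": prompt]],
]
```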
Leaving the construction of the messages to the app would make this more flexible. I can imagine that in the future there might be different message formats. I'm not certain about this, but this approach is working well for me in my app at the moment.
As it is, it means `llm-tool` doesn't work for Qwen models: it gets an error because it is missing the tokens.
On the Python side this looks like:
```python
# Model to format mapping
model_to_format = {
    # Models using message_list_with_image format
    "idefics2": "message_list_with_image",
    "idefics3": "message_list_with_image",
    "llava": "message_list_with_image",
    "llava_next": "message_list_with_image",
    "mllama": "message_list_with_image",
    # Models that can handle both image and video formats
    "qwen2_vl": (
        "message_video_with_text"
        if kwargs.get("video")
        else "message_list_with_image"
    ),
    "qwen2_5_vl": (
        "message_video_with_text"
        if kwargs.get("video")
        else "message_list_with_image"
    ),
```
and this matches the Swift code:
```python
# Message format handlers
def handle_list_with_image():
    content = [create_text_message(prompt)]
    if role == "user" and not skip_image_token:
        image_tokens = [{"type": "image"}] * num_images
        content = (
            image_tokens + content
            if model_name in ["pixtral", "idefics3"]
            else content + image_tokens
        )
    return {"role": role, "content": content}
```
So the problems of leaving it to the app are two-fold:
- each application/tool has to have a copy of this code
- the code varies per model type and the app needs to have a table mapping to the right message structure
Perhaps we could have a way to mark the messages as already being processed (or the UserInputProcessor could inspect the messages and detect that), leaving it up to the app, but I am not sure what the app would do different than the generic processing required by the model type.
> I can imagine that in the future there might be different message formats.
Yes, for sure there are, but per the python code they vary by model type and somewhat by the presence of video vs images.
I think the issue might be multi-turn conversations. I don't think `UserInput` has a way to designate which turn an image or video belongs to, so perhaps this is best left to the app?
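A sketch of that concern, using a hypothetical message shape: the app can attach media to a specific turn, which a flat list of images on `UserInput` cannot express:

```swift
let messages: [[String: Any]] = [
    [
        "role": "user",
        "content": [
            ["type": "image"],
            ["type": "text", "text": "What is in this image?"],
        ],
    ],
    ["role": "assistant", "content": "A dog playing in the snow."],
    // The follow-up turn has no image; only the first turn does.
    ["role": "user", "content": "What breed is it?"],
]
```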
Yes, I think that works OK; we can get experience with it and take our time on the longer-term approach. I suggest:
- add this code to `llm-tool` so that it can still work with Qwen
- add a comment to the code indicating that it may be model dependent (in case somebody adds a model where this doesn't work)
Then we can consider the approach at our leisure. Sound good?
As it is, apps already have to handle how the system message is treated in the construction of messages, since some models support a system role and others don't.
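A sketch of that kind of handling, with a hypothetical `supportsSystemRole` flag: apps fold the system text into the first user turn when the model has no system role:

```swift
/// Adapt messages for models without a system role (illustrative only).
func adapt(messages: [[String: Any]], supportsSystemRole: Bool) -> [[String: Any]] {
    guard !supportsSystemRole,
        let first = messages.first,
        first["role"] as? String == "system",
        let system = first["content"] as? String
    else { return messages }
    var rest = Array(messages.dropFirst())
    // Fold the system text into the first user turn instead.
    if var firstUser = rest.first, firstUser["role"] as? String == "user",
        let text = firstUser["content"] as? String
    {
        firstUser["content"] = system + "\n\n" + text
        rest[0] = firstUser
    }
    return rest
}
```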
I think there is a lot of value in getting the chat template code merged sooner rather than perfecting everything around it :-)
Sounds great! I'll make the changes to llm-tool.
`llm-tool` now works.
```diff
@@ -585,6 +585,10 @@ private enum Vision {
 /// This is meant to be used with ``Qwen2VL`` and is typically created by ``VLMModelFactory``.
 public class Qwen2VLProcessor: UserInputProcessor {
 
+    enum Qwen2VLProcessorError: Error {
+        case framesIsNil
+    }
+
```
This is fine but perhaps should live here:

```swift
public enum VLMError: Error {
    case imageRequired
    case maskRequired
```

as it may be common to many models.
I checked, and in fact it's not being used, so I removed it. It must be left over from a previous iteration.
Awesome! Thank you so much for your effort on this here and in swift-transformers and jinja, I know this took a lot of time and work to see it through!
Thanks for your help too!
This is a test of my PR to Swift Jinja, which should enable the use of chat templates with vision language models that provide one. I've started to set things up, but I need some pointers on how to integrate the image into the messages.