Extend VisionAddOn Pattern to Qwen2.5VL #167

Open
mattjcly opened this issue May 27, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@mattjcly
Member

Currently, the only multi-modal models that have been migrated to the "unified" architecture are Gemma3 and Pixtral:

VISION_ADD_ON_MAP = {
    "gemma3": Gemma3VisionAddOn,
    "pixtral": PixtralVisionAddOn,
}

This pattern should be extended to Qwen2.5VL and Qwen2VL.
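As a rough illustration of what the extension could look like, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: the add-on classes are empty stand-ins rather than the real mlx-engine classes, the "qwen2_5_vl" key is assumed to mirror the model_type string in the Hugging Face config, and select_vision_add_on is a hypothetical helper, not an existing mlx-engine function.

```python
from typing import Dict, Optional, Type


class BaseVisionAddOn:
    """Empty stand-in for the real add-on base class (assumption)."""


class Gemma3VisionAddOn(BaseVisionAddOn):
    """Stand-in for the existing Gemma3 add-on."""


class PixtralVisionAddOn(BaseVisionAddOn):
    """Stand-in for the existing Pixtral add-on."""


class Qwen2_5_VLVisionAddOn(BaseVisionAddOn):
    """Hypothetical add-on proposed in this issue; not implemented yet."""


# Keys are assumed to match the `model_type` field of the model config.
VISION_ADD_ON_MAP: Dict[str, Type[BaseVisionAddOn]] = {
    "gemma3": Gemma3VisionAddOn,
    "pixtral": PixtralVisionAddOn,
    "qwen2_5_vl": Qwen2_5_VLVisionAddOn,  # proposed new entry
}


def select_vision_add_on(model_type: str) -> Optional[Type[BaseVisionAddOn]]:
    """Return the add-on class for a model type, or None if not migrated."""
    return VISION_ADD_ON_MAP.get(model_type)
```

Qwen2VL would presumably need its own entry, or could share this add-on if the vision towers turn out to be compatible; that decision is left open here.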

Relevant mlx-vlm components:

Relevant mlx-lm components:

This will likely look like:

  1. Ensure the Qwen2.5VL text model architecture is implemented correctly in mlx-lm, including MRoPE (see https://arxiv.org/abs/2502.13923 for details, and "Apply PR #319 fixes to Qwen 2.5VL position id" (#349) for the in-progress mlx-vlm work)
  2. Implement Qwen2_5_VLVisionAddOn and wire it into ModelKit
  3. Ensure Qwen2.5VL tests in mlx-engine still pass
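
For step 1, the part most likely to need care is the MRoPE position ids. The sketch below shows the general idea as I understand it from the Qwen2-VL family: text tokens use the same index on the temporal, height, and width axes, image tokens share one temporal index while height and width follow the patch grid, and text after the image resumes at the largest index used so far plus one. This is a simplified illustration, not the mlx-lm or mlx-vlm implementation. It assumes a single still image, takes the grid size after spatial merging, and ignores video temporal scaling. The function name and parameters are hypothetical.

```python
from typing import List, Tuple


def mrope_position_ids(
    prefix_len: int, grid_h: int, grid_w: int, suffix_len: int
) -> Tuple[List[int], List[int], List[int]]:
    """Build (temporal, height, width) position ids for text + one image + text."""
    t_ids: List[int] = []
    h_ids: List[int] = []
    w_ids: List[int] = []

    # Text before the image: all three axes share the same running index.
    for pos in range(prefix_len):
        t_ids.append(pos)
        h_ids.append(pos)
        w_ids.append(pos)

    # Image tokens in row-major order, offset by the prefix length.
    start = prefix_len
    for row in range(grid_h):
        for col in range(grid_w):
            t_ids.append(start)        # single frame: constant temporal index
            h_ids.append(start + row)  # height index follows the grid row
            w_ids.append(start + col)  # width index follows the grid column

    # Text after the image resumes at (largest index used so far) + 1.
    has_image = grid_h > 0 and grid_w > 0
    next_pos = start + max(grid_h, grid_w) if has_image else start
    for offset in range(suffix_len):
        t_ids.append(next_pos + offset)
        h_ids.append(next_pos + offset)
        w_ids.append(next_pos + offset)

    return t_ids, h_ids, w_ids
```

With no image (grid_h or grid_w set to 0) the three axes are identical, so the scheme reduces to ordinary 1D positions, which is a useful sanity check when comparing a text-only mlx-lm path against the multimodal one.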
@mattjcly mattjcly added the enhancement New feature or request label May 27, 2025