Extend `VisionAddOn` Pattern to Qwen2.5VL #167

Open

mattjcly opened this issue May 27, 2025 · 0 comments

mattjcly commented May 27, 2025

Currently, the only multi-modal models that have been migrated to the "unified" architecture are Gemma3 and Pixtral:

mlx-engine/mlx_engine/model_kit/model_kit.py

Lines 35 to 38 in ecc2cf4

    
    VISION_ADD_ON_MAP = {
        "gemma3": Gemma3VisionAddOn,
        "pixtral": PixtralVisionAddOn,
    }

Extending this pattern to Qwen2.5VL/Qwen2VL is desired.
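
As a rough illustration of what the extended registry might look like, here is a minimal, self-contained sketch. The stub classes stand in for the real add-ons, and the `"qwen2_5_vl"` and `"qwen2_vl"` keys, the reuse of one add-on class for both, and the `select_vision_add_on` helper are all assumptions made for illustration, not existing mlx-engine code.

```python
from typing import Dict, Optional, Type


class BaseVisionAddOn:
    """Hypothetical stand-in for the real vision add-on base class."""


class Gemma3VisionAddOn(BaseVisionAddOn):
    """Stub for the existing Gemma3 add-on."""


class PixtralVisionAddOn(BaseVisionAddOn):
    """Stub for the existing Pixtral add-on."""


class Qwen2_5_VLVisionAddOn(BaseVisionAddOn):
    """Stub for the add-on this issue proposes."""


VISION_ADD_ON_MAP: Dict[str, Type[BaseVisionAddOn]] = {
    "gemma3": Gemma3VisionAddOn,
    "pixtral": PixtralVisionAddOn,
    # Assumed keys: chosen to mirror the model_type strings in the model configs.
    "qwen2_5_vl": Qwen2_5_VLVisionAddOn,
    # Assumption: Qwen2VL could share the same add-on; it may need its own class.
    "qwen2_vl": Qwen2_5_VLVisionAddOn,
}


def select_vision_add_on(model_type: str) -> Optional[Type[BaseVisionAddOn]]:
    """Return the add-on class for a model type, or None if it is not migrated."""
    return VISION_ADD_ON_MAP.get(model_type)
```

Returning `None` for unknown model types leaves room for a fallback path for models that have not been migrated yet.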

Relevant mlx-vlm components:

Relevant mlx-lm components:

https://github.com/ml-explore/mlx-lm/blob/77edf17bc0bf7c9313e0b970490db86a4f64bee4/mlx_lm/models/qwen2.py

This will likely look like:

1. Ensure the Qwen2.5VL text model architecture is implemented correctly in mlx-lm, including MRoPE (see https://arxiv.org/abs/2502.13923 for details, and the in-progress mlx-vlm work in "Apply PR #319 fixes to Qwen 2.5VL position id", #349).
2. Implement Qwen2_5_VLVisionAddOn and wire it into ModelKit.
3. Ensure the Qwen2.5VL tests in mlx-engine still pass.
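
To make the MRoPE requirement more concrete, the sketch below computes three-axis (temporal, height, width) position ids for a sequence of text and image segments. It is a simplified illustration, not the mlx-lm or mlx-vlm implementation: the `mrope_position_ids` function and its segment format are hypothetical, image grids are assumed to be given after any spatial merging, and the time-aligned temporal ids that Qwen2.5VL uses for video are left out.

```python
from typing import List, Tuple, Union

# A segment is ("text", token_count) or ("image", (t, h, w)) -- hypothetical format.
Segment = Tuple[str, Union[int, Tuple[int, int, int]]]


def mrope_position_ids(
    segments: List[Segment],
) -> Tuple[List[int], List[int], List[int]]:
    """Return (temporal, height, width) position ids for each token."""
    t_ids: List[int] = []
    h_ids: List[int] = []
    w_ids: List[int] = []
    next_pos = 0  # first unused position id

    for kind, spec in segments:
        if kind == "text":
            # Text tokens use the same id on all three axes (plain 1D RoPE).
            for i in range(spec):
                t_ids.append(next_pos + i)
                h_ids.append(next_pos + i)
                w_ids.append(next_pos + i)
            next_pos += spec
        elif kind == "image":
            t, h, w = spec
            # Image tokens are indexed by their place in the (t, h, w) grid,
            # offset by the position where the image starts.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        t_ids.append(next_pos + ti)
                        h_ids.append(next_pos + hi)
                        w_ids.append(next_pos + wi)
            # Following text resumes after the largest id used by the image.
            next_pos += max(t, h, w)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")

    return t_ids, h_ids, w_ids
```

For example, `mrope_position_ids([("text", 2), ("image", (1, 2, 2)), ("text", 2)])` gives the two leading text tokens ids 0 and 1 on every axis, spreads the four image tokens over a 2x2 grid starting at 2, and resumes text at 4.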

mattjcly added the enhancement label
