Map Mistral-HF models back onto Mistral format on-the-fly #20471
Conversation
Note this probably breaks other Mistral-formatted Pixtrals. Real fix hopefully coming soon.
We still keep our model loading debug prints, but we successfully resolved the issues addressed by the samples of patch_conv, etc., so all they do now is spam the log. Remove them!
This reverts commit efee9cc2df2614595f147a1bcf3cfadd997ffae4.
Summary of Changes
Hello @sjuxax, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request is a significant step towards integrating Mistral-HF models into vLLM, allowing users to leverage Mistral-3.1 and 3.2 checkpoints, including quantized versions, with the existing Pixtral architecture. The core of the change involves dynamically remapping weight names and reversing specific architectural modifications made by Hugging Face during checkpoint loading, thereby expanding vLLM's model compatibility and enabling features like tool calling for these models.
Highlights
- Model Compatibility: This pull request enables vLLM to load and utilize Mistral-HF models, specifically Mistral-Small-3.1 and Mistral-Small-3.2, by adapting their structure to the existing Pixtral model implementation. This includes support for both image and text modalities.
- Weight Remapping and RoPE Reversal: I've introduced a comprehensive set of regex-based mappings to convert Hugging Face model weight names back to the original Pixtral format. Additionally, I've implemented logic to reverse the RoPE (Rotary Position Embeddings) modifications applied by Hugging Face for attention weights in the vision encoder, ensuring correct model loading.
- Quantized Checkpoint Support: The changes facilitate the use of quantized Mistral-HF checkpoints. While the current implementation for handling quantized weights is noted as a work-in-progress, it lays the groundwork for improved efficiency.
- Configuration Adjustments: I've adjusted the default value of `add_pre_mm_projector_layer_norm` to `True` in `VisionEncoderArgs` and explicitly set `quant_config` to `None` for multi-modal projector layers. These changes are necessary to successfully load and run the Mistral-HF models.
- Registry Update: The model registry has been updated to map `Mistral3ForConditionalGeneration` to the `pixtral` implementation, allowing these models to be recognized and loaded by vLLM.
Code Review
This pull request introduces a valuable feature to remap Mistral-HF models to the native Mistral format on-the-fly, enabling support for newer quantized models. The core logic involves remapping weight names and reversing RoPE permutations during model loading.
While the approach is sound, the PR is in a WIP state and includes several temporary workarounds and debug artifacts. The most critical issue is a breaking change to a default parameter that affects backward compatibility. Other points to address include improving performance by pre-compiling regular expressions, ensuring consistent logger usage, and removing leftover debug code. Addressing these points will be crucial for moving this PR out of WIP and making it ready for merging.
@@ -554,7 +619,7 @@ class VisionEncoderArgs:
     image_token_id: int
     adapter_bias: bool = True
     spatial_merge_size: int = 1
-    add_pre_mm_projector_layer_norm: bool = False
+    add_pre_mm_projector_layer_norm: bool = True
Changing the default value of `add_pre_mm_projector_layer_norm` to `True` is a breaking change for earlier Pixtral models. Revert this change and implement a mechanism to dynamically determine this value from the model's `config.json`.
Suggested change:
-    add_pre_mm_projector_layer_norm: bool = True
+    add_pre_mm_projector_layer_norm: bool = False
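A minimal sketch of what the reviewer's "determine it from `config.json`" suggestion could look like; this is not the PR's code, and the key name looked up here is an assumption:

```python
import json

def resolve_pre_mm_projector_layer_norm(config_path: str) -> bool:
    """Derive the flag from the checkpoint's config.json instead of hardcoding it.

    Falls back to False so earlier Pixtral checkpoints keep their old behavior.
    The key name checked here is an assumption for illustration.
    """
    with open(config_path) as f:
        cfg = json.load(f)
    vision_cfg = cfg.get("vision_config", cfg)
    return bool(vision_cfg.get("add_pre_mm_projector_layer_norm", False))
```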
@@ -120,13 +120,13 @@ def __init__(self,
         self.linear_1 = ColumnParallelLinear(vision_hidden_size,
                                              text_hidden_size,
                                              bias=multimodal_projector_bias,
-                                             quant_config=quant_config,
+                                             quant_config=None,
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Regarding:
> This already works or am I missing something? For example (with 32GB of VRAM): https://huggingface.co/mratsim/Devstral-Small-2505.w4a16-gptq
I've only tested Mistral-Small-3.1 and Mistral-Small-3.2. The code for remapping is currently within the Pixtral model, so I don't know if it'd work with one of the Mistral 3 models that doesn't have a vision encoder. Sorry if that was unclear!

In theory the remap should work fine for models that don't have vision, but we'd have to move it out of the Pixtral model and into the general Mistral3 model. Then we'd need to either register a custom mapping or update the code to detect when we're using Mistral-HF + the Mistral tokenizer and execute the corresponding codepath. I can take a crack at some of this in a while. I suggest you try it with a
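The detection idea mentioned above could look roughly like the following speculative sketch; the function name and arguments are illustrative assumptions, not existing vLLM APIs:

```python
def should_remap_from_mistral_hf(hf_config, tokenizer_mode: str) -> bool:
    """Use the Mistral-HF remap codepath only when an HF-format Mistral3
    checkpoint is paired with the Mistral tokenizer."""
    architectures = getattr(hf_config, "architectures", None) or []
    return ("Mistral3ForConditionalGeneration" in architectures
            and tokenizer_mode == "mistral")
```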
Purpose
This is a WIP PR to begin integrating the changes from my Mistral-3.1-rebase branch onto main. I've used it successfully to run quantized checkpoints of Mistral-Small-3.1 and Mistral-Small-3.2 with the Mistral tokenizer, enabling tool calling. We essentially take the transformers Mistral conversion script and invert its operations at checkpoint load: remap the weight names and reverse its RoPE modifications.
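For readers unfamiliar with that conversion, here is a minimal sketch of the inverse direction. It is not the PR's actual code: the name patterns are illustrative stand-ins for the entries in `MISTRAL3_REVERSE_MAPPING`, and `unpermute_rope` assumes the head-interleaving permutation that the transformers conversion scripts typically apply to the q/k projection weights.

```python
import re
import torch

# Illustrative reverse name mapping (compiled once); the real table in the PR
# is larger and its exact patterns may differ.
_REVERSE_NAME_MAP = [
    (re.compile(r"^language_model\.model\.layers\.(\d+)\.self_attn\.q_proj\."),
     r"layers.\1.attention.wq."),
    (re.compile(r"^vision_tower\."), "vision_encoder."),
]

def reverse_name(hf_name: str) -> str:
    """Map an HF-format weight name back toward the Mistral/Pixtral naming."""
    for pattern, repl in _REVERSE_NAME_MAP:
        hf_name = pattern.sub(repl, hf_name)
    return hf_name

def unpermute_rope(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Invert the interleaving permutation applied to q/k weights by the
    HF conversion script, restoring the original (dim1, dim2) layout."""
    return (w.view(n_heads, 2, dim1 // n_heads // 2, dim2)
             .transpose(1, 2)
             .reshape(dim1, dim2))
```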
It also requires some minor massaging of the `config.json` to successfully load, which you can see here: https://huggingface.co/jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym/commit/00a485337ebdba586119eacbcd6e54c3dff75c11

This allows one to use quantized Mistral-HF checkpoints with `--tokenizer-mode mistral` (not `--load-format mistral` or `--config-format mistral`) on startup. Both image and text modalities work great in my experience and are at least as good as running the HF checkpoints on main (but I haven't run any evals).

Things to improve to exit WIP status:
- `MISTRAL3_REVERSE_MAPPING`.
- `add_pre_mm_projector_layer_norm` default is swapped to `True` to work around a failure to read this value out of `config.json`. This breaks earlier Pixtrals.
- `QuantConfig` is explicitly set to `None` on `multi_modal_projector` instead of dynamically detecting whether it's quantized in that particular checkpoint or not (a possible detection approach is sketched below this list).
- `F.silu` or `nn.GELU` for the activation on the forward pass (you can see commits going back and forth on this as I was testing). This is another change `transformers` makes in its implementation.
- `yapf` formatting.
- `Mistral3ForConditionalGeneration` is overridden in the model registry with a repeated registry key. We may need to introduce a special architecture name or an additional flag read out of `config.json` to allow users to specify whether we should try to load the checkpoint as Mistral-HF or Pixtral.

Note: requires #20503 for successful tool calling with Mistral-Small-3.2.
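Regarding the `quant_config` item above, a speculative sketch of per-checkpoint detection; the suffix list and helper are assumptions and would need to match the actual quantization method in the checkpoint:

```python
def projector_quant_config(checkpoint_weight_names, quant_config):
    """Return quant_config only when the checkpoint actually contains quantized
    multi_modal_projector tensors (spotted via typical quantization suffixes);
    otherwise return None so the projector is built unquantized."""
    quant_suffixes = (".qweight", ".qzeros", ".scales", ".weight_scale")
    has_quantized_projector = any(
        name.startswith("multi_modal_projector.") and name.endswith(quant_suffixes)
        for name in checkpoint_weight_names)
    return quant_config if has_quantized_projector else None
```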
I'm creating this as a WIP PR instead of just an ongoing branch on my fork because @mgoin expressed interest in it at https://vllm-dev.slack.com/archives/C07QP347J4D/p1751465349646219?thread_ts=1751399869.254259&cid=C07QP347J4D.
Please feel free to hack away on this, make suggestions, etc. I'd love to see it upstreamed so I don't have to keep maintaining it separately.
Test Plan
Write some tests to validate behavior.
Test Result
N/A yet.
(Optional) Documentation Update
Should document the ability to run Mistral-HF quantized checkpoints with `--tokenizer-mode mistral`, and additionally, should document that running without it prevents tool calling.