
Map Mistral-HF models back onto Mistral format on-the-fly #20471


Draft — sjuxax wants to merge 12 commits into main

Conversation

@sjuxax (Contributor) commented on Jul 4, 2025

Purpose

This is a WIP PR that begins integrating the changes from my Mistral-3.1-rebase branch into main. I've used it successfully to run quantized checkpoints of Mistral-Small-3.1 and Mistral-Small-3.2 with the Mistral tokenizer, which enables tool calling. We take the transformers Mistral conversion script and invert its operations at checkpoint load time: we remap the weight names and reverse its RoPE modifications.
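As a rough illustration of that inversion (a minimal sketch only, assuming the standard q/k permutation used by the transformers conversion scripts; the mapping entries and the helper names remap_name and unpermute_rope are placeholders, not this PR's actual code, whose table is MISTRAL3_REVERSE_MAPPING):

import re

import torch

# Illustrative reverse mapping (HF name -> original Mistral name); the real
# entries in MISTRAL3_REVERSE_MAPPING may differ.
EXAMPLE_REVERSE_MAPPING = {
    r"^language_model\.model\.layers\.(\d+)\.": r"layers.\1.",
    r"^vision_tower\.": "vision_encoder.",
    r"^multi_modal_projector\.": "vision_language_adapter.",
}

def remap_name(hf_name: str) -> str:
    """Map a Mistral-HF weight name back to the original Mistral name."""
    for pattern, repl in EXAMPLE_REVERSE_MAPPING.items():
        hf_name = re.sub(pattern, repl, hf_name)
    return hf_name

def unpermute_rope(w: torch.Tensor, n_heads: int, head_dim: int) -> torch.Tensor:
    """Invert the q/k permutation applied by the HF conversion script,
    restoring the interleaved RoPE layout the original weights use."""
    return (w.view(n_heads, 2, head_dim // 2, -1)
            .transpose(1, 2)
            .reshape(n_heads * head_dim, -1))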

Loading also requires some minor massaging of config.json, which you can see here: https://huggingface.co/jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym/commit/00a485337ebdba586119eacbcd6e54c3dff75c11

This allows quantized Mistral-HF checkpoints to be used with --tokenizer-mode mistral (without --load-format mistral or --config-format mistral) at startup. In my experience both image and text modalities work well and are at least as good as running the HF checkpoints on main, though I haven't run any evals.

Things to improve to exit WIP status:

  • Quantized weights are not loaded intelligently or dynamically; a regex at the end of MISTRAL3_REVERSE_MAPPING simply remaps common quantized weight names (see the sketch after this list).
  • The add_pre_mm_projector_layer_norm default is flipped to True to work around a failure to read this value from config.json. This breaks earlier Pixtral models.
  • QuantConfig is explicitly set to None on multi_modal_projector instead of dynamically detecting whether that module is quantized in a given checkpoint.
  • I'm not sure whether we should use F.silu or nn.GELU for the activation in the forward pass (the commits go back and forth on this as I was testing). This is another change transformers makes in its implementation.
  • No tests or documentation updates.
  • No sign-offs
  • No yapf formatting.
  • Mistral3ForConditionalGeneration is overridden in the model registry with a repeated registry key. We may need to introduce a special architecture name or an additional flag read out of config.json to allow users to specify whether we should try to load the checkpoint as Mistral-HF or Pixtral.
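As a rough sketch of the quantized-name handling mentioned in the first bullet (the suffix list and helper below are assumptions for illustration; the PR's actual patterns live at the end of MISTRAL3_REVERSE_MAPPING):

import re

# Assumed suffixes from common AWQ/GPTQ exports; the PR's real regex may differ.
QUANT_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx", "weight_scale")

# Tiny stand-in for the full reverse mapping sketched earlier.
BASE_MAPPING = {r"^multi_modal_projector\.": "vision_language_adapter."}

def remap_quantized_name(hf_name: str) -> str:
    """Strip a known quantization suffix, remap the base module path,
    then re-attach the suffix."""
    suffix = ""
    for s in QUANT_SUFFIXES:
        if hf_name.endswith("." + s):
            hf_name, suffix = hf_name[: -(len(s) + 1)], "." + s
            break
    for pattern, repl in BASE_MAPPING.items():
        hf_name = re.sub(pattern, repl, hf_name)
    return hf_name + suffix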

Note: requires #20503 for successful tool calling with Mistral-Small-3.2.

I'm creating this as a WIP PR rather than an ongoing branch on my fork because @mgoin expressed interest in it at https://vllm-dev.slack.com/archives/C07QP347J4D/p1751465349646219?thread_ts=1751399869.254259&cid=C07QP347J4D.

Please feel free to hack away on this, make suggestions, etc. I'd love to see it upstreamed so I don't have to keep maintaining it separately.

Test Plan

Write some tests to validate behavior.

Test Result

N/A yet.

(Optional) Documentation Update

We should document the ability to run quantized Mistral-HF checkpoints with --tokenizer-mode mistral, and also note that omitting that flag prevents tool calling.

github-actions bot commented on Jul 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of it by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @sjuxax, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a significant step towards integrating Mistral-HF models into vLLM, allowing users to leverage Mistral-3.1 and 3.2 checkpoints, including quantized versions, with the existing Pixtral architecture. The core of the change involves dynamically remapping weight names and reversing specific architectural modifications made by Hugging Face during checkpoint loading, thereby expanding vLLM's model compatibility and enabling features like tool calling for these models.

Highlights

  • Model Compatibility: This pull request enables vLLM to load and utilize Mistral-HF models, specifically Mistral-Small-3.1 and Mistral-Small-3.2, by adapting their structure to the existing Pixtral model implementation. This includes support for both image and text modalities.
  • Weight Remapping and RoPE Reversal: I've introduced a comprehensive set of regex-based mappings to convert Hugging Face model weight names back to the original Pixtral format. Additionally, I've implemented logic to reverse the RoPE (Rotary Position Embeddings) modifications applied by Hugging Face for attention weights in the vision encoder, ensuring correct model loading.
  • Quantized Checkpoint Support: The changes facilitate the use of quantized Mistral-HF checkpoints. While the current implementation for handling quantized weights is noted as a work-in-progress, it lays the groundwork for improved efficiency.
  • Configuration Adjustments: I've adjusted the default value of add_pre_mm_projector_layer_norm to True in VisionEncoderArgs and explicitly set quant_config to None for multi-modal projector layers. These changes are necessary to successfully load and run the Mistral-HF models.
  • Registry Update: The model registry has been updated to map Mistral3ForConditionalGeneration to the pixtral implementation, allowing these models to be recognized and loaded by vLLM.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a valuable feature to remap Mistral-HF models to the native Mistral format on-the-fly, enabling support for newer quantized models. The core logic involves remapping weight names and reversing RoPE permutations during model loading.

While the approach is sound, the PR is in a WIP state and includes several temporary workarounds and debug artifacts. The most critical issue is a breaking change to a default parameter that affects backward compatibility. Other points to address include improving performance by pre-compiling regular expressions, ensuring consistent logger usage, and removing leftover debug code. Addressing these points will be crucial for moving this PR out of WIP and making it ready for merging.

@@ -554,7 +619,7 @@ class VisionEncoderArgs:
     image_token_id: int
     adapter_bias: bool = True
     spatial_merge_size: int = 1
-    add_pre_mm_projector_layer_norm: bool = False
+    add_pre_mm_projector_layer_norm: bool = True

critical

Changing the default value of add_pre_mm_projector_layer_norm to True is a breaking change for earlier Pixtral models. Revert this change and implement a mechanism to dynamically determine this value from the model's config.json.

Suggested change
-    add_pre_mm_projector_layer_norm: bool = True
+    add_pre_mm_projector_layer_norm: bool = False
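One possible direction, shown only as an illustrative sketch (not this PR's implementation): since the flag is absent from older config.json files, it could be inferred from the declared architecture so that earlier Pixtral checkpoints keep the old default.

from transformers import PretrainedConfig

def infer_pre_mm_projector_layer_norm(hf_config: PretrainedConfig) -> bool:
    # Heuristic for illustration: Mistral-3.x HF checkpoints add a norm before
    # the multi-modal projector, while earlier Pixtral checkpoints do not.
    architectures = getattr(hf_config, "architectures", None) or []
    return "Mistral3ForConditionalGeneration" in architectures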

@@ -120,13 +120,13 @@ def __init__(self,
         self.linear_1 = ColumnParallelLinear(vision_hidden_size,
                                              text_hidden_size,
                                              bias=multimodal_projector_bias,
-                                             quant_config=quant_config,
+                                             quant_config=None,

medium

The quant_config is hardcoded to None. Before finalizing, replace this with a dynamic check to ensure correctness for checkpoints that may have a quantized multi_modal_projector.
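A sketch of one possible dynamic check, assuming the quantization config exposes a list of excluded modules (as AWQ's modules_to_not_convert or compressed-tensors' ignore list commonly do); the helper name and attribute lookups are illustrative, not vLLM's actual API.

from typing import Optional

def projector_quant_config(quant_config: Optional[object],
                           prefix: str = "multi_modal_projector"):
    """Return quant_config only if the projector is quantized in this checkpoint."""
    if quant_config is None:
        return None
    excluded = (getattr(quant_config, "modules_to_not_convert", None)
                or getattr(quant_config, "ignore", None) or [])
    if any(prefix in module for module in excluded):
        return None  # projector kept in full precision
    return quant_config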

sjuxax and others added 3 commits July 4, 2025 00:16
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@mratsim commented on Jul 4, 2025

Regarding:

This allows one to use quantized Mistral-HF checkpoints with --tokenizer-mode mistral (not --load-format mistral or --config-format mistral) on startup. Both image and text modalities work great in my experience and are at least as good as running the HF checkpoints on main (but I haven't run any evals).

This already works, or am I missing something? For example (with 32 GB of VRAM): https://huggingface.co/mratsim/Devstral-Small-2505.w4a16-gptq

export MODEL="mratsim/Devstral-Small-2505.w4a16-gptq"
vllm serve "${MODEL}" \
  --served-model-name devstral-32b \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-model-len 94000 \
  --max_num_seqs 256 \
  --tokenizer_mode mistral \
  --generation-config "${MODEL}" \
  --enable-auto-tool-choice --tool-call-parser mistral

@sjuxax (Contributor, Author) commented on Jul 4, 2025

I've only tested Mistral-Small-3.1 and Mistral-Small-3.2. The code for remapping is currently within the Pixtral model, so I don't know if it'd work with one of the Mistral 3 models that doesn't have a vision encoder. Sorry if that was unclear!

In theory the remap should work fine for models that don't have vision, but we'd have to move it out of the Pixtral model and into the general Mistral3 model. Then we'd need to either register a custom mapping or update the code to detect when we're using Mistral-HF+Mistral tokenizer and execute the corresponding codepath. I can take a crack at some of this in a while.

I suggest you try it with a Mistral-Small-3.2 model, which should work. Note you may also need to merge #19425, and you'll need the config changes in the HuggingFace commit linked in the OP. Or you can just download and use https://huggingface.co/jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym.

@sjuxax (Contributor, Author) commented on Jul 4, 2025

Where #19425 is mentioned above, scratch that; see the follow-up in #20503 instead. I've updated the OP.

@mergify mergify bot added the new-model label (Requests to new models) on Jul 10, 2025