
Map Mistral-HF models back onto Mistral format on-the-fly #20471


Draft — sjuxax wants to merge 12 commits into main

Conversation

@sjuxax (Contributor) commented on Jul 4, 2025

Purpose

This is a WIP PR that begins integrating the changes from my Mistral-3.1-rebase branch into main. I've used it successfully to run quantized checkpoints of Mistral-Small-3.1 and Mistral-Small-3.2 with the Mistral tokenizer, which enables tool calling. We take the transformers Mistral conversion script and invert its operations at checkpoint load time: we remap the weight names and reverse its RoPE modifications.
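As a rough illustration of that inversion (a minimal sketch only, assuming the standard q/k permutation used by the transformers conversion scripts; the mapping entries and the helper names remap_name and unpermute_rope are placeholders, not this PR's actual code, whose table is MISTRAL3_REVERSE_MAPPING):

import re

import torch

# Illustrative reverse mapping (HF name -> original Mistral name); the real
# entries in MISTRAL3_REVERSE_MAPPING may differ.
EXAMPLE_REVERSE_MAPPING = {
    r"^language_model\.model\.layers\.(\d+)\.": r"layers.\1.",
    r"^vision_tower\.": "vision_encoder.",
    r"^multi_modal_projector\.": "vision_language_adapter.",
}

def remap_name(hf_name: str) -> str:
    """Map a Mistral-HF weight name back to the original Mistral name."""
    for pattern, repl in EXAMPLE_REVERSE_MAPPING.items():
        hf_name = re.sub(pattern, repl, hf_name)
    return hf_name

def unpermute_rope(w: torch.Tensor, n_heads: int, head_dim: int) -> torch.Tensor:
    """Invert the q/k permutation applied by the HF conversion script,
    restoring the interleaved RoPE layout the original weights use."""
    return (w.view(n_heads, 2, head_dim // 2, -1)
            .transpose(1, 2)
            .reshape(n_heads * head_dim, -1))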

Loading also requires some minor massaging of config.json, which you can see here: https://huggingface.co/jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym/commit/00a485337ebdba586119eacbcd6e54c3dff75c11

This allows quantized Mistral-HF checkpoints to be used with --tokenizer-mode mistral (without --load-format mistral or --config-format mistral) at startup. In my experience both image and text modalities work well and are at least as good as running the HF checkpoints on main, though I haven't run any evals.

Things to improve to exit WIP status:

  • Quantized weights are not loaded intelligently or dynamically; a regex at the end of MISTRAL3_REVERSE_MAPPING simply remaps common quantized weight names (see the sketch after this list).
  • The add_pre_mm_projector_layer_norm default is flipped to True to work around a failure to read this value from config.json. This breaks earlier Pixtral models.
  • QuantConfig is explicitly set to None on multi_modal_projector instead of dynamically detecting whether that module is quantized in a given checkpoint.
  • I'm not sure whether we should use F.silu or nn.GELU for the activation in the forward pass (the commits go back and forth on this as I was testing). This is another change transformers makes in its implementation.
  • No tests or documentation updates.
  • No sign-offs
  • No yapf formatting.
  • Mistral3ForConditionalGeneration is overridden in the model registry with a repeated registry key. We may need to introduce a special architecture name or an additional flag read out of config.json to allow users to specify whether we should try to load the checkpoint as Mistral-HF or Pixtral.
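As a rough sketch of the quantized-name handling mentioned in the first bullet (the suffix list and helper below are assumptions for illustration; the PR's actual patterns live at the end of MISTRAL3_REVERSE_MAPPING):

import re

# Assumed suffixes from common AWQ/GPTQ exports; the PR's real regex may differ.
QUANT_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx", "weight_scale")

# Tiny stand-in for the full reverse mapping sketched earlier.
BASE_MAPPING = {r"^multi_modal_projector\.": "vision_language_adapter."}

def remap_quantized_name(hf_name: str) -> str:
    """Strip a known quantization suffix, remap the base module path,
    then re-attach the suffix."""
    suffix = ""
    for s in QUANT_SUFFIXES:
        if hf_name.endswith("." + s):
            hf_name, suffix = hf_name[: -(len(s) + 1)], "." + s
            break
    for pattern, repl in BASE_MAPPING.items():
        hf_name = re.sub(pattern, repl, hf_name)
    return hf_name + suffix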

Note: requires #20503 for successful tool calling with Mistral-Small-3.2.

I'm creating this as a WIP PR rather than an ongoing branch on my fork because @mgoin expressed interest in it at https://vllm-dev.slack.com/archives/C07QP347J4D/p1751465349646219?thread_ts=1751399869.254259&cid=C07QP347J4D.

Please feel free to hack away on this, make suggestions, etc. I'd love to see it upstreamed so I don't have to keep maintaining it separately.

Test Plan

Write some tests to validate behavior.

Test Result

N/A yet.

(Optional) Documentation Update

We should document the ability to run quantized Mistral-HF checkpoints with --tokenizer-mode mistral, and also note that omitting that flag prevents tool calling.

github-actions bot commented on Jul 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of it by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @sjuxax, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a significant step towards integrating Mistral-HF models into vLLM, allowing users to leverage Mistral-3.1 and 3.2 checkpoints, including quantized versions, with the existing Pixtral architecture. The core of the change involves dynamically remapping weight names and reversing specific architectural modifications made by Hugging Face during checkpoint loading, thereby expanding vLLM's model compatibility and enabling features like tool calling for these models.

Highlights

  • Model Compatibility: This pull request enables vLLM to load and utilize Mistral-HF models, specifically Mistral-Small-3.1 and Mistral-Small-3.2, by adapting their structure to the existing Pixtral model implementation. This includes support for both image and text modalities.
  • Weight Remapping and RoPE Reversal: I've introduced a comprehensive set of regex-based mappings to convert Hugging Face model weight names back to the original Pixtral format. Additionally, I've implemented logic to reverse the RoPE (Rotary Position Embeddings) modifications applied by Hugging Face for attention weights in the vision encoder, ensuring correct model loading.
  • Quantized Checkpoint Support: The changes facilitate the use of quantized Mistral-HF checkpoints. While the current implementation for handling quantized weights is noted as a work-in-progress, it lays the groundwork for improved efficiency.
  • Configuration Adjustments: I've adjusted the default value of add_pre_mm_projector_layer_norm to True in VisionEncoderArgs and explicitly set quant_config to None for multi-modal projector layers. These changes are necessary to successfully load and run the Mistral-HF models.
  • Registry Update: The model registry has been updated to map Mistral3ForConditionalGeneration to the pixtral implementation, allowing these models to be recognized and loaded by vLLM.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a valuable feature to remap Mistral-HF models to the native Mistral format on-the-fly, enabling support for newer quantized models. The core logic involves remapping weight names and reversing RoPE permutations during model loading.

While the approach is sound, the PR is in a WIP state and includes several temporary workarounds and debug artifacts. The most critical issue is a breaking change to a default parameter that affects backward compatibility. Other points to address include improving performance by pre-compiling regular expressions, ensuring consistent logger usage, and removing leftover debug code. Addressing these points will be crucial for moving this PR out of WIP and making it ready for merging.

@@ -554,7 +619,7 @@ class VisionEncoderArgs:
     image_token_id: int
     adapter_bias: bool = True
     spatial_merge_size: int = 1
-    add_pre_mm_projector_layer_norm: bool = False
+    add_pre_mm_projector_layer_norm: bool = True

critical

Changing the default value of add_pre_mm_projector_layer_norm to True is a breaking change for earlier Pixtral models. Revert this change and implement a mechanism to dynamically determine this value from the model's config.json.

Suggested change
-    add_pre_mm_projector_layer_norm: bool = True
+    add_pre_mm_projector_layer_norm: bool = False
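One possible direction, shown only as an illustrative sketch (not this PR's implementation): since the flag is absent from older config.json files, it could be inferred from the declared architecture so that earlier Pixtral checkpoints keep the old default.

from transformers import PretrainedConfig

def infer_pre_mm_projector_layer_norm(hf_config: PretrainedConfig) -> bool:
    # Heuristic for illustration: Mistral-3.x HF checkpoints add a norm before
    # the multi-modal projector, while earlier Pixtral checkpoints do not.
    architectures = getattr(hf_config, "architectures", None) or []
    return "Mistral3ForConditionalGeneration" in architectures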

@@ -120,13 +120,13 @@ def __init__(self,
         self.linear_1 = ColumnParallelLinear(vision_hidden_size,
                                              text_hidden_size,
                                              bias=multimodal_projector_bias,
-                                             quant_config=quant_config,
+                                             quant_config=None,

medium

The quant_config is hardcoded to None. Before finalizing, replace this with a dynamic check to ensure correctness for checkpoints that may have a quantized multi_modal_projector.
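A sketch of one possible dynamic check, assuming the quantization config exposes a list of excluded modules (as AWQ's modules_to_not_convert or compressed-tensors' ignore list commonly do); the helper name and attribute lookups are illustrative, not vLLM's actual API.

from typing import Optional

def projector_quant_config(quant_config: Optional[object],
                           prefix: str = "multi_modal_projector"):
    """Return quant_config only if the projector is quantized in this checkpoint."""
    if quant_config is None:
        return None
    excluded = (getattr(quant_config, "modules_to_not_convert", None)
                or getattr(quant_config, "ignore", None) or [])
    if any(prefix in module for module in excluded):
        return None  # projector kept in full precision
    return quant_config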

sjuxax and others added 3 commits July 4, 2025 00:16
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@mratsim commented on Jul 4, 2025

Regarding:

This allows one to use quantized Mistral-HF checkpoints with --tokenizer-mode mistral (not --load-format mistral or --config-format mistral) on startup. Both image and text modalities work great in my experience and are at least as good as running the HF checkpoints on main (but I haven't run any evals).

This already works, or am I missing something? For example (with 32 GB of VRAM): https://huggingface.co/mratsim/Devstral-Small-2505.w4a16-gptq

export MODEL="mratsim/Devstral-Small-2505.w4a16-gptq"
vllm serve "${MODEL}" \
  --served-model-name devstral-32b \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-model-len 94000 \
  --max_num_seqs 256 \
  --tokenizer_mode mistral \
  --generation-config "${MODEL}" \
  --enable-auto-tool-choice --tool-call-parser mistral

@sjuxax (Contributor, Author) commented on Jul 4, 2025

I've only tested Mistral-Small-3.1 and Mistral-Small-3.2. The code for remapping is currently within the Pixtral model, so I don't know if it'd work with one of the Mistral 3 models that doesn't have a vision encoder. Sorry if that was unclear!

In theory the remap should work fine for models that don't have vision, but we'd have to move it out of the Pixtral model and into the general Mistral3 model. Then we'd need to either register a custom mapping or update the code to detect when we're using Mistral-HF+Mistral tokenizer and execute the corresponding codepath. I can take a crack at some of this in a while.

I suggest you try it with a Mistral-Small-3.2 model, which should work. Note you may also need to merge #19425, and you'll need the config changes in the HuggingFace commit linked in the OP. Or you can just download and use https://huggingface.co/jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym.

@sjuxax (Contributor, Author) commented on Jul 4, 2025

Where #19425 is mentioned above, scratch that; see the follow-up in #20503 instead. I've updated the OP.

@mergify mergify bot added the new-model label (Requests to new models) on Jul 10, 2025