[Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading #20682
Conversation
Summary of Changes
Hello @yurhett, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request provides a critical bug fix for the Qwen3 reranker model, specifically addressing a tensor parallel dimension mismatch that occurred during weight loading when using multiple GPUs. By introducing tensor-parallel awareness and appropriate weight sharding logic into the relevant loading functions, the PR ensures that the model can correctly load and operate in a distributed environment, resolving the previously encountered RuntimeError without affecting single-GPU performance or existing APIs.
Highlights
- Tensor Parallelism Bugfix: Resolved a tensor parallel dimension mismatch error affecting the Qwen3-Reranker-4B model when tensor_parallel_size > 1, which previously prevented the model from loading correctly.
- Tensor-Parallel Aware Weight Loading: Modified two sequence classification weight loading functions (load_weights_using_from_2_way_softmax and load_weights_no_post_processing) to be tensor-parallel aware, ensuring correct weight distribution across GPUs.
- Weight Sharding Logic: Implemented specific sharding logic for both single weight vectors (sharded along the hidden dimension) and weight matrices (sharded along the last/hidden dimension) based on the current tensor parallel rank and size; see the sketch after this list.
- Robustness and Compatibility: Added assertions to verify that hidden dimensions are divisible by the tensor parallel size for safe sharding. This fix has no impact on single-GPU usage (tp_size=1) and introduces no API changes.
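As a rough illustration of that sharding (a minimal sketch with assumed names and shapes, not the actual vLLM code), each tensor-parallel rank keeps only its slice of the hidden dimension:

```python
# Minimal sketch of hidden-dimension sharding for the classification head;
# function and variable names here are illustrative, not the actual vLLM code.
import torch


def shard_along_hidden_dim(weight: torch.Tensor, tp_rank: int,
                           tp_size: int) -> torch.Tensor:
    """Return the slice of `weight` owned by the given tensor-parallel rank.

    The hidden dimension is assumed to be the last dimension, both for a
    1-D score vector [hidden_size] and a 2-D matrix [num_labels, hidden_size].
    """
    hidden_size = weight.shape[-1]
    assert hidden_size % tp_size == 0, (
        f"Hidden size {hidden_size} must be divisible by "
        f"tensor parallel size {tp_size}")
    shard_size = hidden_size // tp_size
    start = tp_rank * shard_size
    return weight[..., start:start + shard_size]


# Example: hidden_size=8, tp_size=2 -> each rank keeps 4 of the 8 columns.
full = torch.arange(16, dtype=torch.float32).reshape(2, 8)
print(shard_along_hidden_dim(full, tp_rank=0, tp_size=2).shape)  # torch.Size([2, 4])
```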
Code Review
This pull request fixes a tensor parallel dimension mismatch issue in Qwen3-Reranker-4B model weight loading. The changes include tensor parallel rank and size detection, weight sharding logic, and assertions for safety. I suggested improving the error messages for better debugging.
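For reference, the rank/size detection mentioned above typically looks like the following sketch; the import path mirrors vLLM's distributed utilities but should be treated as an assumption here rather than a confirmed excerpt of the patch:

```python
# Sketch of tensor-parallel rank/size detection; treat the import path as an
# assumption, not a verified excerpt of the change under review.
from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)

tp_rank = get_tensor_model_parallel_rank()
tp_size = get_tensor_model_parallel_world_size()
# Each rank then keeps only its hidden-dimension shard of the score weights.
```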
assert weight.shape[0] % tp_size == 0, (
    f"Hidden size {weight.shape[0]} must be divisible by tensor parallel size {tp_size}")
Consider raising a more descriptive error message that includes the actual hidden size and tensor parallel size values for easier debugging.
assert weight.shape[0] % tp_size == 0, (
    f"Hidden size {weight.shape[0]} must be divisible by tensor parallel size {tp_size}. "
    f"Got hidden_size={weight.shape[0]} and tp_size={tp_size}")
assert score_weight.shape[-1] % tp_size == 0, (
    f"Hidden size {score_weight.shape[-1]} must be divisible by tensor parallel size {tp_size}")
Consider raising a more descriptive error message that includes the actual hidden size and tensor parallel size values for easier debugging.
assert score_weight.shape[-1] % tp_size == 0, (
    f"Hidden size {score_weight.shape[-1]} must be divisible by tensor parallel size {tp_size}. "
    f"Got hidden_size={score_weight.shape[-1]} and tp_size={tp_size}")
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Thanks for your contribution!
You may need to install mteb[bm25s]>=1.38.11, <2 to run the tests.
If possible, please help fix PP and DP as well.
Please add a pytest.skip to the added test before submitting, until we have a test group dedicated to running multi-card pooling model tests.
Thank you @noooop for your guidance and for providing the test details. I'm pleased to report that my current fix works well with both tensor parallelism (TP) and pipeline parallelism (PP) in my testing environment. Unfortunately, I wasn't able to test data parallelism (DP).
Regarding the test case, I regretfully need to inform you that I'm currently working in a completely air-gapped environment without internet access. This makes it extremely challenging to set up the testing environment, as I would need to manually transfer each dependency file individually. Given the scope of this fix, this exceeds the resources I can currently allocate to this contribution.
I appreciate your understanding of these constraints. If there's a simpler way to validate the changes, or if someone with better connectivity could help with the test implementation, that would be most helpful.
We need reproducible code to verify correctness and to ensure others don't accidentally break it. Sorry, I can't help you with the testing; hopefully someone else can.
@Isotr0py are you able to help with this? I am quite busy nowadays.
I'm just catching up on #20168; I will take a look into this ASAP.
Signed-off-by: Isotr0py <2037008807@qq.com>
For efficiency, I directly pushed the changes using row_parallel_weight_loader and added TP tests for the Qwen3 reranker (a rough sketch of the idea follows below).
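As a rough sketch of what a row-parallel-style weight loader does (assumed names, signature, and shapes for illustration, not vLLM's actual row_parallel_weight_loader), each rank copies only its own slice of the full checkpoint tensor into the locally allocated parameter:

```python
# Illustrative row-parallel-style weight loader; names, signature, and shapes
# are assumptions for this sketch, not vLLM's actual implementation.
import torch


def sharded_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor,
                          tp_rank: int, tp_size: int) -> None:
    """Copy this rank's shard of the full checkpoint tensor into `param`.

    `param` is the pre-allocated local shard whose last dimension is
    hidden_size // tp_size, while `loaded_weight` carries the full hidden size.
    """
    shard_size = param.shape[-1]
    assert loaded_weight.shape[-1] == shard_size * tp_size, (
        "Checkpoint hidden size must equal tp_size times the local shard size")
    start = tp_rank * shard_size
    param.copy_(loaded_weight[..., start:start + shard_size])
```

Attaching a loader like this to the parameter keeps the sharding decision next to the parameter definition instead of inside each model's load_weights path.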
The TP tests have passed on my side locally with 2 GPUs:
(VllmWorker rank=0 pid=38353) INFO 07-10 08:53:34 [gpu_model_runner.py:2329] Graph capturing finished in 4 secs, took 0.11 GiB
(VllmWorker rank=1 pid=38354) INFO 07-10 08:53:34 [gpu_model_runner.py:2329] Graph capturing finished in 4 secs, took 0.11 GiB
INFO 07-10 08:53:34 [core.py:172] init engine (profile, create kv cache, warmup model) took 43.81 seconds
INFO 07-10 08:53:35 [config.py:4631] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 07-10 08:54:50 [config.py:3395] Upcasting torch.bfloat16 to torch.float32.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
VLLM: torch.float16 0.26708
SentenceTransformers: torch.float32 0.26573
Difference: -0.0013499999999999623
PASSED
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info1] Fork a new process to run a test 42847
Fork a new process to run a test 0
Skipping test.
PASSED
...
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info0]
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info1]
/kaggle/working/vllm/tests/utils.py:737: DeprecationWarning: This process (pid=38069) is multi-threaded, use of fork() may lead to deadlocks in the child.
pid = os.fork()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================= 2 passed, 2 deselected, 558 warnings in 667.83s (0:11:07) =================================================
Nvm, the test runs MTEB.
Just as I said in #19344: in #19344, MTEB_RERANK_TOL was relaxed from 1e-4 to 1e-3. You can first change MTEB_RERANK_TOL from 2e-3 to 1e-2 to make the test pass. I am building a stronger RERANK test (╯‵□′)╯︵┻━┻
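For context, the MTEB comparison in the logs above reduces to a tolerance check along these lines (a sketch only; the constant name mirrors the discussion above and the scores are taken from the earlier log):

```python
# Illustrative tolerance check between vLLM and SentenceTransformers MTEB main
# scores; the constant mirrors the discussion above, the values come from the log.
MTEB_RERANK_TOL = 1e-2  # relaxed tolerance suggested above for the TP test

vllm_main_score = 0.26708  # "VLLM: torch.float16 0.26708"
st_main_score = 0.26573    # "SentenceTransformers: torch.float32 0.26573"

difference = st_main_score - vllm_main_score
assert abs(difference) < MTEB_RERANK_TOL, (
    f"Score difference {difference:+.5f} exceeds tolerance {MTEB_RERANK_TOL}")
print(f"Difference: {difference:+.5f} (within ±{MTEB_RERANK_TOL})")
```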
[2025-07-10T10:42:21Z] VLLM: torch.bfloat16 0.26717
[2025-07-11T09:49:28Z] VLLM: torch.bfloat16 0.26756
(╯‵□′)╯︵┻━┻
The lint-and-deploy CI is currently down; I will update this PR again once #20812 is merged to fix it. 😅
Signed-off-by: Isotr0py <2037008807@qq.com>
…llm-project#20682)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
…llm-project#20682)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Note: further testing and review may be required.
Purpose
Fix the tensor parallel dimension mismatch in Qwen3 reranker weight loading when tensor_parallel_size > 1 (fixes [Bug]: Tensor dimension mismatch when loading Qwen3-Reranker-4B with tensor parallel > 1 #20670). The model was failing with a RuntimeError (tensor size mismatch) when loading with multiple GPUs.
Test Plan
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Reranker-4B \
    --task score \
    --tensor_parallel_size 2 \
    --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'
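To exercise the server started above, a scoring request along the following lines can be used; the /score route and the text_1/text_2 payload fields are assumptions about the OpenAI-compatible server rather than a verified API reference, and may need adjusting:

```python
# Hypothetical client for the scoring server launched above; the endpoint path
# and payload field names are assumptions, not a verified API reference.
import requests

payload = {
    "model": "Qwen/Qwen3-Reranker-4B",
    "text_1": "What is the capital of France?",
    "text_2": [
        "Paris is the capital and largest city of France.",
        "The Great Wall of China is thousands of kilometres long.",
    ],
}

resp = requests.post("http://localhost:8000/score", json=payload, timeout=60)
resp.raise_for_status()
for item in resp.json().get("data", []):
    print(item)  # expected to contain a relevance score per (text_1, text_2) pair
```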
Test Result
Everything works well.
Technical Details
The root cause was in vllm/model_executor/models/adapters.py, where two sequence classification weight loading functions (load_weights_using_from_2_way_softmax and load_weights_no_post_processing) weren't tensor-parallel aware.
Key changes:
- Detect the current tensor parallel rank and size in both weight loading functions.
- Shard the classification weights along the hidden dimension according to the tensor parallel rank.
- Add assertions that the hidden dimension is divisible by the tensor parallel size.
This fix has:
- No impact on single-GPU usage (tp_size=1).
- No API changes.