[Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading #20682
Conversation
Summary of Changes
Hello @yurhett, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request provides a critical bug fix for the Qwen3 reranker model, specifically addressing a tensor parallel dimension mismatch that occurred during weight loading when using multiple GPUs. By introducing tensor-parallel awareness and appropriate weight sharding logic into the relevant loading functions, the PR ensures that the model can correctly load and operate in a distributed environment, resolving the previously encountered RuntimeError without affecting single-GPU performance or existing APIs.
Highlights
- Tensor Parallelism Bugfix: Resolved a tensor parallel dimension mismatch error affecting the Qwen3-Reranker-4B model when tensor_parallel_size > 1, which previously prevented the model from loading correctly.
- Tensor-Parallel Aware Weight Loading: Modified two sequence classification weight loading functions (load_weights_using_from_2_way_softmax and load_weights_no_post_processing) to be tensor-parallel aware, ensuring correct weight distribution across GPUs.
- Weight Sharding Logic: Implemented specific sharding logic for both single weight vectors (sharded along the hidden dimension) and weight matrices (sharded along the last/hidden dimension) based on the current tensor parallel rank and size; see the sketch after this list.
- Robustness and Compatibility: Added assertions to verify that hidden dimensions are divisible by the tensor parallel size for safe sharding. This fix has no impact on single-GPU usage (tp_size=1) and introduces no API changes.
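As a rough illustration of that sharding (a minimal sketch with assumed names and shapes, not the actual vLLM code), each tensor-parallel rank keeps only its slice of the hidden dimension:

```python
# Minimal sketch of hidden-dimension sharding for the classification head;
# function and variable names here are illustrative, not the actual vLLM code.
import torch


def shard_along_hidden_dim(weight: torch.Tensor, tp_rank: int,
                           tp_size: int) -> torch.Tensor:
    """Return the slice of `weight` owned by the given tensor-parallel rank.

    The hidden dimension is assumed to be the last dimension, both for a
    1-D score vector [hidden_size] and a 2-D matrix [num_labels, hidden_size].
    """
    hidden_size = weight.shape[-1]
    assert hidden_size % tp_size == 0, (
        f"Hidden size {hidden_size} must be divisible by "
        f"tensor parallel size {tp_size}")
    shard_size = hidden_size // tp_size
    start = tp_rank * shard_size
    return weight[..., start:start + shard_size]


# Example: hidden_size=8, tp_size=2 -> each rank keeps 4 of the 8 columns.
full = torch.arange(16, dtype=torch.float32).reshape(2, 8)
print(shard_along_hidden_dim(full, tp_rank=0, tp_size=2).shape)  # torch.Size([2, 4])
```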
Code Review
This pull request fixes a tensor parallel dimension mismatch issue in Qwen3-Reranker-4B model weight loading. The changes include tensor parallel rank and size detection, weight sharding logic, and assertions for safety. I suggested improving the error messages for better debugging.
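For reference, the rank/size detection mentioned above typically looks like the following sketch; the import path mirrors vLLM's distributed utilities but should be treated as an assumption here rather than a confirmed excerpt of the patch:

```python
# Sketch of tensor-parallel rank/size detection; treat the import path as an
# assumption, not a verified excerpt of the change under review.
from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)

tp_rank = get_tensor_model_parallel_rank()
tp_size = get_tensor_model_parallel_world_size()
# Each rank then keeps only its hidden-dimension shard of the score weights.
```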
assert weight.shape[0] % tp_size == 0, (
    f"Hidden size {weight.shape[0]} must be divisible by tensor parallel size {tp_size}")
Consider raising a more descriptive error message that includes the actual hidden size and tensor parallel size values for easier debugging.
assert weight.shape[0] % tp_size == 0, (
    f"Hidden size {weight.shape[0]} must be divisible by tensor parallel size {tp_size}. "
    f"Got hidden_size={weight.shape[0]} and tp_size={tp_size}")
assert score_weight.shape[-1] % tp_size == 0, (
    f"Hidden size {score_weight.shape[-1]} must be divisible by tensor parallel size {tp_size}")
Consider raising a more descriptive error message that includes the actual hidden size and tensor parallel size values for easier debugging.
assert score_weight.shape[-1] % tp_size == 0, (
    f"Hidden size {score_weight.shape[-1]} must be divisible by tensor parallel size {tp_size}. "
    f"Got hidden_size={score_weight.shape[-1]} and tp_size={tp_size}")
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Thanks for your contribution!
You may need to install mteb[bm25s]>=1.38.11, <2 to run the tests.
If possible, please help fix PP and DP as well.
Please add a pytest.skip to the added test before submitting, until we have a test group dedicated to running multi-card pooling model tests.
Thank you @noooop for your guidance and for providing the test details. I'm pleased to report that my current fix works well with both tensor parallelism (TP) and pipeline parallelism (PP) in my testing environment. Unfortunately, I wasn't able to test data parallelism (DP).
Regarding the test case, I regretfully need to inform you that I'm currently working in a completely air-gapped environment without internet access. This makes it extremely challenging to set up the testing environment, as I would need to manually transfer each dependency file individually. Given the scope of this fix, this exceeds the resources I can currently allocate to this contribution.
I appreciate your understanding of these constraints. If there's a simpler way to validate the changes, or if someone with better connectivity could help with the test implementation, that would be most helpful.
We need reproducible code to verify correctness and to ensure others don't accidentally break it. Sorry, I can't help you with the testing; hopefully someone else can.
@Isotr0py are you able to help with this? I am quite busy nowadays.
I'm just catching up on #20168; I will take a look into this ASAP.
Signed-off-by: Isotr0py <2037008807@qq.com>
For efficiency, I directly pushed the changes using row_parallel_weight_loader and added TP tests for the Qwen3 reranker (a rough sketch of the idea follows below).
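As a rough sketch of what a row-parallel-style weight loader does (assumed names, signature, and shapes for illustration, not vLLM's actual row_parallel_weight_loader), each rank copies only its own slice of the full checkpoint tensor into the locally allocated parameter:

```python
# Illustrative row-parallel-style weight loader; names, signature, and shapes
# are assumptions for this sketch, not vLLM's actual implementation.
import torch


def sharded_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor,
                          tp_rank: int, tp_size: int) -> None:
    """Copy this rank's shard of the full checkpoint tensor into `param`.

    `param` is the pre-allocated local shard whose last dimension is
    hidden_size // tp_size, while `loaded_weight` carries the full hidden size.
    """
    shard_size = param.shape[-1]
    assert loaded_weight.shape[-1] == shard_size * tp_size, (
        "Checkpoint hidden size must equal tp_size times the local shard size")
    start = tp_rank * shard_size
    param.copy_(loaded_weight[..., start:start + shard_size])
```

Attaching a loader like this to the parameter keeps the sharding decision next to the parameter definition instead of inside each model's load_weights path.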
The TP tests have passed on my side locally with 2 GPUs:
(VllmWorker rank=0 pid=38353) INFO 07-10 08:53:34 [gpu_model_runner.py:2329] Graph capturing finished in 4 secs, took 0.11 GiB
(VllmWorker rank=1 pid=38354) INFO 07-10 08:53:34 [gpu_model_runner.py:2329] Graph capturing finished in 4 secs, took 0.11 GiB
INFO 07-10 08:53:34 [core.py:172] init engine (profile, create kv cache, warmup model) took 43.81 seconds
INFO 07-10 08:53:35 [config.py:4631] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 07-10 08:54:50 [config.py:3395] Upcasting torch.bfloat16 to torch.float32.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
VLLM: torch.float16 0.26708
SentenceTransformers: torch.float32 0.26573
Difference: -0.0013499999999999623
PASSED
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info1] Fork a new process to run a test 42847
Fork a new process to run a test 0
Skipping test.
PASSED
...
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info0]
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info1]
/kaggle/working/vllm/tests/utils.py:737: DeprecationWarning: This process (pid=38069) is multi-threaded, use of fork() may lead to deadlocks in the child.
pid = os.fork()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================= 2 passed, 2 deselected, 558 warnings in 667.83s (0:11:07) =================================================
Nvm, the test runs MTEB.
Just as I said in #19344: in #19344, MTEB_RERANK_TOL was relaxed from 1e-4 to 1e-3. You can first change MTEB_RERANK_TOL from 2e-3 to 1e-2 to make the test pass. I am building a stronger RERANK test (╯‵□′)╯︵┻━┻
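For context, the MTEB comparison in the logs above reduces to a tolerance check along these lines (a sketch only; the constant name mirrors the discussion above and the scores are taken from the earlier log):

```python
# Illustrative tolerance check between vLLM and SentenceTransformers MTEB main
# scores; the constant mirrors the discussion above, the values come from the log.
MTEB_RERANK_TOL = 1e-2  # relaxed tolerance suggested above for the TP test

vllm_main_score = 0.26708  # "VLLM: torch.float16 0.26708"
st_main_score = 0.26573    # "SentenceTransformers: torch.float32 0.26573"

difference = st_main_score - vllm_main_score
assert abs(difference) < MTEB_RERANK_TOL, (
    f"Score difference {difference:+.5f} exceeds tolerance {MTEB_RERANK_TOL}")
print(f"Difference: {difference:+.5f} (within ±{MTEB_RERANK_TOL})")
```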
[2025-07-10T10:42:21Z] VLLM: torch.bfloat16 0.26717
[2025-07-11T09:49:28Z] VLLM: torch.bfloat16 0.26756
(╯‵□′)╯︵┻━┻
The lint-and-deploy CI is currently down; I will update this PR again once #20812 is merged to fix it. 😅
Signed-off-by: Isotr0py <2037008807@qq.com>
…llm-project#20682)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
…llm-project#20682)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Note: further testing and review may be required.
Purpose
Fix the tensor parallel dimension mismatch in Qwen3 reranker weight loading when tensor_parallel_size > 1 (fixes [Bug]: Tensor dimension mismatch when loading Qwen3-Reranker-4B with tensor parallel > 1 #20670). The model was failing with a RuntimeError (tensor size mismatch) when loading with multiple GPUs.
Test Plan
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Reranker-4B \
    --task score \
    --tensor_parallel_size 2 \
    --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'
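To exercise the server started above, a scoring request along the following lines can be used; the /score route and the text_1/text_2 payload fields are assumptions about the OpenAI-compatible server rather than a verified API reference, and may need adjusting:

```python
# Hypothetical client for the scoring server launched above; the endpoint path
# and payload field names are assumptions, not a verified API reference.
import requests

payload = {
    "model": "Qwen/Qwen3-Reranker-4B",
    "text_1": "What is the capital of France?",
    "text_2": [
        "Paris is the capital and largest city of France.",
        "The Great Wall of China is thousands of kilometres long.",
    ],
}

resp = requests.post("http://localhost:8000/score", json=payload, timeout=60)
resp.raise_for_status()
for item in resp.json().get("data", []):
    print(item)  # expected to contain a relevance score per (text_1, text_2) pair
```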
Test Result
Everything works well.
Technical Details
The root cause was in vllm/model_executor/models/adapters.py, where two sequence classification weight loading functions (load_weights_using_from_2_way_softmax and load_weights_no_post_processing) weren't tensor-parallel aware.
Key changes:
- Detect the current tensor parallel rank and size in both weight loading functions.
- Shard the classification weights along the hidden dimension according to the tensor parallel rank.
- Add assertions that the hidden dimension is divisible by the tensor parallel size.
This fix has:
- No impact on single-GPU usage (tp_size=1).
- No API changes.