
[Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading #20682


Merged
merged 6 commits into vllm-project:main on Jul 12, 2025

Conversation

@yurhett (Contributor) commented on Jul 9, 2025

Further testing and review may be required!

Purpose

The model was failing with a tensor size mismatch error when trying to load with multiple GPUs:

RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1

Test Plan

  • Run the model with tensor parallelism to verify the fix:
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Reranker-4B \
  --task score \
  --tensor_parallel_size 2 \
  --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'

Test Result

  • Before fix:
    RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
  • After fix:
    The model loads and serves scoring requests correctly with tensor_parallel_size=2.

Technical Details

The root cause was in vllm/model_executor/models/adapters.py where two sequence classification weight loading functions (load_weights_using_from_2_way_softmax and load_weights_no_post_processing) weren't tensor-parallel aware.

Key changes:

  1. Added tensor parallel rank and size detection
  2. Added weight sharding logic (a minimal sketch follows below):
    • For single weight vectors: shard along the hidden dimension
    • For weight matrices: shard along the last (hidden) dimension
  3. Added assertions that the sharded dimension is divisible by the tensor parallel size

This fix has:

  • No impact on single-GPU usage (tp_size=1)
  • No API changes
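
To make this concrete, here is a minimal sketch of the sharding logic described above, using vLLM's get_tensor_model_parallel_rank / get_tensor_model_parallel_world_size helpers. The function name and structure are illustrative and simplified, not the actual code in vllm/model_executor/models/adapters.py.

import torch

from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)


def shard_classifier_weight(weight: torch.Tensor) -> torch.Tensor:
    """Return the shard of a classifier weight owned by this TP rank.

    A 1-D weight vector of shape (hidden_size,) is sharded along the hidden
    dimension; a 2-D weight matrix of shape (num_labels, hidden_size) is
    sharded along its last dimension, matching a row-parallel score layer.
    """
    tp_rank = get_tensor_model_parallel_rank()
    tp_size = get_tensor_model_parallel_world_size()
    hidden_size = weight.shape[-1]
    assert hidden_size % tp_size == 0, (
        f"Hidden size {hidden_size} must be divisible by "
        f"tensor parallel size {tp_size}")
    shard_size = hidden_size // tp_size
    # With tp_size == 1 the slice spans the full hidden dimension,
    # which is why single-GPU loading is unchanged.
    return weight[..., tp_rank * shard_size:(tp_rank + 1) * shard_size]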

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @yurhett, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides a critical bug fix for the Qwen3 reranker model, specifically addressing a tensor parallel dimension mismatch that occurred during weight loading when using multiple GPUs. By introducing tensor-parallel awareness and appropriate weight sharding logic into the relevant loading functions, the PR ensures that the model can now correctly load and operate in a distributed environment, resolving the RuntimeError previously encountered without affecting single-GPU performance or existing APIs.

Highlights

  • Tensor Parallelism Bugfix: Resolved a tensor parallel dimension mismatch error affecting the Qwen3-Reranker-4B model when tensor_parallel_size > 1, which previously prevented the model from loading correctly.
  • Tensor-Parallel Aware Weight Loading: Modified two sequence classification weight loading functions (load_weights_using_from_2_way_softmax and load_weights_no_post_processing) to be tensor-parallel aware, ensuring correct weight distribution across GPUs.
  • Weight Sharding Logic: Implemented specific sharding logic for both single weight vectors (sharding along the hidden dimension) and weight matrices (sharding along the last dimension/hidden dimension) based on the current tensor parallel rank and size.
  • Robustness and Compatibility: Added assertions to verify that hidden dimensions are divisible by the tensor parallel size for safe sharding. This fix has no impact on single-GPU usage (tp_size=1) and introduces no API changes.

@mergify (bot) added the 'qwen: Related to Qwen models' label on Jul 9, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request fixes a tensor parallel dimension mismatch issue in Qwen3-Reranker-4B model weight loading. The changes include tensor parallel rank and size detection, weight sharding logic, and assertions for safety. I suggested improving the error messages for better debugging.

Comment on lines 367 to 368
assert weight.shape[0] % tp_size == 0, (
    f"Hidden size {weight.shape[0]} must be divisible by tensor parallel size {tp_size}")
@gemini-code-assist (bot) commented (severity: medium)

Consider raising a more descriptive error message that includes the actual hidden size and tensor parallel size values for easier debugging.

assert weight.shape[0] % tp_size == 0, (
            f"Hidden size {weight.shape[0]} must be divisible by tensor parallel size {tp_size}. "
            f"Got hidden_size={weight.shape[0]} and tp_size={tp_size}")

Comment on lines 425 to 426
assert score_weight.shape[-1] % tp_size == 0, (
    f"Hidden size {score_weight.shape[-1]} must be divisible by tensor parallel size {tp_size}")
@gemini-code-assist (bot) commented (severity: medium)

Consider raising a more descriptive error message that includes the actual hidden size and tensor parallel size values for easier debugging.

assert score_weight.shape[-1] % tp_size == 0, (
            f"Hidden size {score_weight.shape[-1]} must be divisible by tensor parallel size {tp_size}. "
            f"Got hidden_size={score_weight.shape[-1]} and tp_size={tp_size}")

@github-actions (bot) commented on Jul 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@noooop (Contributor) commented on Jul 10, 2025

Thanks for your contribution!

  1. The tests involved in this piece of code are:
  • pytest -s -vvv tests/models/language/pooling/test_qwen3_reranker.py
  • pytest -s -vvv tests/models/language/pooling/test_bge_reranker_v2_gemma.py
  • pytest -s -vvv tests/models/language/pooling/test_mxbai_rerank.py

  You may need to install mteb[bm25s]>=1.38.11, <2 to run the tests.

  2. After the existing tests pass successfully, add tests such as test_rerank_models_mteb_tp. If possible, please help fix PP and DP as well.

  3. Because all pooling model tests are currently executed on a single card, please add a pytest.skip to the new multi-card test before submitting (see the sketch below), until we have a test group dedicated to running multi-card pooling model tests.
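
To illustrate point 3, here is a minimal sketch of how a multi-GPU rerank test could be skipped on single-card runners; the model ID and test body are placeholders, not the exact test added in this PR.

import pytest
import torch


@pytest.mark.parametrize("model_name", ["Qwen/Qwen3-Reranker-4B"])
def test_rerank_models_mteb_tp(model_name: str) -> None:
    # Pooling-model CI currently runs on a single card, so skip the
    # tensor-parallel test unless at least 2 GPUs are visible.
    if torch.cuda.device_count() < 2:
        pytest.skip("Tensor parallel rerank test requires >= 2 GPUs.")
    # ... run the MTEB rerank comparison with tensor_parallel_size=2 ...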

@yurhett (Contributor, Author) commented on Jul 10, 2025

Thank you @noooop for your guidance and providing the test details.

I'm pleased to report that my current fix works well with both tensor parallelism (TP) and pipeline parallelism (PP) in my testing environment. Unfortunately, I wasn't able to test data parallelism (DP), since data-parallel-size-local cannot be greater than 1 in my setup.

Regarding the test case: I'm currently working in a completely air-gapped environment without internet access, which makes it very difficult to set up the testing environment, since I would need to transfer each dependency file manually. Given the scope of this fix, that exceeds the resources I can currently allocate to this contribution.

I appreciate your understanding of these constraints. If there's a simpler way to validate the changes or if someone with better connectivity could help with the test implementation, that would be most helpful.

@noooop (Contributor) commented on Jul 10, 2025

We need reproducible test code to verify correctness and to ensure others don't accidentally break it later. Sorry, I can't help you with the testing; hopefully someone else can.

@DarkLight1337 (Member) commented on Jul 10, 2025

@Isotr0py, are you able to help with this? I am quite busy nowadays.

@Isotr0py (Collaborator) commented:

I'm just catching up on #20168; I will take a look into this ASAP.

Signed-off-by: Isotr0py <2037008807@qq.com>
@Isotr0py requested a review from ywang96 as a code owner on July 10, 2025 09:03
@Isotr0py (Collaborator) left a comment

For efficiency, I directly pushed the changes using row_parallel_weight_loader and added TP tests for the Qwen3 reranker (a sketch of this loader pattern follows the log below).

The TP tests have passed on my side locally with 2 GPUs:

(VllmWorker rank=0 pid=38353) INFO 07-10 08:53:34 [gpu_model_runner.py:2329] Graph capturing finished in 4 secs, took 0.11 GiB
(VllmWorker rank=1 pid=38354) INFO 07-10 08:53:34 [gpu_model_runner.py:2329] Graph capturing finished in 4 secs, took 0.11 GiB
INFO 07-10 08:53:34 [core.py:172] init engine (profile, create kv cache, warmup model) took 43.81 seconds
INFO 07-10 08:53:35 [config.py:4631] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 07-10 08:54:50 [config.py:3395] Upcasting torch.bfloat16 to torch.float32.                                                                              
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
VLLM: torch.float16 0.26708
SentenceTransformers: torch.float32 0.26573
Difference: -0.0013499999999999623
PASSED
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info1] Fork a new process to run a test 42847
Fork a new process to run a test 0
Skipping test.
PASSED

...

tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info0]
tests/models/language/pooling/test_qwen3_reranker.py::test_rerank_models_mteb_tp[model_info1]
  /kaggle/working/vllm/tests/utils.py:737: DeprecationWarning: This process (pid=38069) is multi-threaded, use of fork() may lead to deadlocks in the child.
    pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================= 2 passed, 2 deselected, 558 warnings in 667.83s (0:11:07) =================================================
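
For context on the row_parallel_weight_loader approach mentioned above, the sketch below shows the loader-callback form of the same slicing sketched earlier: it narrows the unsharded checkpoint weight along its last dimension by tensor-parallel rank and copies the shard into the rank-local parameter. This is an illustrative stand-in, not vLLM's actual row_parallel_weight_loader implementation.

import torch

from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)


def row_parallel_loader_sketch(param: torch.Tensor,
                               loaded_weight: torch.Tensor) -> None:
    """Copy this rank's slice of an unsharded checkpoint weight into param.

    Row-parallel layers split their input (last) dimension across TP ranks,
    so the full weight is narrowed along dim -1 before copying.
    """
    tp_rank = get_tensor_model_parallel_rank()
    tp_size = get_tensor_model_parallel_world_size()
    shard_size = loaded_weight.shape[-1] // tp_size
    shard = loaded_weight.narrow(-1, tp_rank * shard_size, shard_size)
    assert param.shape == shard.shape
    param.data.copy_(shard)

Reusing a shared loader like this keeps the sharding logic in one place instead of duplicating it in each sequence-classification adapter.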

@Isotr0py added the 'ready: ONLY add when PR is ready to merge/full CI is needed' label on Jul 10, 2025
@Isotr0py enabled auto-merge (squash) on July 10, 2025 09:45
@DarkLight1337 (Member) commented on Jul 10, 2025

It would be best to also have a correctness check, in case the weights were loaded with invalid values.

@DarkLight1337 (Member) left a comment

Never mind, the test already runs MTEB.

@noooop (Contributor) commented on Jul 11, 2025

@Isotr0py

Just as I said in #19344.

I don't know why, but the differences between the models on local and CI machines are greater than those between fp16 and fp32.

in #19344 MTEB_RERANK_TOL = 1e-4 -> MTEB_RERANK_TOL = 1e-3
in #20615 MTEB_RERANK_TOL = 1e-3 -> MTEB_RERANK_TOL = 2e-3

You can first change MTEB_RERANK_TOL = 2e-3 -> MTEB_RERANK_TOL = 1e-2 to make the test pass

I am building a stronger RERANK test (╯‵□′)╯︵┻━┻
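
For readers unfamiliar with how this tolerance is used: the MTEB tests compare vLLM's main score against the SentenceTransformers reference and require the difference to stay within MTEB_RERANK_TOL. The sketch below is simplified and the function name is hypothetical; the real check lives in the pooling test utilities.

# Simplified sketch of how the tolerance gates the MTEB rerank comparison.
MTEB_RERANK_TOL = 1e-2  # the loosened value suggested above


def check_mteb_rerank(vllm_main_score: float, st_main_score: float) -> None:
    difference = st_main_score - vllm_main_score
    print(f"Difference: {difference}")
    assert abs(difference) < MTEB_RERANK_TOL, (
        f"Score difference {difference} exceeds tolerance {MTEB_RERANK_TOL}")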

Isotr0py added 2 commits July 11, 2025 11:21
Signed-off-by: Isotr0py <2037008807@qq.com>
@noooop (Contributor) commented on Jul 11, 2025

[2025-07-10T10:42:21Z] VLLM: torch.bfloat16 0.26717
[2025-07-10T10:42:21Z] SentenceTransformers: torch.float32 0.25736
[2025-07-10T10:42:21Z] Difference: -0.009810000000000041
[2025-07-10T10:42:21Z]

[2025-07-11T09:49:28Z] VLLM: torch.bfloat16 0.26756
[2025-07-11T09:49:28Z] SentenceTransformers: torch.float32 0.25736
[2025-07-11T09:49:28Z] Difference: -0.010200000000000042
[2025-07-11T09:49:28Z]

(╯‵□′)╯︵┻━┻

@Isotr0py (Collaborator) commented:

The lint and deploy CI is currently down; I will update this PR again once #20812 is merged to fix it. 😅

Isotr0py added 2 commits July 11, 2025 22:20
Signed-off-by: Isotr0py <2037008807@qq.com>
@simon-mo disabled auto-merge on July 12, 2025 03:52
@simon-mo merged commit 11c0198 into vllm-project:main on Jul 12, 2025
66 of 68 checks passed
Chen-zexi pushed a commit to Chen-zexi/vllm that referenced this pull request Jul 13, 2025
…llm-project#20682)

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
patrickvonplaten pushed a commit to patrickvonplaten/vllm that referenced this pull request Jul 15, 2025
…llm-project#20682)

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Labels
qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed)
Development

Successfully merging this pull request may close these issues.

[Bug]: Tensor dimension mismatch when loading Qwen3-Reranker-4B with tensor parallel > 1
5 participants