
[Bug]: Qwen/Qwen3-Reranker-0.6B Qwen 3 based reranking models are not working #20532

@distributedlock

Description

Your current environment

vLLM Production Stack Helm chart
K8s
CUDA 12.8
Nvidia GPUs

🐛 Describe the bug

Following #19260, I tried to implement reranking with Qwen/Qwen3-Reranker-0.6B, Qwen/Qwen3-Reranker-4B, and tomaarsen/Qwen3-Reranker-0.6B-seq-cls in the vLLM Production Stack Helm chart on k8s, but ran into the errors below.

This is my Helm setup:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: qwen-qwen3-reranker-0-6-b
      repository: vllm/vllm-openai
      tag: v0.9.1
      modelURL: Qwen/Qwen3-Reranker-0.6B
      replicaCount: 1
      requestCPU: 8
      requestMemory: 16Gi
      requestGPU: 1
      vllmConfig:
        extraArgs:
          - >-
            --hf_overrides={"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}
    - name: qwen-qwen3-reranker-4-b
      repository: vllm/vllm-openai
      tag: v0.9.1
      modelURL: Qwen/Qwen3-Reranker-4B
      replicaCount: 1
      requestCPU: 8
      requestMemory: 16Gi
      requestGPU: 1
      vllmConfig:
        extraArgs:
          - >-
            --hf_overrides={"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}
    - name: qwen3-reranker-0-6-b-seq-cls
      repository: vllm/vllm-openai
      tag: v0.9.1
      modelURL: tomaarsen/Qwen3-Reranker-0.6B-seq-cls
      replicaCount: 1
      requestCPU: 8
      requestMemory: 16Gi
      requestGPU: 1

Per #19260, I used --hf_overrides for the official Qwen3 reranker models. The pods do start, but the rerank endpoint fails with this response:

{
    "object": "error",
    "message": "The model does not support Rerank (Score) API",
    "type": "BadRequestError",
    "param": null,
    "code": 400
}

This happens for both Qwen/Qwen3-Reranker-0.6B and Qwen/Qwen3-Reranker-4B. For completeness, the endpoint is being called roughly as sketched below (assumptions: the production-stack router is reachable at a placeholder ROUTER_URL and forwards vLLM's /v1/rerank route; the model name matches the served model from the Helm spec above):
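
```python
# Sketch of the failing request. Assumptions: ROUTER_URL is a placeholder for
# our production-stack router service, and /v1/rerank is the rerank route
# exposed by the vLLM OpenAI-compatible server.
import requests

ROUTER_URL = "http://localhost:30080"  # placeholder

resp = requests.post(
    f"{ROUTER_URL}/v1/rerank",
    json={
        "model": "Qwen/Qwen3-Reranker-0.6B",
        "query": "What is the capital of France?",
        "documents": ["Paris is the capital of France.", "Berlin is in Germany."],
    },
)
print(resp.status_code, resp.json())  # returns the 400 BadRequestError above
```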

Additionally, here is the log for tomaarsen/Qwen3-Reranker-0.6B-seq-cls (this pod fails to start with the error below):

INFO 07-06 15:25:19 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 15:25:25 [api_server.py:1287] vLLM API server version 0.9.1
INFO 07-06 15:25:26 [cli_args.py:309] non-default args: {'host': '0.0.0.0', 'model': 'tomaarsen/Qwen3-Reranker-0.6B-seq-cls'}
INFO 07-06 15:25:33 [config.py:823] This model supports multiple tasks: {'generate', 'embed', 'classify', 'score', 'reward'}. Defaulting to 'generate'.
INFO 07-06 15:25:33 [config.py:3268] Downcasting torch.float32 to torch.bfloat16.
INFO 07-06 15:25:33 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 07-06 15:25:35 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-06 15:25:37 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 15:25:40 [core.py:455] Waiting for init message from front-end.
INFO 07-06 15:25:40 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='tomaarsen/Qwen3-Reranker-0.6B-seq-cls', speculative_config=None, tokenizer='tomaarsen/Qwen3-Reranker-0.6B-seq-cls', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=tomaarsen/Qwen3-Reranker-0.6B-seq-cls, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-06 15:25:40 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f7e8e5739e0>
INFO 07-06 15:25:41 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-06 15:25:41 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
WARNING 07-06 15:25:41 [utils.py:211] Qwen3ForSequenceClassification has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
INFO 07-06 15:25:41 [gpu_model_runner.py:1595] Starting to load model tomaarsen/Qwen3-Reranker-0.6B-seq-cls...
INFO 07-06 15:25:41 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-06 15:25:41 [transformers.py:146] Using Transformers backend.
INFO 07-06 15:25:42 [cuda.py:252] Using Flash Attention backend on V1 engine.
INFO 07-06 15:25:43 [weight_utils.py:292] Using model weights format ['*.safetensors']
INFO 07-06 15:25:48 [weight_utils.py:308] Time spent downloading weights for tomaarsen/Qwen3-Reranker-0.6B-seq-cls: 5.749590 seconds
INFO 07-06 15:25:48 [weight_utils.py:345] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
ERROR 07-06 15:25:49 [core.py:515] EngineCore failed to start.
ERROR 07-06 15:25:49 [core.py:515] Traceback (most recent call last):
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
Process EngineCore_0:
ERROR 07-06 15:25:49 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-06 15:25:49 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 07-06 15:25:49 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 76, in __init__
ERROR 07-06 15:25:49 [core.py:515]     self.model_executor = executor_class(vllm_config)
ERROR 07-06 15:25:49 [core.py:515]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-06 15:25:49 [core.py:515]     self._init_executor()
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 07-06 15:25:49 [core.py:515]     self.collective_rpc("load_model")
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-06 15:25:49 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-06 15:25:49 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2671, in run_method
ERROR 07-06 15:25:49 [core.py:515]     return func(*args, **kwargs)
ERROR 07-06 15:25:49 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
ERROR 07-06 15:25:49 [core.py:515]     self.model_runner.load_model()
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
ERROR 07-06 15:25:49 [core.py:515]     self.model = model_loader.load_model(
ERROR 07-06 15:25:49 [core.py:515]                  ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
ERROR 07-06 15:25:49 [core.py:515]     self.load_weights(model, model_config)
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
ERROR 07-06 15:25:49 [core.py:515]     loaded_weights = model.load_weights(
ERROR 07-06 15:25:49 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers.py", line 508, in load_weights
ERROR 07-06 15:25:49 [core.py:515]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
ERROR 07-06 15:25:49 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 278, in load_weights
ERROR 07-06 15:25:49 [core.py:515]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 07-06 15:25:49 [core.py:515]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 15:25:49 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 264, in _load_module
ERROR 07-06 15:25:49 [core.py:515]     raise ValueError(msg)
ERROR 07-06 15:25:49 [core.py:515] ValueError: There is no module or parameter named 'score' in TransformersForCausalLM
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 76, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
    self.collective_rpc("load_model")
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
    self.model = model_loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
    self.load_weights(model, model_config)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
    loaded_weights = model.load_weights(
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers.py", line 508, in load_weights
    return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 278, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 264, in _load_module
    raise ValueError(msg)
ValueError: There is no module or parameter named 'score' in TransformersForCausalLM
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

[rank0]:[W706 15:25:50.244265807 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 59, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 58, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1323, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 155, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 93, in make_async_mp_client
    return AsyncMPClient(vllm_config, executor_class, log_stats,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 716, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 422, in __init__
    self._init_engines_direct(vllm_config, local_only,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
    self._wait_for_engine_startup(handshake_socket, input_address,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 494, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Just to add: we are able to run BAAI/bge-reranker-v2-m3 on the same Helm setup without any issue, but hit the problems above when testing the Qwen3 rerankers, as suggested in #20300. In case it helps narrow things down, a minimal offline check of the seq-cls variant is sketched below (assuming vLLM 0.9.1; I would expect it to hit the same 'score' weight-loading error if it also falls back to the Transformers backend):
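
```python
# Offline sketch to reproduce the seq-cls failure outside the Helm stack.
# Assumption: vLLM 0.9.1; no hf_overrides should be needed since this
# checkpoint already uses the Qwen3ForSequenceClassification architecture.
from vllm import LLM

llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score")
outputs = llm.score(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is in Germany."],
)
print([o.outputs.score for o in outputs])
```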

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
