Skip to content

[Bug]: Qwen3-Reranker-vllm exhibits a large gap between offline and online inference. #20730

@whtDHU

Description

@whtDHU

Your current environment

Use the latest vllm version v0.9.2 and A40 GPU. model: Qwen3-reranker-0.6B and 4B.

🐛 Describe the bug

There is a significant gap between the online inference results and the offline inference results for Qwen3-reranker.

Online inference start command:

vllm serve Qwen3-Reranker-0.6B --host 0.0.0.0 --port 12501 --gpu_memory_utilization=0.5 --max-model-len 8192 --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'

curl:

curl -X POST "http://127.0.0.1:12501/rerank" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{

"query":"信息科电话是多少",
"documents":[
"科室名称:客服与投诉中心",
"科室代码:10086,科室名称:信息中心本级",
"科室代码:10086,科室名称:客服与投诉中心",
"科室:机房,电话号码:8888888811;短号:00909090",
"科室名称:信息中心本级",
"科室:信息中心值班,考勤单元:信息中心;电话号码:13131313131313;短号:090909090;上级科室:行政后勤"
]
}'

results:

{
    "results": [
        {
            "index": 3,
            "document": {
                "text": "科室:机房,电话号码:8888888811;短号:00909090"
            },
            "relevance_score": 0.7431679368019104
        },
        {
            "index": 2,
            "document": {
                "text": "科室代码:10086,科室名称:客服与投诉中心"
            },
            "relevance_score": 0.7371581792831421
        },
        {
            "index": 0,
            "document": {
                "text": "科室名称:客服与投诉中心"
            },
            "relevance_score": 0.671470582485199
        },
        {
            "index": 1,
            "document": {
                "text": "科室代码:10086,科室名称:信息中心本级"
            },
            "relevance_score": 0.6132366061210632
        },
        {
            "index": 4,
            "document": {
                "text": "科室名称:信息中心本级"
            },
            "relevance_score": 0.5636181831359863
        },
        {
            "index": 5,
            "document": {
                "text": "科室:信息中心值班,考勤单元:信息中心;电话号码:13131313131313;短号:090909090;上级科室:行政后勤"
            },
            "relevance_score": 0.18476751446723938
        }
    ]
}

Offline inference start command:
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/qwen3_reranker.py

data:

instruction = (
        "Given a web search query, retrieve relevant passages that answer the query"
    )

    queries = [
        "信息科电话是多少"
    ]

    documents = [
    "科室名称:客服与投诉中心",
    "科室代码:10086,科室名称:信息中心本级",
    "科室代码:10086,科室名称:客服与投诉中心",
    "科室:机房,电话号码:8888888811;短号:00909090",
    "科室名称:信息中心本级",
    "科室:信息中心值班,考勤单元:信息中心;电话号码:13131313131313;短号:090909090;上级科室:行政后勤"
]

results:

[0.004331501666456461, 0.2613309323787689, 0.14223189651966095, 0.974821150302887, 0.1127954050898552, 0.9966233968734741]

there is a significant difference between the online inference and offline inference results, and there is an issue with the online inference.
I tested both Qwen3-reranker-0.6 and 4B, and they both have this problem.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions