Your current environment
The output of `python collect_env.py`
==============================
System Info
OS : Ubuntu 22.04.5 LTS (aarch64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 4.0.3
Libc version : glibc-2.35
==============================
PyTorch Info
PyTorch version : 2.7.1+cpu
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
Python version : 3.11.13 (main, Jul 26 2025, 07:27:32) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-5.10.0-136.50.0.129.r1.hp22.aarch64-aarch64-with-glibc2.35
==============================
CUDA / GPU Info
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.1
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1.dev20250724
[pip3] torchvision==0.22.1
[pip3] transformers==4.53.3
[conda] Could not collect
==============================
vLLM Info
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.10.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
TORCH_DEVICE_BACKEND_AUTOLOAD=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
VLLM_WORKER_MULTIPROC_METHOD=spawn
🐛 Describe the bug
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2_5_vl",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "这张图片展示了什么?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "https://qcloud.dpfile.com/pc/BPM5Xb8gzl4PrqtrJ2Ir4-aX2bFrnd_H1pt8hm8R0YtQqusOJ_ESiPdf_7my1mHB.jpg"
        }
      }
    ]
  }]
}'
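For reference, a minimal sketch of the same request using the openai Python client. It assumes the server started by the vllm serve command below is listening on http://localhost:8000 and that no API key is enforced (the api_key value is only a placeholder):

# Minimal sketch: same chat completion request as the curl command above.
# Assumes the vLLM OpenAI-compatible server is on localhost:8000 and was
# started with --served-model-name qwen2_5_vl; api_key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen2_5_vl",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "这张图片展示了什么?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qcloud.dpfile.com/pc/BPM5Xb8gzl4PrqtrJ2Ir4-aX2bFrnd_H1pt8hm8R0YtQqusOJ_ESiPdf_7my1mHB.jpg"
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)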
Model deployment: vllm serve /home/llm_model/Qwen/Qwen2.5-VL-72B-Instruct/ --load-format runai_streamer --tensor-parallel-size 4 --max-model-len 20000 --max-num-seqs 2048 --kv-cache auto --gpu-memory-utilization 1.0 --disable-custom-all-reduce --served-model-name qwen2_5_vl --disable-log-requests
Image: quay.io/ascend/vllm-ascend:v0.10.0rc1
Requests against Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct succeed.
Requests against Qwen2.5-VL-72B-Instruct fail:
ERROR 09-12 09:33:06 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.0) with config: model='/home/llm_model/Qwen/Qwen2.5-VL-72B-Instruct/', speculative_config=None, tokenizer='/home/llm_model/Qwen/Qwen2.5-VL-72B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen2_5_vl, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,456,408,352,304,248,192,144,88,40,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null},
ERROR 09-12 09:33:06 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-3c4332c83e6c46b484470469d6b4025a,prompt_token_ids_len=1395,mm_inputs=[{'image_grid_thw': tensor([[ 1, 74, 74]]), 'pixel_values': tensor([[0.5430, 0.5430, 0.5430, ..., 1.0547, 1.0547, 1.0547],
ERROR 09-12 09:33:06 [dump_input.py:76] [0.5430, 0.5430, 0.5273, ..., 1.0547, 1.0547, 1.0547],
ERROR 09-12 09:33:06 [dump_input.py:76] [0.5742, 0.5742, 0.5742, ..., 1.0547, 1.0547, 1.0547],
ERROR 09-12 09:33:06 [dump_input.py:76] ...,
ERROR 09-12 09:33:06 [dump_input.py:76] [1.7969, 1.7969, 1.7969, ..., 1.8594, 1.8594, 1.8594],
ERROR 09-12 09:33:06 [dump_input.py:76] [1.8125, 1.8125, 1.8125, ..., 1.8594, 1.8594, 1.8594],
ERROR 09-12 09:33:06 [dump_input.py:76] [1.8125, 1.8125, 1.8125, ..., 1.8594, 1.8594, 1.8594]],
ERROR 09-12 09:33:06 [dump_input.py:76] dtype=torch.bfloat16)}],mm_hashes=['3bd83757ab0236d2fd03ecfc1e9f5642c7df57369b192c409456307d73d4edf1'],mm_positions=[PlaceholderRange(offset=20, length=1369, is_embed=None)],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.01, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=19973, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-3c4332c83e6c46b484470469d6b4025a: 1395}, total_num_scheduled_tokens=1395, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-3c4332c83e6c46b484470469d6b4025a: [0]}, num_common_prefix_blocks=[11], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 09-12 09:33:06 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, kv_cache_usage=0.005095541401273884, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=1395, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
ERROR 09-12 09:33:06 [core.py:634] EngineCore encountered a fatal error.
ERROR 09-12 09:33:06 [core.py:634] Traceback (most recent call last):
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 237, in collective_rpc
ERROR 09-12 09:33:06 [core.py:634] result = get_response(w, dequeue_timeout)
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 220, in get_response
ERROR 09-12 09:33:06 [core.py:634] status, result = w.worker_response_mq.dequeue(
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 507, in dequeue
ERROR 09-12 09:33:06 [core.py:634] with self.acquire_read(timeout, cancel) as buf:
ERROR 09-12 09:33:06 [core.py:634] File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 137, in enter
ERROR 09-12 09:33:06 [core.py:634] return next(self.gen)
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 469, in acquire_read
ERROR 09-12 09:33:06 [core.py:634] raise TimeoutError
ERROR 09-12 09:33:06 [core.py:634] TimeoutError
ERROR 09-12 09:33:06 [core.py:634]
ERROR 09-12 09:33:06 [core.py:634] The above exception was the direct cause of the following exception:
ERROR 09-12 09:33:06 [core.py:634]
ERROR 09-12 09:33:06 [core.py:634] Traceback (most recent call last):
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 625, in run_engine_core
ERROR 09-12 09:33:06 [core.py:634] engine_core.run_busy_loop()
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 652, in run_busy_loop
ERROR 09-12 09:33:06 [core.py:634] self._process_engine_step()
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 677, in _process_engine_step
ERROR 09-12 09:33:06 [core.py:634] outputs, model_executed = self.step_fn()
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 267, in step
ERROR 09-12 09:33:06 [core.py:634] model_output = self.execute_model_with_error_logging(
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 253, in execute_model_with_error_logging
ERROR 09-12 09:33:06 [core.py:634] raise err
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 244, in execute_model_with_error_logging
ERROR 09-12 09:33:06 [core.py:634] return model_fn(scheduler_output)
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 167, in execute_model
ERROR 09-12 09:33:06 [core.py:634] (output, ) = self.collective_rpc(
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
ERROR 09-12 09:33:06 [core.py:634] raise TimeoutError(f"RPC call to {method} timed out.") from e
ERROR 09-12 09:33:06 [core.py:634] TimeoutError: RPC call to execute_model timed out.
ERROR 09-12 09:33:06 [async_llm.py:416] AsyncLLM output_handler failed.
ERROR 09-12 09:33:06 [async_llm.py:416] Traceback (most recent call last):
ERROR 09-12 09:33:06 [async_llm.py:416] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 375, in output_handler
ERROR 09-12 09:33:06 [async_llm.py:416] outputs = await engine_core.get_output_async()
ERROR 09-12 09:33:06 [async_llm.py:416] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [async_llm.py:416] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 751, in get_output_async
ERROR 09-12 09:33:06 [async_llm.py:416] raise self._format_exception(outputs) from None
ERROR 09-12 09:33:06 [async_llm.py:416] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO: 127.0.0.1:48652 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [30398]