Your current environment
The output of `python collect_env.py`
==============================
System Info
OS : Ubuntu 22.04.5 LTS (aarch64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 4.0.3
Libc version : glibc-2.35
==============================
PyTorch Info
PyTorch version : 2.7.1+cpu
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
Python version : 3.11.13 (main, Jul 26 2025, 07:27:32) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-5.10.0-136.50.0.129.r1.hp22.aarch64-aarch64-with-glibc2.35
==============================
CUDA / GPU Info
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.1
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1.dev20250724
[pip3] torchvision==0.22.1
[pip3] transformers==4.53.3
[conda] Could not collect
==============================
vLLM Info
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.10.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
TORCH_DEVICE_BACKEND_AUTOLOAD=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
VLLM_WORKER_MULTIPROC_METHOD=spawn
🐛 Describe the bug
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2_5_vl",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "这张图片展示了什么?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "https://qcloud.dpfile.com/pc/BPM5Xb8gzl4PrqtrJ2Ir4-aX2bFrnd_H1pt8hm8R0YtQqusOJ_ESiPdf_7my1mHB.jpg"
        }
      }
    ]
  }]
}'
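For reference, a minimal sketch of the same request using the openai Python client. It assumes the server started by the vllm serve command below is listening on http://localhost:8000 and that no API key is enforced (the api_key value is only a placeholder):

# Minimal sketch: same chat completion request as the curl command above.
# Assumes the vLLM OpenAI-compatible server is on localhost:8000 and was
# started with --served-model-name qwen2_5_vl; api_key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen2_5_vl",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "这张图片展示了什么?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qcloud.dpfile.com/pc/BPM5Xb8gzl4PrqtrJ2Ir4-aX2bFrnd_H1pt8hm8R0YtQqusOJ_ESiPdf_7my1mHB.jpg"
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)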
Model deployment: vllm serve /home/llm_model/Qwen/Qwen2.5-VL-72B-Instruct/ --load-format runai_streamer --tensor-parallel-size 4 --max-model-len 20000 --max-num-seqs 2048 --kv-cache auto --gpu-memory-utilization 1.0 --disable-custom-all-reduce --served-model-name qwen2_5_vl --disable-log-requests
Image: quay.io/ascend/vllm-ascend:v0.10.0rc1
Requests against Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct succeed.
Requests against Qwen2.5-VL-72B-Instruct fail:
ERROR 09-12 09:33:06 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.0) with config: model='/home/llm_model/Qwen/Qwen2.5-VL-72B-Instruct/', speculative_config=None, tokenizer='/home/llm_model/Qwen/Qwen2.5-VL-72B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen2_5_vl, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,456,408,352,304,248,192,144,88,40,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null},
ERROR 09-12 09:33:06 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-3c4332c83e6c46b484470469d6b4025a,prompt_token_ids_len=1395,mm_inputs=[{'image_grid_thw': tensor([[ 1, 74, 74]]), 'pixel_values': tensor([[0.5430, 0.5430, 0.5430, ..., 1.0547, 1.0547, 1.0547],
ERROR 09-12 09:33:06 [dump_input.py:76] [0.5430, 0.5430, 0.5273, ..., 1.0547, 1.0547, 1.0547],
ERROR 09-12 09:33:06 [dump_input.py:76] [0.5742, 0.5742, 0.5742, ..., 1.0547, 1.0547, 1.0547],
ERROR 09-12 09:33:06 [dump_input.py:76] ...,
ERROR 09-12 09:33:06 [dump_input.py:76] [1.7969, 1.7969, 1.7969, ..., 1.8594, 1.8594, 1.8594],
ERROR 09-12 09:33:06 [dump_input.py:76] [1.8125, 1.8125, 1.8125, ..., 1.8594, 1.8594, 1.8594],
ERROR 09-12 09:33:06 [dump_input.py:76] [1.8125, 1.8125, 1.8125, ..., 1.8594, 1.8594, 1.8594]],
ERROR 09-12 09:33:06 [dump_input.py:76] dtype=torch.bfloat16)}],mm_hashes=['3bd83757ab0236d2fd03ecfc1e9f5642c7df57369b192c409456307d73d4edf1'],mm_positions=[PlaceholderRange(offset=20, length=1369, is_embed=None)],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.01, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=19973, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-3c4332c83e6c46b484470469d6b4025a: 1395}, total_num_scheduled_tokens=1395, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-3c4332c83e6c46b484470469d6b4025a: [0]}, num_common_prefix_blocks=[11], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 09-12 09:33:06 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, kv_cache_usage=0.005095541401273884, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=1395, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
ERROR 09-12 09:33:06 [core.py:634] EngineCore encountered a fatal error.
ERROR 09-12 09:33:06 [core.py:634] Traceback (most recent call last):
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 237, in collective_rpc
ERROR 09-12 09:33:06 [core.py:634] result = get_response(w, dequeue_timeout)
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 220, in get_response
ERROR 09-12 09:33:06 [core.py:634] status, result = w.worker_response_mq.dequeue(
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 507, in dequeue
ERROR 09-12 09:33:06 [core.py:634] with self.acquire_read(timeout, cancel) as buf:
ERROR 09-12 09:33:06 [core.py:634] File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 137, in enter
ERROR 09-12 09:33:06 [core.py:634] return next(self.gen)
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 469, in acquire_read
ERROR 09-12 09:33:06 [core.py:634] raise TimeoutError
ERROR 09-12 09:33:06 [core.py:634] TimeoutError
ERROR 09-12 09:33:06 [core.py:634]
ERROR 09-12 09:33:06 [core.py:634] The above exception was the direct cause of the following exception:
ERROR 09-12 09:33:06 [core.py:634]
ERROR 09-12 09:33:06 [core.py:634] Traceback (most recent call last):
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 625, in run_engine_core
ERROR 09-12 09:33:06 [core.py:634] engine_core.run_busy_loop()
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 652, in run_busy_loop
ERROR 09-12 09:33:06 [core.py:634] self._process_engine_step()
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 677, in _process_engine_step
ERROR 09-12 09:33:06 [core.py:634] outputs, model_executed = self.step_fn()
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 267, in step
ERROR 09-12 09:33:06 [core.py:634] model_output = self.execute_model_with_error_logging(
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 253, in execute_model_with_error_logging
ERROR 09-12 09:33:06 [core.py:634] raise err
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 244, in execute_model_with_error_logging
ERROR 09-12 09:33:06 [core.py:634] return model_fn(scheduler_output)
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 167, in execute_model
ERROR 09-12 09:33:06 [core.py:634] (output, ) = self.collective_rpc(
ERROR 09-12 09:33:06 [core.py:634] ^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [core.py:634] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
ERROR 09-12 09:33:06 [core.py:634] raise TimeoutError(f"RPC call to {method} timed out.") from e
ERROR 09-12 09:33:06 [core.py:634] TimeoutError: RPC call to execute_model timed out.
ERROR 09-12 09:33:06 [async_llm.py:416] AsyncLLM output_handler failed.
ERROR 09-12 09:33:06 [async_llm.py:416] Traceback (most recent call last):
ERROR 09-12 09:33:06 [async_llm.py:416] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 375, in output_handler
ERROR 09-12 09:33:06 [async_llm.py:416] outputs = await engine_core.get_output_async()
ERROR 09-12 09:33:06 [async_llm.py:416] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-12 09:33:06 [async_llm.py:416] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 751, in get_output_async
ERROR 09-12 09:33:06 [async_llm.py:416] raise self._format_exception(outputs) from None
ERROR 09-12 09:33:06 [async_llm.py:416] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO: 127.0.0.1:48652 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [30398]