Your current environment
910B3
package_name=Ascend-cann-toolkit
version=8.2.RC1
vLLM Version: 0.10.1.1
vLLM Ascend Version: 0.10.1rc2.dev31+g5bcb4c152.d20250909 (git sha: 5bcb4c1, date: 20250909)
🐛 Describe the bug
Model: GLM-4.5V-W8A8
Both the vision layers and the language layers are fully quantized. The following error is raised during model inference.
(APIServer pid=246) INFO: Started server process [246]
(APIServer pid=246) INFO: Waiting for application startup.
(APIServer pid=246) INFO: Application startup complete.
(APIServer pid=246) WARNING 09-15 08:09:29 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(APIServer pid=246) INFO 09-15 08:09:31 [chat_utils.py:470] Detected the chat template content format to be 'openai'. You can set --chat-template-content-format to override this.
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:31 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:32 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:32 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:32 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38954 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) INFO: 127.0.0.1:38956 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_0 pid=515) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 205, in execute_model
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1543, in execute_model
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] intermediate_tensors) = (self._prepare_inputs(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1096, in _prepare_inputs
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 792, in _execute_mm_encoder
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] curr_group_outputs = self.model.get_multimodal_embeddings(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 1468, in get_multimodal_embeddings
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] vision_embeddings = self._process_image_input(multimodal_input)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 1403, in _process_image_input
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] image_embeds = self.visual(pixel_values, grid_thw=grid_thw)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 771, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] x = blk(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 408, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] x = x + self.attn(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 299, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] x, _ = self.qkv(x)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 363, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = self.quant_method.apply(self, x, bias)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 240, in apply
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self.quant_method.apply(layer, x, bias, tp_rank)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/w8a8_dynamic.py", line 92, in apply
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = torch_npu.npu_quant_matmul(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596]     File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._op(*args, **(kwargs or {}))
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] RuntimeError: npu_quant_matmul_symint:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:204 NPU function error: call aclnnQuantMatmulWeightNz failed, error code is 161002
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] [ERROR] 2025-09-15-08:09:40 (PID:651, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] EZ1001: [PID: 651] 2025-09-15-08:09:40.391.235 Expected 1 dimension input, but got pertokenScaleOptional with sizes [29680,1].
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596]
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 205, in execute_model
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1543, in execute_model
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] intermediate_tensors) = (self._prepare_inputs(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1096, in _prepare_inputs
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 792, in _execute_mm_encoder
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] curr_group_outputs = self.model.get_multimodal_embeddings(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 1468, in get_multimodal_embeddings
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] vision_embeddings = self._process_image_input(multimodal_input)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 1403, in _process_image_input
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] image_embeds = self.visual(pixel_values, grid_thw=grid_thw)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 771, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] x = blk(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 408, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] x = x + self.attn(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/models/glm4_1v.py", line 299, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] x, _ = self.qkv(x)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 363, in forward
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = self.quant_method.apply(self, x, bias)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 240, in apply
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self.quant_method.apply(layer, x, bias, tp_rank)
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/w8a8_dynamic.py", line 92, in apply
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] output = torch_npu.npu_quant_matmul(
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596]     File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] return self._op(*args, **(kwargs or {}))
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] RuntimeError: npu_quant_matmul_symint:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:204 NPU function error: call aclnnQuantMatmulWeightNz failed, error code is 161002
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] [ERROR] 2025-09-15-08:09:40 (PID:651, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596] EZ1001: [PID: 651] 2025-09-15-08:09:40.391.235 Expected 1 dimension input, but got pertokenScaleOptional with sizes [29680,1].
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596]
(VllmWorker TP0 pid=651) ERROR 09-15 08:09:40 [multiproc_executor.py:596]
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.1) with config: model='/dpc/huangle/llmcompressor/glm45v/GLM-4___5V-W8A8-Dynamic-ALL/', speculative_config=None, tokenizer='/dpc/huangle/llmcompressor/glm45v/GLM-4___5V-W8A8-Dynamic-ALL/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=ascend, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='glm45'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=glm-4.5v, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":40,"local_cache_dir":null},
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-38c733b7095245bda3759fc9e18fea1f,prompt_token_ids_len=7434,mm_kwargs=[{'pixel_values': MultiModalFieldElem(modality='image', key='pixel_values', data=tensor([[ 0.0179, 0.0325, 0.0471, ..., 0.6367, 0.5977, 0.5547],
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] [ 0.1055, 0.1055, 0.0179, ..., 0.5547, 0.5234, 0.5391],
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] [-0.0405, -0.0405, -0.0698, ..., 0.6094, 0.5664, 0.5664],
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] ...,
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] [-1.5859, -1.6016, -1.6172, ..., -1.0547, -1.0547, -1.0547],
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] [-1.6016, -1.6016, -1.6016, ..., -1.0703, -1.0547, -1.0547],
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] [-1.6016, -1.6016, -1.6016, ..., -1.0703, -1.0703, -1.0703]],
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [dump_input.py:76] dtype=torch.bfloat16), field=MultiModalFlatField(slices=[[slice(0, 29680, None)]], dim=0)), 'image_grid_thw': MultiModalFieldElem(modality='image', key='image_grid_thw', data=tensor([ 1, 140, 212]), field=MultiModalBatchedField())}],mm_hashes=['8cc665a7c0e63c9b4dd4031e47a0b30cf586219565004244f4c31980bc83f713'],mm_positions=[PlaceholderRange(offset=11, length=7420, is_embed=None)],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.0001, top_k=1, min_p=0.0, seed=None, stop=[], stop_token_ids=[151336, 151338], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-38c733b7095245bda3759fc9e18fea1f: 2048}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-38c733b7095245bda3759fc9e18fea1f: [0]}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] EngineCore encountered a fatal error.
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] Traceback (most recent call last):
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 693, in run_engine_core
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] engine_core.run_busy_loop()
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 720, in run_busy_loop
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] self._process_engine_step()
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 745, in _process_engine_step
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] outputs, model_executed = self.step_fn()
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] ^^^^^^^^^^^^^^
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 288, in step
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] raise err
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] return model_fn(scheduler_output)
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 173, in execute_model
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] (output, ) = self.collective_rpc(
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 230, in get_response
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] raise RuntimeError(
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] RuntimeError: Worker failed with error 'npu_quant_matmul_symint:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:204 NPU function error: call aclnnQuantMatmulWeightNz failed, error code is 161002
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] [ERROR] 2025-09-15-08:09:40 (PID:651, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] EZ1001: [PID: 651] 2025-09-15-08:09:40.391.235 Expected 1 dimension input, but got pertokenScaleOptional with sizes [29680,1].
(EngineCore_0 pid=515) ERROR 09-15 08:09:40 [core.py:702] ', please check the stack trace above for the root cause
(VllmWorker TP0 pid=651) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP1 pid=657) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP2 pid=676) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP3 pid=914) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP4 pid=1159) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP5 pid=1404) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP6 pid=1649) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP7 pid=1894) INFO 09-15 08:09:40 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=246) INFO: 127.0.0.1:38958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] outputs = await engine_core.get_output_async()
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] raise self._format_exception(outputs) from None
(APIServer pid=246) ERROR 09-15 08:09:43 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] out = q.get_nowait() or await q.get()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise output
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] outputs = await engine_core.get_output_async()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise self._format_exception(outputs) from None
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] out = q.get_nowait() or await q.get()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise output
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] out = q.get_nowait() or await q.get()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise output
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] outputs = await engine_core.get_output_async()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise self._format_exception(outputs) from None
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] out = q.get_nowait() or await q.get()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise output
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] out = q.get_nowait() or await q.get()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise output
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] out = q.get_nowait() or await q.get()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise output
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] outputs = await engine_core.get_output_async()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise self._format_exception(outputs) from None
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38966 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38962 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38978 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) INFO: 127.0.0.1:38980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Error in chat completion stream generator.
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] Traceback (most recent call last):
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 545, in chat_completion_stream_generator
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] async for res in result_generator:
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 321, in generate
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] q = await self.add_request(
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 243, in add_request
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] raise EngineDeadError()
(APIServer pid=246) ERROR 09-15 08:09:43 [serving_chat.py:1051] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=246) WARNING 09-15 08:09:43 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38956 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) WARNING 09-15 08:09:43 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:43 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:43 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38958 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) INFO: 127.0.0.1:38984 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) INFO: 127.0.0.1:38986 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) WARNING 09-15 08:09:43 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38990 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: Shutting down
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38968 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) INFO: 127.0.0.1:38972 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) INFO: 127.0.0.1:38992 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38988 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38980 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38990 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) INFO: Waiting for connections to close. (CTRL+C to force quit)
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38958 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) WARNING 09-15 08:09:44 [protocol.py:81] The following fields were present in the request but ignored: {'do_sample'}
(APIServer pid=246) INFO: 127.0.0.1:38984 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) INFO: 127.0.0.1:38978 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=246) INFO: Waiting for application shutdown.
(APIServer pid=246) INFO: Application shutdown complete.
(APIServer pid=246) INFO: Finished server process [246]
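
For reference, the EZ1001 message points at the shape of the dynamic per-token quantization scale: aclnnQuantMatmulWeightNz expects a 1-D tensor (one scale per input row), but it receives a tensor of shape [29680, 1], which matches the number of vision patch tokens in this request. Below is a minimal sketch of the suspected shape issue; the wrapper name and the torch_npu.npu_quant_matmul keyword arguments are assumptions for illustration and are not the actual code in vllm_ascend/quantization/w8a8_dynamic.py.

```python
# Minimal sketch, assuming torch_npu's npu_quant_matmul keyword names; this is
# illustrative only and not the actual vllm-ascend implementation.
import torch
import torch_npu


def quant_matmul_with_1d_scale(x_int8, weight_int8, weight_scale,
                               pertoken_scale, bias=None):
    # The failing call passes a per-token scale of shape [29680, 1]; the
    # aclnnQuantMatmulWeightNz kernel reports EZ1001 unless it is 1-D.
    if pertoken_scale is not None and pertoken_scale.dim() == 2:
        pertoken_scale = pertoken_scale.view(-1)  # [num_tokens, 1] -> [num_tokens]
    return torch_npu.npu_quant_matmul(
        x_int8,                  # int8 activations
        weight_int8,             # int8 (NZ-formatted) weight
        weight_scale,            # per-channel weight scale
        pertoken_scale=pertoken_scale,
        bias=bias,
        output_dtype=torch.bfloat16,
    )
```

If flattening the scale this way avoids the EZ1001 error, the root cause is probably in how the per-token scale is shaped for the quantized vision QKV projection when both the vision and language layers are W8A8-quantized.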