vllm: 0.10.1 + torch 2.7.1 + cuda 12.6
transformers: 4.56.0.dev0
command: vllm serve GLM-4.5V -tp 4
if __name__ == "__main__":
    video_path = "test.mp4"
    # video of 10 s at 5 FPS
    video_base64 = process_video_to_base64(video_path)
    if not video_base64:
        exit(1)

    # Prepare the message content
    user_message = "What's in this video?"
    text_tokens = count_tokens(user_message)
    video_tokens = estimate_video_tokens(video_base64)
    print(f"\nInput text tokens: {text_tokens}")
    print(f"Video data tokens: {video_tokens}")
    print(f"Estimated total input tokens: {text_tokens + video_tokens}")

    # Send the request
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": user_message},
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_base64}"}}
            ]
        }],
        max_tokens=16384,
        extra_body={
            "media_io_kwargs": {
                "video": {
                    "num_frames": 50
                }
            }
        },
        temperature=0.6
    )

    # Get the response content
    response_content = response.choices[0].message.content

    # Compute token statistics
    prompt_tokens = response.usage.prompt_tokens if hasattr(response, 'usage') else text_tokens + video_tokens
    completion_tokens = response.usage.completion_tokens if hasattr(response, 'usage') else count_tokens(response_content)
    total_tokens = response.usage.total_tokens if hasattr(response, 'usage') else prompt_tokens + completion_tokens

    # Print the statistics
    print_token_stats(prompt_tokens, completion_tokens, total_tokens)

    # Print the response content
    print("\nResponse content:")
    print("-" * 50)
    print(response_content)
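For completeness, here is a minimal, hypothetical sketch of the imports, client setup, and helper functions the script above assumes (they are omitted from the repro; the 576 tokens/frame figure only mirrors the estimate printed in the output below and is not something vLLM reports):

# Hypothetical reconstruction of the omitted setup/helpers, for completeness only.
import base64
import os

from openai import OpenAI

# vllm serve listens on port 8000 by default; the API key is not checked locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "GLM-4.5V"


def process_video_to_base64(video_path: str) -> str:
    """Read the video file and return it as a base64 string ('' on failure)."""
    if not os.path.exists(video_path):
        print(f"Video not found: {video_path}")
        return ""
    with open(video_path, "rb") as f:
        data = f.read()
    print(f"Video processed successfully, data size: {len(data)} bytes")
    return base64.b64encode(data).decode("utf-8")


def count_tokens(text: str) -> int:
    """Very rough text-token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def estimate_video_tokens(video_base64: str, num_frames: int = 50,
                          tokens_per_frame: float = 576.0) -> int:
    """Rough visual-token estimate: frames × tokens-per-frame (illustrative only)."""
    total = int(num_frames * tokens_per_frame)
    print(f"Video token estimate: {num_frames} frames × {tokens_per_frame:.2f} tokens/frame = {total:,} tokens")
    return total


def print_token_stats(prompt_tokens: int, completion_tokens: int, total_tokens: int) -> None:
    """Print a small usage summary; the price per token is a placeholder."""
    price_per_token = 0.00001  # illustrative rate, not a real GLM price
    print("=" * 50)
    print("TOKEN STATISTICS")
    print("=" * 50)
    print(f"Input tokens: {prompt_tokens:,}")
    print(f"Output tokens: {completion_tokens:,}")
    print(f"Total tokens: {total_tokens:,}")
    print(f"Estimated cost: ${total_tokens * price_per_token:.4f}")
    print("=" * 50)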
I ran the script twice. The first time, inference worked fine, but the reported token count was significantly lower than it should be:
Video processed successfully, data size: 4502212 bytes
Video token estimate: 50 frames × 576.00 tokens/frame = 28,800 tokens
Input text tokens: 6
Video data tokens: 28800
Estimated total input tokens: 28806
==================================================
TOKEN STATISTICS
==================================================
Input tokens: 4,067
Output tokens: 65
Total tokens: 4,132
Estimated cost: $0.0426
==================================================
And on the server side I noticed:
(APIServer pid=335067) WARNING 08-19 21:56:02 [glm4_1v.py:1095] Total frames in metadata (50) does not match the length of video array 32. This can be because the video is resampled in advance. This may cause a divergence with HF implementation.
Then I ran the same script a second time and got:
(APIServer pid=335067) WARNING 08-19 21:57:38 [protocol.py:81] The following fields were present in the request but ignored: {'media_io_kwargs'}
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1) with config: model='GLM-4.5V', speculative_config=None, tokenizer='GLM-4.5V', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=GLM-4.5V, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null},
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-fb31cdc5b5a24f6e8c4440e439289a31,prompt_token_ids_len=5804,mm_kwargs=[{'video_grid_thw': MultiModalFieldElem(modality='video', key='video_grid_thw', data=tensor([ 7, 36, 64]), field=MultiModalBatchedField()), 'pixel_values_videos': MultiModalFieldElem(modality='video', key='pixel_values_videos', data=tensor([[-0.6094, -0.6680, -0.4922, ..., 0.7383, 0.9102, 0.3535],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [-0.6992, -0.7109, -0.7109, ..., 1.9297, 1.5469, 1.6328],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [-0.6094, -1.0078, 0.6445, ..., 1.5234, 1.4922, 1.4766],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] ...,
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [ 1.0859, 1.1016, 1.0859, ..., 0.2695, 0.4824, 0.3828],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [ 1.6641, 0.7344, 0.5156, ..., 0.2969, 0.2832, 0.2969],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [ 0.3105, 0.8945, 1.6250, ..., 0.3262, 0.3965, 0.3398]],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] dtype=torch.bfloat16), field=MultiModalFlatField(slices=[[slice(0, 16128, None)]], dim=0))}],mm_hashes=['a07ffcd400f73825081ffb78773726944676f96a54da45c355875a860c64b4b4'],mm_positions=[PlaceholderRange(offset=10, length=5792, is_embed=tensor([False, False, True, ..., False, False, False]))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.0001, top_k=1, min_p=0.0, seed=None, stop=[], stop_token_ids=[151336, 151338], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16384, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368],),num_computed_tokens=4064,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-fb31cdc5b5a24f6e8c4440e439289a31: 1740}, total_num_scheduled_tokens=1740, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-fb31cdc5b5a24f6e8c4440e439289a31: [0]}, num_common_prefix_blocks=[363], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.008323234170992122, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=5804, hits=4064), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] EngineCore encountered a fatal error.
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] Traceback (most recent call last):
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] engine_core.run_busy_loop()
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] self._process_engine_step()
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] outputs, model_executed = self.step_fn()
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 288, in step
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] raise err
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] return model_fn(scheduler_output)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 173, in execute_model
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] (output, ) = self.collective_rpc(
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 230, in get_response
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] raise RuntimeError(
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] RuntimeError: Worker failed with error 'shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]', please check the stack trace above for the root cause
(VllmWorker TP0 pid=335464) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] outputs = await engine_core.get_output_async()
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] raise self._format_exception(outputs) from None
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(VllmWorker TP1 pid=335465) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP2 pid=335466) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=335067) INFO: 127.0.0.1:56284 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(VllmWorker TP3 pid=335467) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=335067) INFO: Shutting down
(APIServer pid=335067) INFO: Waiting for application shutdown.
(APIServer pid=335067) INFO: Application shutdown complete.
(APIServer pid=335067) INFO: Finished server process [335067]
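For what it's worth, the numbers in the dump seem consistent with a frame-count mismatch between the prompt placeholders and what the vision encoder actually produced. A rough sanity check (my own reading, assuming GLM-4.5V's 2×2 spatial patch merge; not an authoritative analysis):

# Rough sanity check of the shapes in the dump above (assumptions noted in comments).
t, h, w = 7, 36, 64                    # video_grid_thw from the scheduler dump
patches = t * h * w                    # 16128, matches the pixel_values_videos slice
tokens_per_step = (h * w) // (2 * 2)   # 576 merged tokens per temporal grid step (assumed 2x2 merge)
encoder_tokens = t * tokens_per_step   # 4032 -> the "value tensor" side of the error
placeholder_tokens = 5760              # the "indexing result" side of the error
print(placeholder_tokens // tokens_per_step)  # 10 temporal steps expected vs 7 produced by the encoder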
I am wondering: if I want to send video to GLM, how should I do it?
I have long videos and want to split them into 10-second, 5 FPS chunks and send each chunk to GLM, but that does not seem to work.