vllm: 0.10.1 + torch 2.7.1 + cuda 12.6
transformers: 4.56.0.dev0
command: vllm serve GLM-4.5V -tp 4
if __name__ == "__main__":
    video_path = "test.mp4"
    # video of 10 s at 5 FPS
    video_base64 = process_video_to_base64(video_path)
    if not video_base64:
        exit(1)

    # Prepare the message content
    user_message = "What's in this video?"
    text_tokens = count_tokens(user_message)
    video_tokens = estimate_video_tokens(video_base64)
    print(f"\nInput text tokens: {text_tokens}")
    print(f"Video data tokens: {video_tokens}")
    print(f"Estimated total input tokens: {text_tokens + video_tokens}")

    # Send the request
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": user_message},
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_base64}"}}
            ]
        }],
        max_tokens=16384,
        extra_body={
            "media_io_kwargs": {
                "video": {
                    "num_frames": 50
                }
            }
        },
        temperature=0.6
    )

    # Get the response content
    response_content = response.choices[0].message.content

    # Compute token statistics
    prompt_tokens = response.usage.prompt_tokens if hasattr(response, 'usage') else text_tokens + video_tokens
    completion_tokens = response.usage.completion_tokens if hasattr(response, 'usage') else count_tokens(response_content)
    total_tokens = response.usage.total_tokens if hasattr(response, 'usage') else prompt_tokens + completion_tokens

    # Print the statistics
    print_token_stats(prompt_tokens, completion_tokens, total_tokens)

    # Print the response content
    print("\nResponse content:")
    print("-" * 50)
    print(response_content)
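For completeness, here is a minimal, hypothetical sketch of the imports, client setup, and helper functions the script above assumes (they are omitted from the repro; the 576 tokens/frame figure only mirrors the estimate printed in the output below and is not something vLLM reports):

# Hypothetical reconstruction of the omitted setup/helpers, for completeness only.
import base64
import os

from openai import OpenAI

# vllm serve listens on port 8000 by default; the API key is not checked locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "GLM-4.5V"


def process_video_to_base64(video_path: str) -> str:
    """Read the video file and return it as a base64 string ('' on failure)."""
    if not os.path.exists(video_path):
        print(f"Video not found: {video_path}")
        return ""
    with open(video_path, "rb") as f:
        data = f.read()
    print(f"Video processed successfully, data size: {len(data)} bytes")
    return base64.b64encode(data).decode("utf-8")


def count_tokens(text: str) -> int:
    """Very rough text-token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def estimate_video_tokens(video_base64: str, num_frames: int = 50,
                          tokens_per_frame: float = 576.0) -> int:
    """Rough visual-token estimate: frames × tokens-per-frame (illustrative only)."""
    total = int(num_frames * tokens_per_frame)
    print(f"Video token estimate: {num_frames} frames × {tokens_per_frame:.2f} tokens/frame = {total:,} tokens")
    return total


def print_token_stats(prompt_tokens: int, completion_tokens: int, total_tokens: int) -> None:
    """Print a small usage summary; the price per token is a placeholder."""
    price_per_token = 0.00001  # illustrative rate, not a real GLM price
    print("=" * 50)
    print("TOKEN STATISTICS")
    print("=" * 50)
    print(f"Input tokens: {prompt_tokens:,}")
    print(f"Output tokens: {completion_tokens:,}")
    print(f"Total tokens: {total_tokens:,}")
    print(f"Estimated cost: ${total_tokens * price_per_token:.4f}")
    print("=" * 50)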
I ran the script twice. The first time, inference worked fine, but the reported token count was significantly lower than it should be:
Video processed successfully, data size: 4502212 bytes
Video token estimate: 50 frames × 576.00 tokens/frame = 28,800 tokens
Input text tokens: 6
Video data tokens: 28800
Estimated total input tokens: 28806
==================================================
TOKEN STATISTICS
==================================================
Input tokens: 4,067
Output tokens: 65
Total tokens: 4,132
Estimated cost: $0.0426
==================================================
And on the server side I noticed:
(APIServer pid=335067) WARNING 08-19 21:56:02 [glm4_1v.py:1095] Total frames in metadata (50) does not match the length of video array 32. This can be because the video is resampled in advance. This may cause a divergence with HF implementation.
Then I ran the same script a second time and got:
(APIServer pid=335067) WARNING 08-19 21:57:38 [protocol.py:81] The following fields were present in the request but ignored: {'media_io_kwargs'}
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP3 pid=335467) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP2 pid=335466) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=335465) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1550, in execute_model
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self._execute_mm_encoder(scheduler_output)
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1213, in _execute_mm_encoder
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] self.encoder_cache[req_id][input_id] = scatter_mm_placeholders(
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 186, in scatter_mm_placeholders
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] placeholders[is_embed] = embeds
(VllmWorker TP0 pid=335464) ERROR 08-19 21:57:38 [multiproc_executor.py:596] RuntimeError: shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1) with config: model='GLM-4.5V', speculative_config=None, tokenizer='GLM-4.5V', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=GLM-4.5V, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null},
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-fb31cdc5b5a24f6e8c4440e439289a31,prompt_token_ids_len=5804,mm_kwargs=[{'video_grid_thw': MultiModalFieldElem(modality='video', key='video_grid_thw', data=tensor([ 7, 36, 64]), field=MultiModalBatchedField()), 'pixel_values_videos': MultiModalFieldElem(modality='video', key='pixel_values_videos', data=tensor([[-0.6094, -0.6680, -0.4922, ..., 0.7383, 0.9102, 0.3535],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [-0.6992, -0.7109, -0.7109, ..., 1.9297, 1.5469, 1.6328],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [-0.6094, -1.0078, 0.6445, ..., 1.5234, 1.4922, 1.4766],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] ...,
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [ 1.0859, 1.1016, 1.0859, ..., 0.2695, 0.4824, 0.3828],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [ 1.6641, 0.7344, 0.5156, ..., 0.2969, 0.2832, 0.2969],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] [ 0.3105, 0.8945, 1.6250, ..., 0.3262, 0.3965, 0.3398]],
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:76] dtype=torch.bfloat16), field=MultiModalFlatField(slices=[[slice(0, 16128, None)]], dim=0))}],mm_hashes=['a07ffcd400f73825081ffb78773726944676f96a54da45c355875a860c64b4b4'],mm_positions=[PlaceholderRange(offset=10, length=5792, is_embed=tensor([False, False, True, ..., False, False, False]))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.0001, top_k=1, min_p=0.0, seed=None, stop=[], stop_token_ids=[151336, 151338], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16384, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368],),num_computed_tokens=4064,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-fb31cdc5b5a24f6e8c4440e439289a31: 1740}, total_num_scheduled_tokens=1740, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-fb31cdc5b5a24f6e8c4440e439289a31: [0]}, num_common_prefix_blocks=[363], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.008323234170992122, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=5804, hits=4064), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] EngineCore encountered a fatal error.
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] Traceback (most recent call last):
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] engine_core.run_busy_loop()
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] self._process_engine_step()
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] outputs, model_executed = self.step_fn()
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 288, in step
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] raise err
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] return model_fn(scheduler_output)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 173, in execute_model
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] (output, ) = self.collective_rpc(
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 230, in get_response
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] raise RuntimeError(
(EngineCore_0 pid=335330) ERROR 08-19 21:57:38 [core.py:702] RuntimeError: Worker failed with error 'shape mismatch: value tensor of shape [4032, 4096] cannot be broadcast to indexing result of shape [5760, 4096]', please check the stack trace above for the root cause
(VllmWorker TP0 pid=335464) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] outputs = await engine_core.get_output_async()
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] raise self._format_exception(outputs) from None
(APIServer pid=335067) ERROR 08-19 21:57:38 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(VllmWorker TP1 pid=335465) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP2 pid=335466) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=335067) INFO: 127.0.0.1:56284 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(VllmWorker TP3 pid=335467) INFO 08-19 21:57:38 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=335067) INFO: Shutting down
(APIServer pid=335067) INFO: Waiting for application shutdown.
(APIServer pid=335067) INFO: Application shutdown complete.
(APIServer pid=335067) INFO: Finished server process [335067]
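For what it's worth, the numbers in the dump seem consistent with a frame-count mismatch between the prompt placeholders and what the vision encoder actually produced. A rough sanity check (my own reading, assuming GLM-4.5V's 2×2 spatial patch merge; not an authoritative analysis):

# Rough sanity check of the shapes in the dump above (assumptions noted in comments).
t, h, w = 7, 36, 64                    # video_grid_thw from the scheduler dump
patches = t * h * w                    # 16128, matches the pixel_values_videos slice
tokens_per_step = (h * w) // (2 * 2)   # 576 merged tokens per temporal grid step (assumed 2x2 merge)
encoder_tokens = t * tokens_per_step   # 4032 -> the "value tensor" side of the error
placeholder_tokens = 5760              # the "indexing result" side of the error
print(placeholder_tokens // tokens_per_step)  # 10 temporal steps expected vs 7 produced by the encoder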
I am wondering: if I want to send video to GLM, how should I do it?
I have long videos and want to split them into 10-second, 5 FPS chunks and send each chunk to GLM, but that does not seem to work.