
A800 * 4: vLLM stuck during deployment #128

@Charmnut

Description

System Info

Using the docker.io/vllm/vllm-openai:v0.10.0 image.
Host: Driver Version: 535.54.03, CUDA Version: 12.2
Container: Driver Version: 535.54.03, CUDA Version: 12.8
Python: 3.12.11
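
To confirm what the container actually sees, a quick check (a sketch; output will vary by environment) is:

nvidia-smi
# CUDA and NCCL versions PyTorch itself was built against
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"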

Logs (stuck here with no further progress):

(VllmWorker TP0 pid=1188) INFO 08-13 17:30:51 [__init__.py:1392] Found nccl from library libnccl.so.2
(VllmWorker TP0 pid=1188) INFO 08-13 17:30:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker TP2 pid=1190) INFO 08-13 17:30:51 [__init__.py:1392] Found nccl from library libnccl.so.2
(VllmWorker TP2 pid=1190) INFO 08-13 17:30:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker TP3 pid=1191) INFO 08-13 17:30:51 [__init__.py:1392] Found nccl from library libnccl.so.2
(VllmWorker TP1 pid=1189) INFO 08-13 17:30:51 [__init__.py:1392] Found nccl from library libnccl.so.2
(VllmWorker TP3 pid=1191) INFO 08-13 17:30:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker TP1 pid=1189) INFO 08-13 17:30:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker TP2 pid=1190) WARNING 08-13 17:30:52 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP3 pid=1191) WARNING 08-13 17:30:52 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP1 pid=1189) WARNING 08-13 17:30:52 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP0 pid=1188) WARNING 08-13 17:30:52 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP0 pid=1188) INFO 08-13 17:30:52 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_8a3aacc6'), local_subscribe_addr='ipc:///tmp/da685dc9-aa00-4e77-a981-21a233981933', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP2 pid=1190) INFO 08-13 17:30:52 [parallel_state.py:1134] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker TP3 pid=1191) INFO 08-13 17:30:52 [parallel_state.py:1134] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker TP1 pid=1189) INFO 08-13 17:30:52 [parallel_state.py:1134] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker TP0 pid=1188) INFO 08-13 17:30:52 [parallel_state.py:1134] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker TP0 pid=1188) INFO 08-13 17:30:53 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP1 pid=1189) INFO 08-13 17:30:53 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP2 pid=1190) INFO 08-13 17:30:53 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP3 pid=1191) INFO 08-13 17:30:53 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP0 pid=1188) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker TP1 pid=1189) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker TP2 pid=1190) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker TP3 pid=1191) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
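
The hang occurs right after NCCL initialization on a PCIe-only multi-GPU host, so a reasonable next step is to relaunch with more verbose logging and, if needed, rule out peer-to-peer transport. These are standard NCCL/vLLM environment variables, but whether they resolve this particular hang is unconfirmed:

# Print NCCL's view of the topology and show where ranks block
export NCCL_DEBUG=INFO
export VLLM_LOGGING_LEVEL=DEBUG
# PCIe-only hosts sometimes hang on P2P transport; disabling it is a common isolation test
export NCCL_P2P_DISABLE=1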

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

docker pull docker.io/vllm/vllm-openai:v0.10.0
docker run --entrypoint /bin/bash -itd --gpus=all --name=glm-4.5v --net=host --shm-size=4g docker.io/vllm/vllm-openai:v0.10.0
docker exec -it glm-4.5v /bin/bash

In container:

pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/models/GLM-4.5V --trust-remote-code \
     --tensor-parallel-size 4 \
     --tool-call-parser glm45 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm_45v \
     --media-io-kwargs '{"video": {"num_frames": -1}}'
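
The warnings in the log show custom allreduce is already disabled on this PCIe-only topology; passing the flag explicitly silences them, as the warning itself suggests. A sketch of the relaunch, not a confirmed fix (--disable-custom-all-reduce is assumed to be the CLI spelling of the disable_custom_all_reduce option the warning names):

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/models/GLM-4.5V --trust-remote-code \
     --tensor-parallel-size 4 \
     --disable-custom-all-reduce \
     --tool-call-parser glm45 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm_45v \
     --media-io-kwargs '{"video": {"num_frames": -1}}'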

Expected behavior

Inference runs normally.
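
Once the server does come up, a minimal smoke test against the OpenAI-compatible endpoint (assuming the default port 8000 and the served model name from the command above) would be:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm_45v", "messages": [{"role": "user", "content": "Hello"}]}'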
