Reproduction
As the title says, it is impossible to train with GRPO in server mode using the vLLM 0.9.1 release.
We are simply following the previous recipe for server mode: first start a dedicated node running vLLM as the server, then start the training job and connect it to that server.
Here's a minimal slurm script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --dependency=singleton
TRAINING_SCRIPT="../src/open_r1/grpo.py"
ACCELERATE_CONFIG_FILE="../recipes/accelerate_configs/zero2.yaml"
CONFIG_FILE="../recipes/qwen2.5/grpo_qwen2.5_math_serv.yaml"
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
LOGS_DIR="${SAVE_DIR}/${NAME}/"   # SAVE_DIR and NAME are expected to be set in the environment
mkdir -p "${LOGS_DIR}"
DATETIME=$(date +'%Y-%m-%d_%H-%M-%S')
# -----------------
# 2) SLURM Node Info
# -----------------
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
TOTAL_NODES=$SLURM_NNODES
MASTER_NODE="${NODELIST[0]}"
MASTER_ADDR="$MASTER_NODE"
MASTER_PORT=6000
# We'll reserve the last node for vLLM
VLLM_NODE="${NODELIST[-1]}"
TRAINING_NUM_NODES=$((TOTAL_NODES - 1))
if [ $TRAINING_NUM_NODES -le 0 ]; then
echo "ERROR: Need at least 2 nodes total to do 'vLLM node + training nodes'."
exit 1
fi
TRAIN_NODES=("${NODELIST[@]:0:$TRAINING_NUM_NODES}")
# -----------------
# 3) Parallel Setup
# -----------------
TP=1
DP=8
GPUS_PER_NODE=8   # matches --gres=gpu:8
WORLD_SIZE=$((TRAINING_NUM_NODES * GPUS_PER_NODE))
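# With 2 nodes total this gives 1 training node x 8 GPUs = 8 training processes,
# while the dedicated vLLM node runs DP=8 engine replicas of TP=1 on its 8 GPUs.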
# -----------------
# 4) Launch vLLM (Background)
# -----------------
VLLM_PORT=8000
srun \
--nodes=1 \
--ntasks=1 \
--nodelist="$VLLM_NODE" \
--container-env=ALL \
--output="${LOGS_DIR}/vllm_%x_${DATETIME}.log" \
bash -c "
echo \"[vLLM Node] Starting vLLM on \$(hostname -s)\"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \\
python -m trl.scripts.vllm_serve \\
--model \"$MODEL\" \\
--tensor_parallel_size \"$TP\" \\
--data_parallel_size \"$DP\" \\
--host \"$VLLM_NODE\" \\
--port \"$VLLM_PORT\"
" &
sleep 5 # minimal pause so the srun for vLLM can be in flight
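# (Optional) Instead of a fixed sleep, poll the server until it answers. This is only a
# sketch: it assumes the TRL vLLM server exposes a /health/ endpoint (I believe it does
# in trl 0.19.0, but adjust the path if that has changed).
until curl -sf "http://${VLLM_NODE}:${VLLM_PORT}/health/" > /dev/null; do
  echo "Waiting for vLLM server at ${VLLM_NODE}:${VLLM_PORT} ..."
  sleep 10
done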
# -----------------
# 5) Launch Training
# -----------------
# Now do a second srun for the training side on the other nodes:
srun \
--nodes="$TRAINING_NUM_NODES" \
--ntasks="$TRAINING_NUM_NODES" \
--nodelist="$(IFS=,; echo "${TRAIN_NODES[*]}")" \
--container-env=ALL \
--output="${LOGS_DIR}/train_%x_${DATETIME}.log" \
bash -c "
CURRENT_NODE=\$(hostname -s)
MACHINE_RANK=\$SLURM_PROCID
accelerate launch --config_file $ACCELERATE_CONFIG_FILE \\
--num_machines $TRAINING_NUM_NODES \\
--num_processes $WORLD_SIZE \\
--main_process_ip $MASTER_ADDR \\
--main_process_port $MASTER_PORT \\
--machine_rank \$MACHINE_RANK \\
--rdzv_backend=c10d \\
--max_restarts 3 \\
--tee 3 \\
$TRAINING_SCRIPT \\
--config $CONFIG_FILE \\
--dataset_prompt_column problem \\
--vllm_server_host $VLLM_NODE \\
--vllm_server_port 8000
"
echo "All tasks completed at $(date)"
According to the logs, the vLLM server itself starts and runs fine; the training job simply cannot connect to it.
Something seems to have changed specifically in the vLLM 0.9.1 release that causes this.
Were there changes in VLLMClient that could explain it?
@qgallouedec @shirinyamani for visibility.
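As an extra data point, I will also try pinning vLLM back to a pre-0.9.1 release to confirm the regression is specific to 0.9.1. The pin below is just what I intend to test, not a known-good combination, and pip may warn about trl's vllm extra constraint:
pip install "vllm==0.8.5.post1"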
System Info
- vllm==0.9.1
- trl[vllm]==0.19.0
- transformers==4.52.3
- torch==2.6.0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
- Any traceback provided is complete