It's impossible to connect to vllm server in 0.9.1 and run GRPO in server mode #3648

@ahatamiz

Reproduction

As the title says, it is impossible to train with GRPO in server mode using the vLLM 0.9.1 release.

We are simply following the previous recipes for server mode: first, start a dedicated node running vLLM as the server, then launch the training job and connect it to that server.
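
In essence, the recipe boils down to two commands; a minimal sketch (the placeholder hostname is illustrative, the full SLURM version follows below):

# Node A: serve the policy model with vLLM (same flags as in the full script)
python -m trl.scripts.vllm_serve \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --tensor_parallel_size 1 \
  --data_parallel_size 8 \
  --host <vllm-node-hostname> \
  --port 8000

# Node B: launch GRPO training and point it at node A
accelerate launch --config_file ../recipes/accelerate_configs/zero2.yaml \
  ../src/open_r1/grpo.py \
  --config ../recipes/qwen2.5/grpo_qwen2.5_math_serv.yaml \
  --vllm_server_host <vllm-node-hostname> \
  --vllm_server_port 8000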

Here's a minimal slurm script:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --dependency=singleton

TRAINING_SCRIPT="../src/open_r1/grpo.py" 

ACCELERATE_CONFIG_FILE="../recipes/accelerate_configs/zero2.yaml"
CONFIG_FILE="../recipes/qwen2.5/grpo_qwen2.5_math_serv.yaml"

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

# NOTE: SAVE_DIR and NAME are assumed to be set in the submission environment
LOGS_DIR="${SAVE_DIR}/${NAME}/"
mkdir -p "${LOGS_DIR}"
DATETIME=$(date +'%Y-%m-%d_%H-%M-%S')

# -----------------
# 2) SLURM Node Info
# -----------------
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
TOTAL_NODES=$SLURM_NNODES
MASTER_NODE="${NODELIST[0]}"
MASTER_ADDR="$MASTER_NODE"
MASTER_PORT=6000

# We'll reserve the last node for vLLM
VLLM_NODE="${NODELIST[-1]}"

TRAINING_NUM_NODES=$((TOTAL_NODES - 1))

if [ $TRAINING_NUM_NODES -le 0 ]; then
    echo "ERROR: Need at least 2 nodes total to do 'vLLM node + training nodes'."
    exit 1
fi

TRAIN_NODES=("${NODELIST[@]:0:$TRAINING_NUM_NODES}")

# -----------------
# 3) Parallel Setup
# -----------------
TP=1
DP=8
GPUS_PER_NODE=8  # matches --gres=gpu:8 above
WORLD_SIZE=$((TRAINING_NUM_NODES * GPUS_PER_NODE))

# -----------------
# 4) Launch vLLM (Background)
# -----------------
VLLM_PORT=8000

srun \
  --nodes=1 \
  --ntasks=1 \
  --nodelist="$VLLM_NODE" \
  --container-env=ALL \
  --output="${LOGS_DIR}/vllm_%x_${DATETIME}.log" \
bash -c "
  echo \"[vLLM Node] Starting vLLM on \$(hostname -s)\"
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \\
    python -m trl.scripts.vllm_serve \\
      --model \"$MODEL\" \\
      --tensor_parallel_size \"$TP\" \\
      --data_parallel_size \"$DP\" \\
      --host \"$VLLM_NODE\" \\
      --port \"$VLLM_PORT\"
" &

sleep 5  # minimal pause so the srun for vLLM can be in flight

# -----------------
# 5) Launch Training
# -----------------
# Now do a second srun for the training side on the other nodes:
srun \
  --nodes="$TRAINING_NUM_NODES" \
  --ntasks="$TRAINING_NUM_NODES" \
  --nodelist="$(IFS=,; echo "${TRAIN_NODES[*]}")" \
  --container-env=ALL \
  --output="${LOGS_DIR}/train_%x_${DATETIME}.log" \
bash -c "
  CURRENT_NODE=\$(hostname -s)
  MACHINE_RANK=\$SLURM_PROCID
  accelerate launch --config_file $ACCELERATE_CONFIG_FILE \\
    --num_machines $TRAINING_NUM_NODES \\
    --num_processes $WORLD_SIZE \\
    --main_process_ip $MASTER_ADDR \\
    --main_process_port $MASTER_PORT \\
    --machine_rank \$MACHINE_RANK \\
    --rdzv_backend=c10d \\
    --max_restarts 3 \\
    --tee 3 \\
    $TRAINING_SCRIPT \\
      --config $CONFIG_FILE \\
      --dataset-prompt-column problem \\
      --vllm_server_host $VLLM_NODE \\
      --vllm_server_port 8000
"

echo "All tasks completed at $(date)"

I can confirm from the logs that the vLLM server itself starts and runs fine; it is just that the training job cannot connect to it.
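
One way to separate the two failure modes (server not up vs. client unable to reach it) is a manual probe from one of the training nodes. A quick sketch, assuming the /health/ endpoint that trl.scripts.vllm_serve exposes (hostname illustrative):

# Run from a training node; replace the hostname with the value of $VLLM_NODE.
# If this returns HTTP 200 but GRPO still fails to connect, the problem is on the client side.
curl -v "http://<vllm-node-hostname>:8000/health/"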

Something seems to have changed specifically in the vLLM 0.9.1 release that causes this issue.
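
Pinning the previous vLLM release and rerunning the exact same script would confirm this; a sketch of the check (assuming vllm 0.9.0 is otherwise compatible with trl 0.19.0):

# Downgrade vLLM in the same environment and rerun the SLURM job.
# If the client connects again, the regression is in vllm 0.9.1
# (or in how trl 0.19.0 talks to it).
pip install "vllm==0.9.0"
pip show vllm trl | grep -E "^(Name|Version)"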

Are there changes in VLLMClient that could be causing this?

@qgallouedec @shirinyamani for visibility.

System Info

vllm=0.9.1
trl[vllm]=0.19.0
transformers=4.52.3
torch=2.6.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete

Labels: 🏋 GRPO, 🐛 bug