Reproduction
As the title says, it is impossible to train with GRPO in server mode using the vLLM 0.9.1 release.
We are simply following the previous recipe for server mode: first start a dedicated node running vLLM as the server, then start the training job and connect it to that server.
Here's a minimal slurm script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --dependency=singleton
TRAINING_SCRIPT="../src/open_r1/grpo.py"
ACCELERATE_CONFIG_FILE="../recipes/accelerate_configs/zero2.yaml"
CONFIG_FILE="../recipes/qwen2.5/grpo_qwen2.5_math_serv.yaml"
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
LOGS_DIR="${SAVE_DIR}/${NAME}/"   # SAVE_DIR and NAME are expected to be set in the environment
mkdir -p "${LOGS_DIR}"
DATETIME=$(date +'%Y-%m-%d_%H-%M-%S')
# -----------------
# 2) SLURM Node Info
# -----------------
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
TOTAL_NODES=$SLURM_NNODES
MASTER_NODE="${NODELIST[0]}"
MASTER_ADDR="$MASTER_NODE"
MASTER_PORT=6000
# We'll reserve the last node for vLLM
VLLM_NODE="${NODELIST[-1]}"
TRAINING_NUM_NODES=$((TOTAL_NODES - 1))
if [ $TRAINING_NUM_NODES -le 0 ]; then
echo "ERROR: Need at least 2 nodes total to do 'vLLM node + training nodes'."
exit 1
fi
TRAIN_NODES=("${NODELIST[@]:0:$TRAINING_NUM_NODES}")
# -----------------
# 3) Parallel Setup
# -----------------
TP=1
DP=8
GPUS_PER_NODE=8   # matches --gres=gpu:8
WORLD_SIZE=$((TRAINING_NUM_NODES * GPUS_PER_NODE))
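# With 2 nodes total this gives 1 training node x 8 GPUs = 8 training processes,
# while the dedicated vLLM node runs DP=8 engine replicas of TP=1 on its 8 GPUs.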
# -----------------
# 4) Launch vLLM (Background)
# -----------------
VLLM_PORT=8000
srun \
--nodes=1 \
--ntasks=1 \
--nodelist="$VLLM_NODE" \
--container-env=ALL \
--output="${LOGS_DIR}/vllm_%x_${DATETIME}.log" \
bash -c "
echo \"[vLLM Node] Starting vLLM on \$(hostname -s)\"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \\
python -m trl.scripts.vllm_serve \\
--model \"$MODEL\" \\
--tensor_parallel_size \"$TP\" \\
--data_parallel_size \"$DP\" \\
--host \"$VLLM_NODE\" \\
--port \"$VLLM_PORT\"
" &
sleep 5 # minimal pause so the srun for vLLM can be in flight
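# (Optional) Instead of a fixed sleep, poll the server until it answers. This is only a
# sketch: it assumes the TRL vLLM server exposes a /health/ endpoint (I believe it does
# in trl 0.19.0, but adjust the path if that has changed).
until curl -sf "http://${VLLM_NODE}:${VLLM_PORT}/health/" > /dev/null; do
  echo "Waiting for vLLM server at ${VLLM_NODE}:${VLLM_PORT} ..."
  sleep 10
done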
# -----------------
# 5) Launch Training
# -----------------
# Now do a second srun for the training side on the other nodes:
srun \
--nodes="$TRAINING_NUM_NODES" \
--ntasks="$TRAINING_NUM_NODES" \
--nodelist="$(IFS=,; echo "${TRAIN_NODES[*]}")" \
--container-env=ALL \
--output="${LOGS_DIR}/train_%x_${DATETIME}.log" \
bash -c "
CURRENT_NODE=\$(hostname -s)
MACHINE_RANK=\$SLURM_PROCID
accelerate launch --config_file $ACCELERATE_CONFIG_FILE \\
--num_machines $TRAINING_NUM_NODES \\
--num_processes $WORLD_SIZE \\
--main_process_ip $MASTER_ADDR \\
--main_process_port $MASTER_PORT \\
--machine_rank \$MACHINE_RANK \\
--rdzv_backend=c10d \\
--max_restarts 3 \\
--tee 3 \\
$TRAINING_SCRIPT \\
--config $CONFIG_FILE \\
--dataset_prompt_column problem \\
--vllm_server_host $VLLM_NODE \\
--vllm_server_port 8000
"
echo "All tasks completed at $(date)"
According to the logs, the vLLM server itself starts and runs fine; the training job simply cannot connect to it.
Something seems to have changed specifically in the vLLM 0.9.1 release that causes this.
Were there changes in VLLMClient that could explain it?
@qgallouedec @shirinyamani for visibility.
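As an extra data point, I will also try pinning vLLM back to a pre-0.9.1 release to confirm the regression is specific to 0.9.1. The pin below is just what I intend to test, not a known-good combination, and pip may warn about trl's vllm extra constraint:
pip install "vllm==0.8.5.post1"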
System Info
- vllm==0.9.1
- trl[vllm]==0.19.0
- transformers==4.52.3
- torch==2.6.0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
- Any traceback provided is complete