Describe the bug
For both the basic1f1b and ZB-V schedules in the ZB runtime, I observe that the warm-up phase of the generated schedule is sub-optimal.
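For context, in a plain 1F1B pipeline the expected number of warm-up forward passes decreases with the stage index. Below is a minimal sketch of that textbook rule (my own illustration, not code from this repository), evaluated for the configuration in the reproduce script that follows:

```python
# Textbook 1F1B warm-up rule (illustrative, not from this repo):
# stage `rank` runs (pp_size - rank - 1) warm-up forwards,
# capped by the total number of microbatches.
def expected_1f1b_warmup(pp_size: int, num_microbatches: int):
    return [min(pp_size - rank - 1, num_microbatches) for rank in range(pp_size)]

# Script below: PP=4, TP=1 on 4 GPUs => DP=1,
# so num_microbatches = 16 (global) / (2 (micro) * 1 (DP)) = 8.
print(expected_1f1b_warmup(4, 8))  # [3, 2, 1, 0]
```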

To Reproduce
Script:
```bash
#!/bin/bash
# Pretrain a GPT model (adapted from the "345M" example script)
export CUDA_DEVICE_MAX_CONNECTIONS=8
GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CHECKPOINT_PATH=$1 #<Specify path>
VOCAB_FILE=/data/Megatron-LM/dataset/vocab/gpt2-vocab.json #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=/data/Megatron-LM/dataset/vocab/gpt2-merges.txt #<Specify path to file>/gpt2-merges.txt
DATA_PATH=/data/Megatron-LM/dataset/my-gpt2_text_document #<Specify path and file prefix>_text_document
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 4 \
--sequence-parallel \
--num-layers 24 \
--hidden-size 4096 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 8192 \
--micro-batch-size 2 \
--global-batch-size 16 \
--lr 0.00015 \
--train-iters 20 \
--lr-decay-iters 320 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--lr-warmup-fraction .01 \
--clip-grad 1.0 \
--fp16 \
--attention-softmax-in-fp32 \
"
DATA_ARGS="
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 20 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 1
"
options=" \
--untie-embeddings-and-output-weights \
--enable-zb-runtime \
--no-pre-communication-optimization \
--enable-optimizer-post-validation \
--transformer-impl local \
--no-create-attention-mask-in-dataloader \
--use-legacy-models
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    $options \
    --distributed-backend nccl
    # --save $CHECKPOINT_PATH \
    # --load $CHECKPOINT_PATH
```
Snippet used to print the schedule:
```python
# `conf` here is the ZB runtime configuration object that holds the
# generated schedule for this pipeline rank.
import torch

rank = torch.distributed.get_rank()
print(f'##### rank {rank} schedules:')
for node in conf.schedules:
    print(f"{node.type}-{node.microbatch}-{node.chunk}-{node.seq_split_idx}")
```