
[BUG] Sub-optimal Schedule of ZB Runtime #73

@hua0x522


Describe the bug
For the basic 1F1B and ZB-V schedules in the ZB runtime, I observe that the warm-up stage schedule is sub-optimal.

[Image: profile trace of basic1f1b.] In this trace, the F-0 node on rank 1 begins only after the F-1 node on rank 0 finishes. In fact, it could begin right after the F-0 node on rank 0. The same problem can be observed in ZB-V.
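To make the claimed inefficiency concrete, here is a minimal sketch (not the ZB runtime's actual scheduler) that computes the earliest possible start time of each warm-up forward under the true data dependencies: F-m on rank r only needs F-m from rank r-1 plus the previous op on its own rank. The uniform forward time `F` and the function name are assumptions for illustration.

```python
F = 1.0  # assumed uniform forward time per microbatch (illustration only)

def warmup_starts(num_ranks, num_warmup):
    """start[r][m] = earliest start of forward microbatch m on rank r."""
    start = [[0.0] * num_warmup for _ in range(num_ranks)]
    for r in range(num_ranks):
        for m in range(num_warmup):
            # Dependency 1: F-m on the previous pipeline rank must finish.
            ready_prev_rank = start[r - 1][m] + F if r > 0 else 0.0
            # Dependency 2: the previous forward on this rank must finish.
            ready_same_rank = start[r][m - 1] + F if m > 0 else 0.0
            start[r][m] = max(ready_prev_rank, ready_same_rank)
    return start

starts = warmup_starts(num_ranks=4, num_warmup=4)
# Rank 1's F-0 can begin at t = 1.0, i.e. immediately after rank 0's F-0;
# it does not need to wait for rank 0's F-1, unlike the observed trace.
print(starts[1][0])  # → 1.0
```

Under these dependencies alone, each downstream rank can start its first forward one forward-time after its predecessor, which is what the trace above fails to achieve.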

To Reproduce

script

```bash
#!/bin/bash

# Runs the "345M" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=8

GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=$1 #<Specify path>
VOCAB_FILE=/data/Megatron-LM/dataset/vocab/gpt2-vocab.json #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=/data/Megatron-LM/dataset/vocab/gpt2-merges.txt  #<Specify path to file>/gpt2-merges.txt
DATA_PATH=/data/Megatron-LM/dataset/my-gpt2_text_document #<Specify path and file prefix>_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 4 \
    --sequence-parallel \
    --num-layers 24 \
    --hidden-size 4096 \
    --num-attention-heads 16 \
    --seq-length 512 \
    --max-position-embeddings 8192 \
    --micro-batch-size 2 \
    --global-batch-size 16 \
    --lr 0.00015 \
    --train-iters 20 \
    --lr-decay-iters 320 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --attention-softmax-in-fp32 \
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --split 949,50,1
"

OUTPUT_ARGS="
    --log-interval 20 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 1
"

options=" \
    --untie-embeddings-and-output-weights \
    --enable-zb-runtime \
    --no-pre-communication-optimization \
    --enable-optimizer-post-validation \
    --transformer-impl local \
    --no-create-attention-mask-in-dataloader \
    --use-legacy-models
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    $options \
    --distributed-backend nccl
    # --save $CHECKPOINT_PATH \
    # --load $CHECKPOINT_PATH
```

Code used to print the schedule:

```python
rank = torch.distributed.get_rank()
print(f'##### rank {rank} schedules:')
for node in conf.schedules:
    print(f"{node.type}-{node.microbatch}-{node.chunk}-{node.seq_split_idx}")
```
