
[BUG] Sub-optimal Schedule of ZB Runtime #73

@hua0x522


Describe the bug
For the basic 1F1B and ZB-V schedules in the ZB runtime, I observe that the warm-up stage schedule is sub-optimal.

[Image: profile trace of basic1f1b.] In this trace, the F-0 node on rank 1 begins only after the F-1 node on rank 0 finishes. In fact, it could begin right after the F-0 node on rank 0. The same problem can be observed in ZB-V.
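To make the claimed inefficiency concrete, here is a minimal sketch (not the ZB runtime's actual scheduler) that computes the earliest possible start time of each warm-up forward under the true data dependencies: F-m on rank r only needs F-m from rank r-1 plus the previous op on its own rank. The uniform forward time `F` and the function name are assumptions for illustration.

```python
F = 1.0  # assumed uniform forward time per microbatch (illustration only)

def warmup_starts(num_ranks, num_warmup):
    """start[r][m] = earliest start of forward microbatch m on rank r."""
    start = [[0.0] * num_warmup for _ in range(num_ranks)]
    for r in range(num_ranks):
        for m in range(num_warmup):
            # Dependency 1: F-m on the previous pipeline rank must finish.
            ready_prev_rank = start[r - 1][m] + F if r > 0 else 0.0
            # Dependency 2: the previous forward on this rank must finish.
            ready_same_rank = start[r][m - 1] + F if m > 0 else 0.0
            start[r][m] = max(ready_prev_rank, ready_same_rank)
    return start

starts = warmup_starts(num_ranks=4, num_warmup=4)
# Rank 1's F-0 can begin at t = 1.0, i.e. immediately after rank 0's F-0;
# it does not need to wait for rank 0's F-1, unlike the observed trace.
print(starts[1][0])  # → 1.0
```

Under these dependencies alone, each downstream rank can start its first forward one forward-time after its predecessor, which is what the trace above fails to achieve.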

To Reproduce

script

```bash
#!/bin/bash

# Runs the "345M" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=8

GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=$1 #<Specify path>
VOCAB_FILE=/data/Megatron-LM/dataset/vocab/gpt2-vocab.json #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=/data/Megatron-LM/dataset/vocab/gpt2-merges.txt  #<Specify path to file>/gpt2-merges.txt
DATA_PATH=/data/Megatron-LM/dataset/my-gpt2_text_document #<Specify path and file prefix>_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 4 \
    --sequence-parallel \
    --num-layers 24 \
    --hidden-size 4096 \
    --num-attention-heads 16 \
    --seq-length 512 \
    --max-position-embeddings 8192 \
    --micro-batch-size 2 \
    --global-batch-size 16 \
    --lr 0.00015 \
    --train-iters 20 \
    --lr-decay-iters 320 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --attention-softmax-in-fp32 \
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --split 949,50,1
"

OUTPUT_ARGS="
    --log-interval 20 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 1
"

options=" \
    --untie-embeddings-and-output-weights \
    --enable-zb-runtime \
    --no-pre-communication-optimization \
    --enable-optimizer-post-validation \
    --transformer-impl local \
    --no-create-attention-mask-in-dataloader \
    --use-legacy-models
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    $options \
    --distributed-backend nccl
    # --save $CHECKPOINT_PATH \
    # --load $CHECKPOINT_PATH
```

Code used to print the schedule:

```python
rank = torch.distributed.get_rank()
print(f'##### rank {rank} schedules:')
for node in conf.schedules:
    print(f"{node.type}-{node.microbatch}-{node.chunk}-{node.seq_split_idx}")
```
