
Decode FA and MLP duration inconsistent with the theoretical value #15

@byf1999


I tried to calculate the theoretical durations of FA and the routed MLP, but the results are much smaller than the values in the profiling file. I'm not sure whether I miscalculated or whether there is something I have not taken into consideration.

BatchSize = 128 / 2 = 64
HDim = 7168
FFDim = 2048
NumHead = 128 / TP = 128
Q_HeadDim = 512 + 64 = 576
K_HeadDim = 512 + 64 = 576
V_HeadDim = 512
KV_SeqLen = 4096
ExpertNum = 256 / EP = 2
TopK = 8

  1. FA (compute-bound):
    Input dims:
    Q [Batch, NumHead, 1, Q_HeadDim] = [64, 128, 1, 576]
    K [Batch, 1, KV_SeqLen, K_HeadDim] = [64, 1, 4096, 576]
    V [Batch, 1, KV_SeqLen, V_HeadDim] = [64, 1, 4096, 512]
    For S = Q * K^T, FLOPs = 64 * 128 * 1 * 576 * 4096 * 2 = 36 GFLOPs, out_dim = [64, 128, 1, 4096]
    For O = S * V, FLOPs = 64 * 128 * 1 * 4096 * 512 * 2 = 32 GFLOPs,
    Total FLOPs = 68 GFLOPs (with 1 G = 1024^3),
    Using the 580 TFLOPS mentioned in FlashMLA, duration = 68 / (580 * 1024) s = 114 us, while the avg duration in the profiling file is 240 us;

  2. MLP:
    Suppose all tokens are evenly distributed across the 256 experts; then TokenNum = Batch * TopK = 512 tokens per GPU,
    Up+Gate: X: [TokenNum, HDim] = [512, 7168], W: [ExpertNum, HDim, 2 * FFDim] = [2, 7168, 4096],
    Down: X: [TokenNum, FFDim] = [512, 2048], W: [ExpertNum, FFDim, HDim] = [2, 2048, 7168].
    All the input tensors are FP8; the output tensor of Up+Gate is FP32 (for high-precision calculation in SiLU), and the output tensor of Down is FP16.
    The estimated input IO bytes = 512 * 7168 + 2 * 7168 * 4096 + 512 * 2048 + 2 * 2048 * 7168 = 88.5 MB
    output IO bytes = 512 * 4096 * 4 + 512 * 7168 * 2 = 15 MB.
    Note that the two weight tensors account for 84 MB of the input, so the influence of load imbalance is relatively low.
    The avg durations of Up+Gate and Down in the profiling file are 78 us and 37 us, respectively.
    So the estimated memory bandwidth is (88.5 + 15) MB / (78 + 37) us = 879 GB/s, while the memory bandwidth in DeepGemm is 2000 GB/s.
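
For anyone who wants to re-check the arithmetic above, here is a minimal sketch that recomputes both estimates from the listed shapes. It assumes binary prefixes (1 G = 1024^3, 1 M = 1024^2), which is what the numbers in the post imply, and perfectly even routing across experts. The function and constant names are made up for illustration; the 580 TFLOPS peak and the 78 us / 37 us kernel times are simply the values quoted above.

```python
# Sanity check of the FA and MLP estimates in this post.
# Assumption: "G" and "M" are binary prefixes (1024-based), which is what
# the post's arithmetic implies (e.g. 68 / (580 * 1024)).

GI = 1024 ** 3  # 1 Gi
MI = 1024 ** 2  # 1 Mi

# Shapes copied from the parameter list in the post.
BATCH = 128 // 2
HDIM = 7168
FFDIM = 2048
NUM_HEAD = 128
QK_HEAD_DIM = 512 + 64
V_HEAD_DIM = 512
KV_SEQ_LEN = 4096
EXPERTS_PER_GPU = 2
TOP_K = 8


def fa_estimate(peak_tflops=580.0):
    """Return (total GFLOPs, ideal duration in us) for decode attention."""
    qk = 2 * BATCH * NUM_HEAD * 1 * QK_HEAD_DIM * KV_SEQ_LEN  # S = Q * K^T
    pv = 2 * BATCH * NUM_HEAD * 1 * KV_SEQ_LEN * V_HEAD_DIM   # O = S * V
    total = qk + pv
    seconds = total / (peak_tflops * 1024 * GI)
    return total / GI, seconds * 1e6


def mlp_estimate(up_gate_us=78.0, down_us=37.0):
    """Return (input MB, output MB, weight MB, implied GB/s) for the routed MLP."""
    tokens = BATCH * TOP_K  # 512 tokens per GPU under perfectly even routing
    w_up_gate = EXPERTS_PER_GPU * HDIM * 2 * FFDIM  # FP8, 1 byte per element
    w_down = EXPERTS_PER_GPU * FFDIM * HDIM         # FP8, 1 byte per element
    in_bytes = tokens * HDIM + w_up_gate + tokens * FFDIM + w_down
    out_bytes = tokens * 2 * FFDIM * 4 + tokens * HDIM * 2  # FP32, then FP16
    seconds = (up_gate_us + down_us) * 1e-6
    gb_per_s = (in_bytes + out_bytes) / GI / seconds
    return in_bytes / MI, out_bytes / MI, (w_up_gate + w_down) / MI, gb_per_s


if __name__ == "__main__":
    gflops, fa_us = fa_estimate()
    print(f"FA: {gflops:.1f} GFLOPs, ideal {fa_us:.1f} us")
    in_mb, out_mb, w_mb, bw = mlp_estimate()
    print(f"MLP: in {in_mb:.1f} MB, out {out_mb:.1f} MB, "
          f"weights {w_mb:.1f} MB, {bw:.0f} GB/s")
```

Under these assumptions the script reproduces the figures in the post, so the gap to the profiled 240 us and to the 2000 GB/s figure does not come from an arithmetic slip in the formulas as written.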
