Description
I tried to calculate the theoretical duration of FA and the routed MLP, but my results are much smaller than the data in the profiling file. I'm not sure whether I miscalculated or whether there is something I'm not taking into account.
BatchSize = 128 / 2 = 64
HDim = 7168
FFDim = 2048
NumHead = 128 / TP = 128
Q_HeadDim = 512 + 64 = 576
K_HeadDim = 512 + 64 = 576
V_HeadDim = 512
KV_SeqLen = 4096
ExpertNum = 256 / EP = 2
TopK = 8
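For reference, the per-GPU parameters above can be written out as a small Python snippet (variable names are mine; the parallelism splits are taken as stated in the list):

```python
# Per-GPU parameters as listed above; names are mine.
batch_size = 128 // 2    # per-GPU batch = 64
h_dim, ff_dim = 7168, 2048
num_head = 128           # attention heads per GPU after the TP split
q_head_dim = 512 + 64    # 576; same for K
v_head_dim = 512
kv_seq_len = 4096
expert_num = 2           # ExpertNum = 256 / EP
top_k = 8
print(batch_size, q_head_dim, expert_num)
```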
-
FA (Computation bound):
input_dim:
Q [Batch, NumHead, 1, Q_HeadDim] = [64, 128, 1, 576]
K [Batch, 1, KV_SeqLen, K_HeadDim] = [64, 1, 4096, 576]
V [Batch, 1, KV_SeqLen, V_HeadDim] = [64, 1, 4096, 512]
For S = Q * K^T, FLOPs = 64 * 128 * 1 * 576 * 4096 * 2 = 36 GFLOPs, out_dim = [64, 128, 1, 4096];
For O = Score * V, FLOPs = 64 * 128 * 1 * 4096 * 512 * 2 = 32 GFLOPs.
Total FLOPs = 68 GFLOPs.
Using the 580 TFLOPS mentioned in FlashMLA, duration = 68 / (580 * 1024) = 114 us, while the avg duration in the profiling file is 240 us.
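As a sanity check on the arithmetic, here is the same FA estimate as a Python sketch (note the figures only line up in binary units, i.e. 1 GFLOP = 2^30 FLOPs and 1 TFLOPS = 2^40 FLOP/s; variable names are mine):

```python
# FA decode-step FLOPs estimate; one MAC = 2 FLOPs per output element per reduced dim.
batch, num_head = 64, 128
q_head_dim, v_head_dim, kv_seq_len = 576, 512, 4096

flops_qk = batch * num_head * 1 * q_head_dim * kv_seq_len * 2  # S = Q @ K^T
flops_sv = batch * num_head * 1 * kv_seq_len * v_head_dim * 2  # O = Score @ V

total_gflops = (flops_qk + flops_sv) / 2**30                 # binary GFLOPs
duration_us = (flops_qk + flops_sv) / (580 * 2**40) * 1e6    # at 580 TFLOPS
print(total_gflops, duration_us)  # -> 68.0, ~114 us
```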
MLP:
Suppose all tokens are evenly assigned to the 256 experts; then TokenNum = Batch * TopK = 512 tokens per GPU.
Up+Gate: X: [TokenNum, HDim] = [512, 7168], W: [ExpertNum, HDim, 2 * FFDim] = [2, 7168, 4096],
Down: X: [TokenNum, FFDim] = [512, 2048], W: [ExpertNum, FFDim, HDim] = [2, 2048, 7168].
All input tensors are FP8; the Up+Gate output is FP32 for a high-precision SiLU computation, and the Down output is FP16.
The estimated input IO bytes = 512 * 7168 + 2 * 7168 * 4096 + 512 * 2048 + 2 * 2048 * 7168 = 88.5 MB,
output IO bytes = 512 * 4096 * 4 + 512 * 7168 * 2 = 15 MB.
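The IO estimate can be reproduced with a short sketch (element sizes: FP8 = 1 byte for inputs and weights, FP32 = 4 bytes, FP16 = 2 bytes; names are mine, MB here means MiB):

```python
# MLP IO-byte estimate for the per-GPU grouped GEMMs described above.
token_num, h_dim, ff_dim, expert_num = 512, 7168, 2048, 2

input_bytes = (token_num * h_dim                  # Up+Gate activation X, FP8
               + expert_num * h_dim * 2 * ff_dim  # Up+Gate weight W, FP8
               + token_num * ff_dim               # Down activation X, FP8
               + expert_num * ff_dim * h_dim)     # Down weight W, FP8
output_bytes = (token_num * 2 * ff_dim * 4        # Up+Gate output, FP32
                + token_num * h_dim * 2)          # Down output, FP16

print(input_bytes / 2**20, output_bytes / 2**20)  # -> 88.5, 15.0
```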
Notice that the two weight tensors alone account for 84 MB, so the influence of load imbalance is relatively low.
The avg durations of Up+Gate and Down in profiling file are 78 us and 37 us, respectively.
So the estimated memory bandwidth is (88.5 + 15) / (78 + 37) = 879 GB/s, while the memory bandwidth reported by DeepGEMM is 2000 GB/s.
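The implied-bandwidth arithmetic, for completeness (binary MB/GB throughout, matching the figures above; names are mine):

```python
# Achieved bandwidth implied by the profiled kernel durations.
io_mb = 88.5 + 15.0   # total input + output IO, MB (binary)
time_us = 78 + 37     # profiled Up+Gate + Down durations, us

bandwidth_gbs = io_mb / 1024 / (time_us * 1e-6)
print(bandwidth_gbs)  # -> ~879 GB/s
```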