Description
I tried to calculate the theoretical duration of FA and the routed MLP, but my results are much smaller than the data in the profiling file. I'm not sure whether I miscalculated or whether there is something I'm not taking into account.
BatchSize = 128 / 2 = 64
HDim = 7168
FFDim = 2048
NumHead = 128 / TP = 128
Q_HeadDim = 512 + 64 = 576
K_HeadDim = 512 + 64 = 576
V_HeadDim = 512
KV_SeqLen = 4096
ExpertNum = 256 / EP = 2
TopK = 8
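For reference, the per-GPU parameters above can be written out as a small Python snippet (variable names are mine; the parallelism splits are taken as stated in the list):

```python
# Per-GPU parameters as listed above; names are mine.
batch_size = 128 // 2    # per-GPU batch = 64
h_dim, ff_dim = 7168, 2048
num_head = 128           # attention heads per GPU after the TP split
q_head_dim = 512 + 64    # 576; same for K
v_head_dim = 512
kv_seq_len = 4096
expert_num = 2           # ExpertNum = 256 / EP
top_k = 8
print(batch_size, q_head_dim, expert_num)
```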
-
FA (Computation bound):
input_dim:
Q [Batch, NumHead, 1, Q_HeadDim] = [64, 128, 1, 576]
K [Batch, 1, KV_SeqLen, K_HeadDim] = [64, 1, 4096, 576]
V [Batch, 1, KV_SeqLen, V_HeadDim] = [64, 1, 4096, 512]
For S = Q * K^T, FLOPs = 64 * 128 * 1 * 576 * 4096 * 2 = 36 GFLOPs, out_dim = [64, 128, 1, 4096];
For O = Score * V, FLOPs = 64 * 128 * 1 * 4096 * 512 * 2 = 32 GFLOPs.
Total FLOPs = 68 GFLOPs.
Using the 580 TFLOPS mentioned in FlashMLA, duration = 68 / (580 * 1024) = 114 us, while the avg duration in the profiling file is 240 us.
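As a sanity check on the arithmetic, here is the same FA estimate as a Python sketch (note the figures only line up in binary units, i.e. 1 GFLOP = 2^30 FLOPs and 1 TFLOPS = 2^40 FLOP/s; variable names are mine):

```python
# FA decode-step FLOPs estimate; one MAC = 2 FLOPs per output element per reduced dim.
batch, num_head = 64, 128
q_head_dim, v_head_dim, kv_seq_len = 576, 512, 4096

flops_qk = batch * num_head * 1 * q_head_dim * kv_seq_len * 2  # S = Q @ K^T
flops_sv = batch * num_head * 1 * kv_seq_len * v_head_dim * 2  # O = Score @ V

total_gflops = (flops_qk + flops_sv) / 2**30                 # binary GFLOPs
duration_us = (flops_qk + flops_sv) / (580 * 2**40) * 1e6    # at 580 TFLOPS
print(total_gflops, duration_us)  # -> 68.0, ~114 us
```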
MLP:
Suppose all tokens are evenly assigned to the 256 experts; then TokenNum = Batch * TopK = 512 tokens per GPU.
Up+Gate: X: [TokenNum, HDim] = [512, 7168], W: [ExpertNum, HDim, 2 * FFDim] = [2, 7168, 4096],
Down: X: [TokenNum, FFDim] = [512, 2048], W: [ExpertNum, FFDim, HDim] = [2, 2048, 7168].
All input tensors are FP8; the Up+Gate output is FP32 for a high-precision SiLU computation, and the Down output is FP16.
The estimated input IO bytes = 512 * 7168 + 2 * 7168 * 4096 + 512 * 2048 + 2 * 2048 * 7168 = 88.5 MB,
output IO bytes = 512 * 4096 * 4 + 512 * 7168 * 2 = 15 MB.
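The IO estimate can be reproduced with a short sketch (element sizes: FP8 = 1 byte for inputs and weights, FP32 = 4 bytes, FP16 = 2 bytes; names are mine, MB here means MiB):

```python
# MLP IO-byte estimate for the per-GPU grouped GEMMs described above.
token_num, h_dim, ff_dim, expert_num = 512, 7168, 2048, 2

input_bytes = (token_num * h_dim                  # Up+Gate activation X, FP8
               + expert_num * h_dim * 2 * ff_dim  # Up+Gate weight W, FP8
               + token_num * ff_dim               # Down activation X, FP8
               + expert_num * ff_dim * h_dim)     # Down weight W, FP8
output_bytes = (token_num * 2 * ff_dim * 4        # Up+Gate output, FP32
                + token_num * h_dim * 2)          # Down output, FP16

print(input_bytes / 2**20, output_bytes / 2**20)  # -> 88.5, 15.0
```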
Notice that the two weight tensors alone account for 84 MB, so the influence of load imbalance is relatively low.
The avg durations of Up+Gate and Down in profiling file are 78 us and 37 us, respectively.
So the estimated memory bandwidth is (88.5 + 15) / (78 + 37) = 879 GB/s, while the memory bandwidth reported by DeepGEMM is 2000 GB/s.
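The implied-bandwidth arithmetic, for completeness (binary MB/GB throughout, matching the figures above; names are mine):

```python
# Achieved bandwidth implied by the profiled kernel durations.
io_mb = 88.5 + 15.0   # total input + output IO, MB (binary)
time_us = 78 + 37     # profiled Up+Gate + Down durations, us

bandwidth_gbs = io_mb / 1024 / (time_us * 1e-6)
print(bandwidth_gbs)  # -> ~879 GB/s
```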