Hi, I ran the LtMxfp8Matmul example on a 5090. For a 4096x4096x4096 matmul, the kernel takes only a little over 210 us, which works out to more than 600 TFLOPS. However, the 5090's peak FP8 throughput is 419 TFLOPS. Is something wrong?
The kernel name is:
cutlass3x_sm120_bstensorop_s16832gemm_block_scaled_ue8m0xe4m3_ue8m0xe4m3_f32_bf16_ue8m0xe4m3_128x128x128_1x1x1_0_tnn_align16_q_bias_bf16_relu_epiVs32n
I think the accumulator type should be FP32, right?
PS: I got this profiling result using Nsight Systems.
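
For reference, the arithmetic behind the 600+ TFLOPS figure can be sketched as below. This is only a rough check, assuming the usual 2*M*N*K FLOP count for a matmul (one multiply and one add per multiply-accumulate) and a duration of exactly 210 us, whereas the measured time was slightly longer. The helper name `matmul_tflops` is hypothetical and not part of the sample.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for an MxNxK matmul.

    Assumes 2*M*N*K floating-point operations, i.e. one multiply
    and one add per multiply-accumulate.
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e12


# Problem size and kernel duration taken from the post (210 us is rounded down).
achieved = matmul_tflops(4096, 4096, 4096, 210e-6)

# Peak FP8 figure as quoted in the post, not independently verified here.
quoted_peak = 419.0

print(f"achieved: {achieved:.1f} TFLOP/s")
print(f"ratio to quoted peak: {achieved / quoted_peak:.2f}x")
```

With these inputs the result is roughly 654 TFLOP/s, about 1.5 times the quoted peak, which is the discrepancy being asked about.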
