LtMxfp8matmul tflops issue #283

@MARD1NO

Hi, I ran the LtMxfp8Matmul example on an RTX 5090. For a 4096x4096x4096 matmul, the kernel takes only about 210 us, which works out to more than 600 TFLOPS. But the 5090's peak FP8 throughput is 419 TFLOPS. Is something wrong?
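For reference, the throughput figure comes from the usual GEMM operation count of 2*M*N*K (one multiply and one add per inner-product term) divided by the kernel time. This is a minimal sketch of that arithmetic. The helper name `gemm_tflops` is made up for illustration, and the 210 us and 419 TFLOPS values are the ones quoted in this issue, not independently measured.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return achieved TFLOPS for an m x n x k GEMM that ran in `seconds`.

    Assumes the conventional count of 2*m*n*k floating-point operations
    (one multiply plus one add per inner-product term).
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    measured_us = 210.0   # kernel duration quoted above, in microseconds
    quoted_peak = 419.0   # peak FP8 TFLOPS figure quoted above

    achieved = gemm_tflops(4096, 4096, 4096, measured_us * 1e-6)
    print(f"achieved: {achieved:.1f} TFLOPS")
    print(f"ratio to quoted peak: {achieved / quoted_peak:.2f}x")
```

At exactly 210 us this gives roughly 654 TFLOPS, so the measured time would have to be about 1.5x longer to fall under the quoted peak.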

The kernel name is:

cutlass3x_sm120_bstensorop_s16832gemm_block_scaled_ue8m0xe4m3_ue8m0xe4m3_f32_bf16_ue8m0xe4m3_128x128x128_1x1x1_0_tnn_align16_q_bias_bf16_relu_epiVs32n

I think the accumulation type should be FP32, right?

Image

PS: I got this profiling result using Nsight Systems.

Image
