Description
Issue
The CUTLASS GemmDevice operator carries compile-time attributes, both functional and performance. The GemmDevice operator is consumed by GemmOperation[3xBase]. In the past, I have found some of the values in this data structure to be incorrect, and others to be missing entirely.
For example, consider the kernel whose full procedural name is cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.
This kernel has the following functional and performance attributes:
Functional Attributes
- dtypeA_dtypeB_dtypeAccumulation_dtypeC_dtypeD : e4m3_e4m3_f32_bf16_bf16
- RowMajor_ColumnMajor_ColumnMajor : tnn
Performance Attributes
- Instruction Shape : 64x128x32
- Threadblock Shape : 128x128x128
- Cluster Shape : 1x2x1
- Mainloop Kind : warpspecialized_cooperative // This is missing from GemmDescription
- Epilogue Kind : epi_tma // This is missing from GemmDescription
- AccumulationKind : "fastaccum" or default "" // This is missing from GemmDescription
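To make the mapping from name to attributes concrete, here is a minimal sketch that splits the procedural name above into the fields listed. The regular expression is inferred from this one kernel name only; it is not CUTLASS's own parser and may not cover other kernels. The function name `parse_procedural_name`, the field names, the reading of the `_0_` segment as a stage count, and the rule that "fastaccum" appears inside the mainloop segment are all assumptions made for illustration.

```python
import re

KERNEL_NAME = (
    "cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_"
    "128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma"
)

# Hypothetical pattern inferred from the single kernel name in this issue.
PROCEDURAL_NAME = re.compile(
    r"^cutlass3x_sm(?P<arch>\d+)_(?P<opclass>[a-z]+)_"
    r"[a-z]*(?P<instruction_shape>\d+x\d+x\d+)gemm_"
    # Five data types: A, B, accumulation, C, D.
    r"(?P<dtypes>[a-z0-9]+(?:_[a-z0-9]+){4})_"
    r"(?P<threadblock_shape>\d+x\d+x\d+)_"
    r"(?P<cluster_shape>\d+x\d+x\d+)_"
    # Assumption: the lone integer after the cluster shape is a stage count.
    r"(?P<stages>\d+)_"
    r"(?P<layout>[tn]{3})_align(?P<alignment>\d+)"
    # Optional mainloop segment; it must not itself start with "epi_".
    r"(?:_(?P<mainloop_kind>(?!epi_).+?))?"
    # Optional epilogue segment such as "epi_tma".
    r"(?:_(?P<epilogue_kind>epi_[a-z0-9]+))?$"
)


def parse_procedural_name(name):
    """Split a procedural kernel name into a dict of attribute strings."""
    match = PROCEDURAL_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized kernel name: {name}")
    attrs = match.groupdict()
    # Assumption: fast accumulation shows up as "fastaccum" in the mainloop
    # segment; the default accumulation kind is the empty string.
    mainloop = attrs["mainloop_kind"] or ""
    attrs["accumulation_kind"] = "fastaccum" if "fastaccum" in mainloop else ""
    return attrs


attrs = parse_procedural_name(KERNEL_NAME)
for key in ("mainloop_kind", "epilogue_kind", "accumulation_kind"):
    print(key, "=", repr(attrs[key]))
```

The last three keys are the ones flagged as missing from GemmDescription, so a test could compare them against whatever the description reports once those members exist.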
I have added a test so that someone at NVIDIA can start on this. Could you please uncomment the two lines and add whatever is needed to fix this?
You can follow the pattern of any other enum that is lifted from the internal templates up to the GemmDevice operator and used to set the data members of the GemmDescription class.
We can then commit this test, add more tests for Hopper, and make sure this class is also covered for Blackwell. The tests are CPU-only and should not take much time in CI, and they will allow us to catch bugs like this one.
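As a rough illustration of the kind of CPU-only check meant here, the sketch below flags enum-valued members of a description that were never set. Everything in it is a hypothetical stand-in written for this example: `DescriptionStandIn`, `MainloopKind`, `EpilogueKind`, and `unset_fields` are not CUTLASS types or functions, and the real test would inspect the actual GemmDescription instead.

```python
from dataclasses import dataclass, fields
from enum import Enum


class MainloopKind(Enum):  # hypothetical enum, not a CUTLASS type
    UNKNOWN = "unknown"
    WARPSPECIALIZED_COOPERATIVE = "warpspecialized_cooperative"


class EpilogueKind(Enum):  # hypothetical enum, not a CUTLASS type
    UNKNOWN = "unknown"
    EPI_TMA = "epi_tma"


@dataclass
class DescriptionStandIn:  # hypothetical stand-in for a kernel description
    cluster_shape: tuple
    mainloop_kind: MainloopKind = MainloopKind.UNKNOWN
    epilogue_kind: EpilogueKind = EpilogueKind.UNKNOWN


def unset_fields(desc):
    """Return the names of enum members still left at UNKNOWN."""
    return [
        f.name
        for f in fields(desc)
        if isinstance(getattr(desc, f.name), Enum)
        and getattr(desc, f.name).name == "UNKNOWN"
    ]


# A description where the mainloop and epilogue kinds were never populated.
incomplete = DescriptionStandIn(cluster_shape=(1, 2, 1))
# The same description with both kinds populated.
complete = DescriptionStandIn(
    (1, 2, 1),
    MainloopKind.WARPSPECIALIZED_COOPERATIVE,
    EpilogueKind.EPI_TMA,
)

print(unset_fields(incomplete))
print(unset_fields(complete))
```

A check of this shape needs no GPU, so it can run on every kernel configuration in CI at little cost.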