What is your question?
I get an internal CUTLASS error when I increase the warp count for the kernel "cutlass_simt_hgemm_256x128_8x2_nt_align1" to values other than the default 4x2x1 (by changing the warp shape accordingly in the generated kernel). In cutlass_profiler, this shows up as a Disposition failure. These are the warp counts I tried:
- 4x4x1
- 8x2x1
- 8x4x1
- 16x2x1
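
For reference, here is a minimal sketch of the warp tile and thread count that each of these warp counts implies. It assumes the threadblock tile is 256x128x8 (read off the kernel name) and that the warp tile is simply the threadblock tile divided evenly by the warp count. `warp_tile` and `threads_per_block` are hypothetical helpers for this arithmetic, not CUTLASS APIs.

```python
# Threadblock tile (M, N, K), assumed from the kernel name "..._256x128_8x2_...".
THREADBLOCK_TILE = (256, 128, 8)
WARP_SIZE = 32                 # threads per warp on NVIDIA GPUs
MAX_THREADS_PER_BLOCK = 1024   # CUDA limit on threads per block


def warp_tile(threadblock_tile, warp_count):
    """Divide the threadblock tile evenly across the warp grid (M, N, K)."""
    for extent, count in zip(threadblock_tile, warp_count):
        if extent % count != 0:
            raise ValueError(f"{extent} is not divisible by warp count {count}")
    return tuple(extent // count for extent, count in zip(threadblock_tile, warp_count))


def threads_per_block(warp_count):
    """Total threads launched per threadblock for a given warp count."""
    m, n, k = warp_count
    return m * n * k * WARP_SIZE


if __name__ == "__main__":
    # The default warp count followed by the four values tried above.
    for wc in [(4, 2, 1), (4, 4, 1), (8, 2, 1), (8, 4, 1), (16, 2, 1)]:
        tile = warp_tile(THREADBLOCK_TILE, wc)
        threads = threads_per_block(wc)
        within_limit = threads <= MAX_THREADS_PER_BLOCK
        print(f"warp count {wc}: warp tile {tile}, {threads} threads, "
              f"within block limit: {within_limit}")
```

Under these assumptions, the default 4x2x1 gives a 64x64x8 warp tile with 256 threads, while 8x4x1 and 16x2x1 both reach 1024 threads per block, so the thread-count limit alone does not seem to explain the failure.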

How should I debug this? Is there any documentation on how the older SIMT GEMM kernels work?
We want more warps scheduled per sub-core. Are there any insights into how to achieve this, apart from just making the warp tiles smaller?
PS: When I increase the warp count for the kernel "cutlass_simt_sgemm_128x128_8x2_nt_align1", everything works fine.