What is your question?
How does the CUTLASS profiler measure GEMM performance?
I ask because, for small matrices, the large L2 cache on the GPU can strongly affect the measured computation time: the operands may stay resident in the cache across repeated runs. The CUTLASS profiler's measurements seem to avoid this interference.
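
To make the concern concrete, here is a rough sketch of the arithmetic I have in mind. It is my own illustration and not taken from the CUTLASS profiler. The function names, the 2-byte (FP16) element size, and the 40 MiB L2 size are all assumptions. It estimates how many bytes one GEMM touches, and how many separate sets of A, B, and C a benchmark would have to rotate through so that the sets cannot all be resident in L2 at once.

```python
def gemm_working_set_bytes(m, n, k, elem_bytes=2):
    """Bytes touched by one GEMM: A (m x k), B (k x n), and C (m x n).

    elem_bytes=2 assumes FP16 operands; change it for other types.
    """
    return (m * k + k * n + m * n) * elem_bytes


def rotating_copies_needed(m, n, k, l2_bytes, elem_bytes=2):
    """Smallest number of operand sets whose combined size exceeds l2_bytes."""
    working_set = gemm_working_set_bytes(m, n, k, elem_bytes)
    return l2_bytes // working_set + 1


if __name__ == "__main__":
    l2_bytes = 40 * 1024 * 1024  # assumed L2 size; query the real device value
    for size in (256, 1024, 8192):
        print(size,
              gemm_working_set_bytes(size, size, size),
              rotating_copies_needed(size, size, size, l2_bytes))
```

With these assumptions, a 256 x 256 x 256 FP16 problem touches well under 1 MiB, so a loop that reuses the same buffers would likely hit in L2 on every iteration after the first. Exceeding the L2 size is only a rough lower bound and does not by itself guarantee cold-cache timings.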