Suppose R=2, S=2, C=128, and ThreadBlockShape=<128,128,64>.
When the main loop iterates over the implicit GEMM_K dimension, the memory access sequence in CUTLASS is: r=0,s=0,c=0-63; r=0,s=1,c=0-63; r=1,s=0,c=0-63; r=1,s=1,c=0-63; r=0,s=0,c=64-127; r=0,s=1,c=64-127; r=1,s=0,c=64-127; r=1,s=1,c=64-127.
Why not use this order instead: r=0,s=0,c=0-63; r=0,s=0,c=64-127; r=0,s=1,c=0-63; r=0,s=1,c=64-127; r=1,s=0,c=0-63; r=1,s=0,c=64-127; r=1,s=1,c=0-63; r=1,s=1,c=64-127? Wouldn't it be a better memory access strategy to iterate over C first, finishing all channels of one filter position (r, s) before moving to the next?
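
To make the two orderings concrete, here is a minimal Python sketch that just enumerates the GEMM_K iterations for the sizes in this question. It is not CUTLASS code: the function names and `TILE_K` are hypothetical names chosen for illustration, and it assumes each main-loop iteration covers one (r, s) filter position and one 64-channel slice.

```python
R, S, C = 2, 2, 128   # filter extents and input channels from the question
TILE_K = 64           # K extent of the threadblock tile <128,128,64>


def channel_slices():
    """Inclusive (first, last) channel ranges covered by one K tile each."""
    return [(c, c + TILE_K - 1) for c in range(0, C, TILE_K)]


def rs_inner_order():
    """Channel slice outermost, s fastest: the order observed in the question."""
    return [(r, s, c)
            for c in channel_slices()
            for r in range(R)
            for s in range(S)]


def c_inner_order():
    """(r, s) outermost, channel slice fastest: the proposed alternative."""
    return [(r, s, c)
            for r in range(R)
            for s in range(S)
            for c in channel_slices()]


if __name__ == "__main__":
    for name, order in (("rs-inner", rs_inner_order()),
                        ("c-inner", c_inner_order())):
        print(name)
        for r, s, (c0, c1) in order:
            print(f"  r={r}, s={s}, c={c0}-{c1}")
```

Both functions visit the same eight (r, s, channel-slice) tiles; only the nesting of the loops differs, which is exactly the difference being asked about.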