What is your question?
I wrote a kernel based on CuTe and CUTLASS that uses cutlass::arch::warpgroup_reg_alloc and wgmma.mma_async.
The kernel is defined in kernel.cuh, which is included in src1.cu. Because my project contains multiple .cu files (src2.cu, src3.cu, and so on), I need to compile with relocatable device code enabled (rdc=True).
However, the following warnings appear during compilation:
ptxas info : (C7504) Potential Performance Loss: 'setmaxnreg' ignored to maintain compatibility across compilation units.
ptxas info : (C7509) Potential Performance Loss: wgmma.mma_async instructions are serialized due to the presence of Extern calls in the function
I have confirmed that the -DNDEBUG flag is set. I am using CUDA 12.8 on an H20 GPU.
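For reference, the layout described above can be sketched roughly as follows. This is only an illustrative sketch, not the actual kernel: the kernel name example_kernel, the launch helper, the register counts (40 and 232), the block size, and the build lines in the comments are all assumptions made for the example. Only kernel.cuh, src1.cu, cutlass::arch::warpgroup_reg_alloc, and the rdc setting come from the description.

```cuda
// ---- kernel.cuh (sketch; kernel name and register counts are hypothetical) ----
#pragma once
#include <cutlass/arch/reg_reconfig.h>  // warpgroup_reg_alloc / warpgroup_reg_dealloc

// Three warpgroups of 128 threads: one producer, two consumers (assumed split).
template <int kProducerRegs = 40, int kConsumerRegs = 232>
__global__ void __launch_bounds__(384, 1) example_kernel(float* out) {
  const int warp_group = threadIdx.x / 128;
  if (warp_group == 0) {
    // Producer warpgroup gives registers back (emits setmaxnreg.dec).
    cutlass::arch::warpgroup_reg_dealloc<kProducerRegs>();
    // ... loads into shared memory would go here ...
  } else {
    // Consumer warpgroups request more registers (emits setmaxnreg.inc).
    cutlass::arch::warpgroup_reg_alloc<kConsumerRegs>();
    // ... wgmma.mma_async issued through CuTe MMA atoms would go here ...
  }
  if (threadIdx.x == 0) {
    out[0] = 1.0f;  // placeholder so the sketch has an observable effect
  }
}

// ---- src1.cu (sketch) ----
// #include "kernel.cuh"
// void launch(float* out) { example_kernel<><<<1, 384>>>(out); }

// ---- build (assumed command lines; the include path is a placeholder) ----
// nvcc -arch=sm_90a -rdc=true -DNDEBUG -I<cutlass>/include -c src1.cu -o src1.o
// nvcc -arch=sm_90a -rdc=true -DNDEBUG -c src2.cu -o src2.o
// nvcc -arch=sm_90a src1.o src2.o -o app
```

The two ptxas messages are reported while compiling the translation unit that instantiates the kernel (src1.cu in this sketch) with relocatable device code enabled.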