In the Hopper example we call (L859):
cute.nvgpu.warpgroup.wait_group(k_pipe_mmas)
As I understand it, this is not necessary, because we wait for all committed wgmma instructions before we perform the epilogue (L918):
cute.nvgpu.warpgroup.wait_group(0)
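
To make the question concrete, here is a toy host-side model of how I understand the two calls. It is only a sketch: it assumes wait_group(N) blocks until at most N committed groups are still pending, and the names WgmmaGroupModel, commit_group and run are hypothetical, not part of the CuTe DSL. It does not model shared-memory buffer reuse or anything else the example does around those calls.

```python
from collections import deque


class WgmmaGroupModel:
    """Toy model of committed wgmma groups (hypothetical, not the CuTe DSL API)."""

    def __init__(self):
        self.pending = deque()  # committed groups still in flight, oldest first
        self.retired = []       # groups known to have completed, in order

    def commit_group(self, tag):
        # Model committing one batch of wgmma instructions as a group.
        self.pending.append(tag)

    def wait_group(self, n):
        # Assumed semantics: return once at most n groups remain pending.
        while len(self.pending) > n:
            self.retired.append(self.pending.popleft())


def run(k_tiles, k_pipe_mmas, wait_in_loop):
    """Commit one group per k-tile, optionally waiting inside the loop."""
    model = WgmmaGroupModel()
    max_in_flight = 0
    for k in range(k_tiles):
        model.commit_group(k)
        if wait_in_loop:
            model.wait_group(k_pipe_mmas)  # stands in for the call at L859
        max_in_flight = max(max_in_flight, len(model.pending))
    model.wait_group(0)  # stands in for the call at L918, before the epilogue
    return model.retired, max_in_flight


if __name__ == "__main__":
    print(run(8, 1, wait_in_loop=True))
    print(run(8, 1, wait_in_loop=False))
```

In this model every group is retired by the time the epilogue starts in both variants; the only difference is how many groups are in flight at once inside the loop.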