[QST] [CuTeDSL] Redundant wait in Hopper example

In the Hopper example we make the following call (at L859):

```python
cute.nvgpu.warpgroup.wait_group(k_pipe_mmas)
```

As I understand it, this wait is not necessary, because we already wait for all the committed wgmma instructions before we run the epilogue (at L918):

```python
cute.nvgpu.warpgroup.wait_group(0)
```
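
To make my reasoning concrete, here is a toy host-side model of the group bookkeeping. It is only a sketch of my understanding, not CuTe DSL code: `WgmmaGroupModel`, `run_mainloop`, `k_tiles`, and `max_in_flight` are hypothetical names, and I am assuming that the first wait sits inside a loop that commits one group per iteration, and that `wait_group(n)` returns once at most `n` of the most recently committed groups are still pending.

```python
from collections import deque


class WgmmaGroupModel:
    """Toy model of commit/wait bookkeeping (hypothetical, not the CuTe DSL API)."""

    def __init__(self):
        self.pending = deque()   # ids of committed groups that have not retired
        self.next_id = 0
        self.max_in_flight = 0   # largest number of groups pending at once

    def commit_group(self):
        self.pending.append(self.next_id)
        self.next_id += 1
        self.max_in_flight = max(self.max_in_flight, len(self.pending))

    def wait_group(self, n):
        # Assumed semantics: retire the oldest groups until at most n remain.
        while len(self.pending) > n:
            self.pending.popleft()


def run_mainloop(k_tiles, k_pipe_mmas, wait_in_loop):
    model = WgmmaGroupModel()
    for _ in range(k_tiles):
        model.commit_group()
        if wait_in_loop:
            model.wait_group(k_pipe_mmas)  # the wait I am asking about
    model.wait_group(0)                    # the wait before the epilogue
    return model


with_wait = run_mainloop(k_tiles=8, k_pipe_mmas=1, wait_in_loop=True)
without_wait = run_mainloop(k_tiles=8, k_pipe_mmas=1, wait_in_loop=False)

print(len(with_wait.pending), with_wait.max_in_flight)        # 0 2
print(len(without_wait.pending), without_wait.max_in_flight)  # 0 8
```

In this model both variants have zero pending groups after the final `wait_group(0)`; the only difference it shows is how many groups are in flight during the loop. It does not model shared-memory buffer reuse or anything else the in-loop wait might be protecting.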