Why is there no sync after loading from TMEM to RMEM in example 02_mma_tma_sm100.cu?

According to the PTX documentation, `tcgen05.ld` is an asynchronous instruction. Why, then, does `02_mma_tma_sm100.cu` not call `tcgen05.wait::ld` (or a wrapper around it) after the TMEM load and before the loaded registers are read?

https://github.com/NVIDIA/cutlass/blob/3b054767b3fbc86de4985f0537068eceef9345be/examples/cute/tutorial/blackwell/02_mma_tma_sm100.cu#L354-L360

	// Load TMEM -> RMEM
	copy(tiled_t2r_copy, tDtAcc, tDrAcc);

	// AXPBY RMEM -> RMEM: tDrC = alpha * tDrAcc + beta * tDrC
	axpby(alpha, tDrAcc, beta, tDrC);
	// Store RMEM -> GMEM
	copy(tDrC, tDgD);
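
For concreteness, here is a minimal sketch of the kind of wait the question has in mind. The helper name `tmem_load_wait` is hypothetical and not taken from CUTLASS, and the guard macro assumes the file is compiled for the architecture-specific `sm_100a` target; treat it as an illustration of the expected placement, not as the example's intended fix.

```cuda
// Hypothetical helper (the name is illustrative, not a CUTLASS API).
// tcgen05.wait::ld blocks the executing thread until all of its prior
// tcgen05.ld operations have completed.
__device__ __forceinline__ void tmem_load_wait() {
#if defined(__CUDA_ARCH_FEAT_SM100_ALL)  // assumption: built for sm_100a
  asm volatile("tcgen05.wait::ld.sync.aligned;" ::: "memory");
#endif
}

// Expected placement in the epilogue (fragment of the example, not standalone):
//   copy(tiled_t2r_copy, tDtAcc, tDrAcc);  // TMEM -> RMEM, asynchronous
//   tmem_load_wait();                      // wait before reading tDrAcc
//   axpby(alpha, tDrAcc, beta, tDrC);      // first consumer of tDrAcc
```

The wait sits between the copy and `axpby` because `axpby` is the first statement that reads `tDrAcc`.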

Why not use sync after loading from TMEM to RMEM in example 02_mma_tma_sm100.cu #2525
