Just a learning exercise so I can understand hardware better and design non-stupid architectures. CuTe DSL is really good; everyone should try it.
[X] Initial Version without much performance tuning
[X] Tiling + TMA to load tiles.
[X] Ensure Swizzled SMEM access for Tensor Cores
[X] Add pipelining to hide TMA load latency
[] Add Epilogue
[] Tune Performance and get benchmarks
[] Allow arbitrary matrix sizes that do not fit the tiling
[] FA-3 implementation in CuTe
[] NSA implementation in CuTe
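
For the unchecked "arbitrary matrix sizes" item, one common approach is to round each dimension up to a whole number of tiles and then mask or zero-pad the out-of-bounds part. The sketch below is plain Python and only shows the bookkeeping. `ceil_div`, `tile_plan`, and the 128x128x64 tile sizes are hypothetical names and values chosen for illustration, not taken from gemm.py or from the CuTe DSL API.

```python
# Illustrative only: tile sizes are assumptions, not values from gemm.py.
def ceil_div(a: int, b: int) -> int:
    """Smallest integer q such that q * b >= a (for positive b)."""
    return (a + b - 1) // b


def tile_plan(m: int, n: int, k: int,
              tile_m: int = 128, tile_n: int = 128, tile_k: int = 64) -> dict:
    """Describe how an M x N x K GEMM maps onto fixed-size tiles."""
    # One CTA per output tile: the grid covers M and N, rounding up.
    grid = (ceil_div(m, tile_m), ceil_div(n, tile_n))
    # Number of K-slices each CTA loops over in its mainloop.
    k_tiles = ceil_div(k, tile_k)
    # Extents after rounding up to whole tiles.
    padded = (grid[0] * tile_m, grid[1] * tile_n, k_tiles * tile_k)
    return {
        "grid": grid,
        "k_tiles": k_tiles,
        "padded": padded,
        # True when some tile hangs over the edge and needs masking or padding.
        "needs_masking": padded != (m, n, k),
    }


if __name__ == "__main__":
    print(tile_plan(300, 200, 100))
```

With these assumed tile sizes, a 300x200x100 problem needs a 3x2 grid and 2 K-slices, and the edge tiles have to be masked or padded.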
- Ensure that cuda-toolkit-12.9 is installed. Other versions will not work with CuTe DSL.
pip install uv && uv pip install nvidia-cutlass-dsl torch
python gemm.py