GitHub - punwai/awesome-cute: CuTe, simple kernels: GEMM, FA-3, NSA

Efficient GEMM implementation in CuTe DSL for the Hopper GPU

Just a learning exercise so I can understand the hardware better and design non-stupid architectures. CuTe DSL is really good. Everyone should try it.

5/28/2025

[X] Initial Version without much performance tuning

[X] Tiling + TMA to load tiles.

[X] Ensure Swizzled SMEM access for Tensor Cores

[X] Add pipelining to hide TMA load latency
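
To make the swizzling item above concrete, here is a minimal sketch in plain Python (not CuTe DSL, and not the code in gemm.py). It assumes the XOR form used by CuTe's `Swizzle<B, M, S>` applied to byte offsets, with `B=3, M=4, S=3`, on a tile whose rows are 128 bytes wide. The helper names `chunk_slot` and `column_slots` are made up for illustration. A 16-byte chunk covers four 4-byte banks, so rows that land in the same chunk slot hit the same banks.

```python
def swizzle(offset, bits=3, base=4, shift=3):
    """XOR `bits` bits starting at bit (base + shift) into the bits starting at `base`."""
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)


ROW_BYTES = 128    # one shared-memory row: 32 banks * 4 bytes
CHUNK_BYTES = 16   # one 16-byte chunk spans 4 banks


def chunk_slot(byte_offset):
    """Which of the 8 chunk slots (groups of 4 banks) a byte offset falls into."""
    return (byte_offset % ROW_BYTES) // CHUNK_BYTES


def column_slots(col_chunk, rows=8, use_swizzle=True):
    """Chunk slots touched when reading the same 16-byte column chunk from each row."""
    slots = []
    for r in range(rows):
        off = r * ROW_BYTES + col_chunk * CHUNK_BYTES
        if use_swizzle:
            off = swizzle(off)
        slots.append(chunk_slot(off))
    return slots


if __name__ == "__main__":
    print(column_slots(0, use_swizzle=False))  # [0, 0, 0, 0, 0, 0, 0, 0]
    print(column_slots(0, use_swizzle=True))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Without the swizzle, a column read across eight rows lands in one slot (an 8-way conflict in this model). With it, each row is shifted to a different slot, and the mapping stays a permutation, so no data is lost.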

5/29/2025

[] Add Epilogue

[] Tune Performance and get benchmarks

[] Allow arbitrary matrices that do not fit the tiling
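
For the last item above, one way to think about shapes that do not divide evenly into tiles is a ceil-divided tile grid with clamped edge tiles. The sketch below is a plain-Python reference, not the kernel in gemm.py. The function names and the tiny tile sizes are illustrative assumptions, and a real Hopper kernel would more likely predicate the edge tiles or rely on out-of-bounds handling in the copy rather than clamp loop bounds.

```python
def ceil_div(a, b):
    """Number of tiles of size b needed to cover a elements."""
    return (a + b - 1) // b


def tiled_gemm(A, B, tile_m=4, tile_n=4, tile_k=4):
    """Compute C = A @ B tile by tile; edge tiles are clamped to the matrix bounds."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for bm in range(ceil_div(M, tile_m)):          # tile rows of C
        m0, m1 = bm * tile_m, min((bm + 1) * tile_m, M)
        for bn in range(ceil_div(N, tile_n)):      # tile columns of C
            n0, n1 = bn * tile_n, min((bn + 1) * tile_n, N)
            for bk in range(ceil_div(K, tile_k)):  # march along K, accumulating
                k0, k1 = bk * tile_k, min((bk + 1) * tile_k, K)
                for i in range(m0, m1):
                    for j in range(n0, n1):
                        acc = C[i][j]
                        for k in range(k0, k1):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


if __name__ == "__main__":
    # 5x3 times 3x7: none of the dimensions is a multiple of the tile size.
    A = [[i + k for k in range(3)] for i in range(5)]
    B = [[k * j for j in range(7)] for k in range(3)]
    C = tiled_gemm(A, B)
    print(len(C), len(C[0]))  # 5 7
```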

5/30/2025-???

[] FA-3 implementation in CuTe

[] NSA implementation in CuTe

Setup:

Ensure that cuda-toolkit-12.9 is installed. Other versions will not work with CuTe DSL.
pip install uv && uv pip install nvidia-cutlass-dsl torch
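
A quick sanity check after installing can save a confusing first run. The sketch below is not part of the repository. It assumes that `nvidia-cutlass-dsl` is importable as `cutlass` and that the CUDA toolkit puts `nvcc` on the PATH. Both are assumptions to verify on your machine, and the script does not check the toolkit version.

```python
import importlib.util
import shutil


def check_environment(modules=("cutlass", "torch")):
    """Report which modules can be found and whether nvcc is on the PATH."""
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # Assumption: an installed CUDA toolkit exposes nvcc on the PATH.
    report["nvcc"] = shutil.which("nvcc") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```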

Run:

python gemm.py

