I will PR this into this repo at some point, but I think I have a working source tree for Windows (MSVC) with CUTLASS 4.1 and CUDA 12.9.
For now I will just leave this here for crawlers and indexers. I have not fully benchmarked anything yet; I will post results at some point.
FYI: in the end I only got sm_120 (and ONLY sm_120) to build, so sm_100a is not supported, and neither is sm_90. (I hate NVIDIA's sm targets.) (Thanks for attending my TED talk.)
https://github.com/IISuperluminaLII/FlashMLA_Windows_sm120
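If you want to fail early on an unsupported GPU, here is a minimal sketch of a capability check. It is not code from the linked repo. The names `SUPPORTED_CAPABILITIES`, `arch_name`, and `is_supported` are hypothetical, and the only thing taken from the note above is that sm_120 is the sole target that builds.

```python
# Hypothetical helper, not part of the linked repo.
# Assumption: only compute capability 12.0 (sm_120) is supported, per the note above.
SUPPORTED_CAPABILITIES = {(12, 0)}


def arch_name(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_supported(capability):
    """Return True if the (major, minor) capability is in the supported set."""
    return tuple(capability) in SUPPORTED_CAPABILITIES


def require_supported(capability):
    """Raise a readable error when the capability is not supported."""
    if not is_supported(capability):
        raise RuntimeError(
            f"{arch_name(capability)} is not supported by this build; "
            f"supported: {sorted(arch_name(c) for c in SUPPORTED_CAPABILITIES)}"
        )
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed straight to `require_supported`.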