[QST] Is there any fp16xfp16 GEMM sample using CUTE with a performance comparable to cublas? #1686
Labels
- feature request — New feature or request
- good first issue — Good first issue for contributors
- help wanted — Interested in support from contributors
- question — Question
What is your question?
I want to write my own fused fp16xfp16 GEMM kernel with CUTE, but I cannot find a tutorial or sample with performance comparable to cuBLAS.
I noticed the tutorials in https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial, which include fp32xfp32 and int8xint8 GEMM examples, but the performance of the int8xint8 GEMM is not good enough. I also found a third-party fp16xfp16 GEMM implementation using CUTE at https://github.com/leimao/CUDA-GEMM-Optimization?tab=readme-ov-file, but as its README shows, its performance is still not comparable to cuBLAS. So I wonder whether an official fp16xfp16 GEMM kernel with good performance could be provided in CUTE, so that I can develop on top of it?