[QST] Is there any fp16xfp16 GEMM sample using CUTE with a performance comparable to cublas? #1686
Labels
- feature request — New feature or request
- good first issue — Good first issue for contributors
- help wanted — Interested in support from contributors
- question — Question
What is your question?
I want to write my own fused fp16xfp16 GEMM kernel with CUTE, but I cannot find a tutorial or sample with performance comparable to cuBLAS.
I noticed the tutorials in https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial, which include fp32xfp32 and int8xint8 GEMM examples, but the performance of the int8xint8 GEMM is not good enough. I also found a third-party fp16xfp16 GEMM implementation using CUTE at https://github.com/leimao/CUDA-GEMM-Optimization?tab=readme-ov-file, but as its README shows, its performance is still not comparable to cuBLAS. So I wonder whether an official fp16xfp16 GEMM kernel with good performance could be provided in CUTE, so that I can develop on top of it?