[QST] Is there any fp16xfp16 GEMM sample using CUTE with a performance comparable to cublas? #1686

Open
xiaonans opened this issue Aug 6, 2024 · 0 comments
Labels
feature request (New feature or request), good first issue (Good first issue for contributors), help wanted (Interested in support from contributors), question (Question)

Comments


xiaonans commented Aug 6, 2024

What is your question?
I want to write my own fused fp16xfp16 GEMM kernel with CUTE, but I cannot find a tutorial or sample code whose performance is comparable to cuBLAS.

I noticed there are some tutorials in https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial, which include fp32xfp32 and int8xint8 GEMMs, but the performance of the int8xint8 GEMM is not good enough. I also noticed a third-party fp16xfp16 GEMM written with CUTE at https://github.com/leimao/CUDA-GEMM-Optimization?tab=readme-ov-file, but as shown in its README, the performance is still not comparable to cuBLAS. So I wonder whether CUTE can offer an official fp16xfp16 GEMM kernel with good performance that I can build on?
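For concreteness, the baseline I am comparing against is plain cuBLAS fp16 GEMM. A minimal sketch of how such a baseline can be timed, using `cublasGemmEx` with fp16 inputs/outputs and fp32 accumulation (the matrix sizes and iteration count here are illustrative, not from any particular benchmark):

```cpp
// Minimal cuBLAS fp16 GEMM baseline timing sketch.
// Build with: nvcc gemm_baseline.cu -lcublas -o gemm_baseline
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int M = 4096, N = 4096, K = 4096;  // illustrative sizes
    half *A, *B, *C;
    cudaMalloc(&A, sizeof(half) * (size_t)M * K);
    cudaMalloc(&B, sizeof(half) * (size_t)K * N);
    cudaMalloc(&C, sizeof(half) * (size_t)M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // alpha/beta are float because the compute type below is fp32.
    const float alpha = 1.0f, beta = 0.0f;

    auto gemm = [&] {
        // fp16 A/B/C, fp32 accumulation; cuBLAS picks a Tensor Core
        // algorithm where available.
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                     &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
                     &beta, C, CUDA_R_16F, M,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    };

    gemm();  // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) gemm();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 2*M*N*K flops per GEMM.
    const double tflops = 2.0 * M * N * K * iters / (ms * 1e-3) / 1e12;
    printf("avg %.3f ms/iter, %.1f TFLOPS\n", ms / iters, tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

A CUTE-based kernel whose measured TFLOPS is within a few percent of this number is what I would call "comparable to cuBLAS".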

@mnicely added the feature request, help wanted, and good first issue labels and removed the ? - Needs Triage label on Aug 6, 2024