[QST] How to do concurrent GEMMs ? #1418
Comments
Whether multiple grids from different streams run concurrently is a property of their occupancy, among other things. What are you trying to achieve? You had another thread a few days ago where it seemed like batched GEMM was sufficient for your use case. Generally speaking, horizontal fusions like that will yield the best results. Based on your requirements, you can pick batched GEMM, pointer-array batched GEMM, or grouped GEMM, in that order of preference.
Hello, I need to compute A0 x {B0,B1,B2,B3}, A1 x {B0,B1,B2,B3}, A2 x {B0,B1,B2,B3}, and A3 x {B0,B1,B2,B3}. Would doing this as 4 batched GEMMs be a good option (by setting the A stride to 0 in each GEMM)?
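To make the stride-0 idea concrete, here is a minimal host-side reference sketch of what a strided batched GEMM computes. It is not CUTLASS code: `batched_gemm_ref`, its row-major layout, and its parameter names are assumptions for illustration. The point is only that the batch offset is plain pointer arithmetic, so a batch stride of 0 for A makes every batch read the same A matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a strided batched GEMM (hypothetical helper, not a CUTLASS API):
// C_b = A_b * B_b for b in [0, batch). All matrices are row-major and
// batch b starts at base + b * stride. A stride of 0 reuses one matrix for all batches.
void batched_gemm_ref(int M, int N, int K, int batch,
                      const float* A, std::ptrdiff_t stride_a,
                      const float* B, std::ptrdiff_t stride_b,
                      float* C, std::ptrdiff_t stride_c) {
    assert(M > 0 && N > 0 && K > 0 && batch > 0);
    for (int b = 0; b < batch; ++b) {
        const float* a = A + b * stride_a;   // same pointer for every b when stride_a == 0
        const float* bm = B + b * stride_b;
        float* c = C + b * stride_c;
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k) {
                    acc += a[i * K + k] * bm[k * N + j];
                }
                c[i * N + j] = acc;
            }
        }
    }
}
```

With B0..B3 stored back to back, one call with `batch = 4`, `stride_a = 0`, and `stride_b = K * N` covers A0 x {B0,B1,B2,B3}; the other three A matrices would each need their own call, which is the "4 batched GEMMs" layout from the question.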
If A[0-3] all have the same size and B[0-3] all have the same size, use batched GEMM; otherwise use grouped GEMM. Your problem size is too small for multiple streams to run at the same time: by the time the second grid is launched, the first one has already finished.
I ended up using GemmArray, since that let me pass pointers instead of allocating the same matrices multiple times. I did try using streams, and the kernels did not run concurrently. The NVIDIA profiler reports only 50% compute throughput for the kernel. Is that normal for this kind of GEMM implementation?
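For anyone following along, the pointer bookkeeping for the array approach can be sketched on the host roughly as below. This is an illustration under assumptions, not the code from this issue: `GemmPointers` and `make_pair_pointers` are hypothetical names, and the output layout (one contiguous buffer, pair (i, j) at slot `i * num_b + j`) is just one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the three pointer arrays of a pointer-array batched GEMM.
struct GemmPointers {
    std::vector<const float*> a;
    std::vector<const float*> b;
    std::vector<float*> c;
};

// Build one batch entry per (A_i, B_j) pair. Each A and B pointer is repeated
// as needed, so no matrix is duplicated in memory. The output of pair (i, j)
// is placed at c_base + (i * bs.size() + j) * c_elems.
GemmPointers make_pair_pointers(const std::vector<const float*>& as,
                                const std::vector<const float*>& bs,
                                float* c_base, std::size_t c_elems) {
    assert(!as.empty() && !bs.empty() && c_base != nullptr);
    GemmPointers p;
    for (std::size_t i = 0; i < as.size(); ++i) {
        for (std::size_t j = 0; j < bs.size(); ++j) {
            p.a.push_back(as[i]);
            p.b.push_back(bs[j]);
            p.c.push_back(c_base + (i * bs.size() + j) * c_elems);
        }
    }
    return p;
}
```

With four A matrices and four B matrices this yields a batch count of 16 from a single launch. On the GPU the three arrays would hold device pointers and would themselves need to be copied to device memory before being passed to the kernel.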
Your kernel is too tiny; it is memory bound. 50% sounds reasonable.
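A rough way to sanity-check the "memory bound" claim is a roofline-style estimate. This is a sketch only: the function names are made up for illustration, the traffic model assumes each matrix crosses DRAM exactly once, and any peak numbers you plug in are your own hardware assumptions, not values from this thread.

```cpp
#include <cassert>
#include <cstddef>

// FLOPs per byte for C = A * B with A of size M x K and B of size K x N,
// assuming each of A, B, and C moves through DRAM exactly once (idealized).
double gemm_arithmetic_intensity(std::size_t M, std::size_t N, std::size_t K,
                                 std::size_t bytes_per_elem) {
    assert(M > 0 && N > 0 && K > 0 && bytes_per_elem > 0);
    const double flops = 2.0 * M * N * K;  // one multiply and one add per inner-product term
    const double bytes = static_cast<double>(bytes_per_elem) * (M * K + K * N + M * N);
    return flops / bytes;
}

// A kernel is memory bound under the roofline model when its arithmetic
// intensity is below the machine balance (peak FLOP/s divided by peak bytes/s).
bool is_memory_bound(double intensity, double peak_flops, double peak_bytes_per_s) {
    return intensity < peak_flops / peak_bytes_per_s;
}
```

For example, a 32 x 32 x 32 float GEMM comes out at about 5.3 FLOPs per byte, which is below the machine balance of most GPUs. A 4096 x 4096 x 4096 GEMM is around 683 FLOPs per byte.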
Even when using a batch size on the order of hundreds? Thank you.
Hello,
Is it possible to launch concurrent GEMMs from the host (using more CPU threads only as a last resort)? I have used streams, but the GEMMs run sequentially rather than concurrently (which was what I expected from the code). Is there a way to do it, or is there a more efficient way using another kind of GEMM instead of the basic template? Thank you. Below is the code I am using to call the GEMMs.