[QST] How does make_tiled_copy_A determine the source address for copying? #2232
Comments
@ccecka Thanks for your reply! I just have one remaining question about this:
The Thr-Val layout of LHS of this CPY seems In other words, without using
The complete partitioning patterns can't be derived from the Atoms, only the
Take the following code as an example:
In this implementation, `SM80_16x8x16_F16F16F16F16_TN` is used as the `tiled_mma`, while `SM75_U32x4_LDSM_N` serves as the copy operation. The `tiled_mma` is then used to construct `s2r_tiled_copy_a`.

As we know, in the copy function for matrix A, each thread supplies a source address, and together the threads copy a 16x16 matrix. For `SM80_16x8x16_F16F16F16F16_TN` and `SM75_U32x4_LDSM_N`, the source address layout for the copy operation is as follows:

I'm confused about how `make_tiled_copy_A` assigns source addresses to each thread. Specifically, what determines the thread-to-address mapping pattern shown in the upper diagram versus the alternative pattern below?

From what I understand, `mma_atom` and `Copy_Atom` don't seem to provide the relevant information:
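To make the two candidate mappings concrete, here is a minimal sketch in plain C++ (not CuTe code) of the source coordinate each lane would pass to a single non-transposed `ldmatrix.x4` over a 16x16 tile of 16-bit elements. It relies on two facts from the PTX documentation: lanes 0-7 supply the row addresses of the first 8x8 matrix, lanes 8-15 the second, and so on, and destination register `i` of each lane is filled from matrix `i`. Everything else is an assumption made for illustration: the tile is taken to be row-major, the function names are hypothetical, and the two block orderings are only a guess at what the two diagrams in the question depict.

```cpp
#include <cassert>

// (row, col) of the first element of the 8-element row whose address
// a lane hands to ldmatrix.x4, inside a 16x16 tile.
struct Coord {
    int row;
    int col;
};

// Ordering 1 (hypothetical name): the four 8x8 blocks are visited
// top-left, bottom-left, top-right, bottom-right.
// This is the order of the A-fragment registers of mma.m16n8k16
// (registers 0..3 hold rows g, g+8, g, g+8 and columns k, k, k+8, k+8).
constexpr Coord ldsm_x4_src_colmajor_blocks(int lane) {
    const int mat = lane / 8;  // which 8x8 matrix this lane addresses (0..3)
    const int r   = lane % 8;  // row inside that 8x8 matrix
    return Coord{r + 8 * (mat % 2), 8 * (mat / 2)};
}

// Ordering 2 (hypothetical name): the blocks are visited
// top-left, top-right, bottom-left, bottom-right.
constexpr Coord ldsm_x4_src_rowmajor_blocks(int lane) {
    const int mat = lane / 8;
    const int r   = lane % 8;
    return Coord{r + 8 * (mat / 2), 8 * (mat % 2)};
}

// Element offset of a coordinate in a row-major tile with leading dimension ld.
constexpr int element_offset(Coord c, int ld) {
    return c.row * ld + c.col;
}

// Sanity check: the two orderings agree on lanes 0-7 and 24-31
// and swap the roles of lanes 8-15 and 16-23.
inline void self_check() {
    for (int lane = 0; lane < 8; ++lane) {
        assert(ldsm_x4_src_colmajor_blocks(lane).row ==
               ldsm_x4_src_rowmajor_blocks(lane).row);
        assert(ldsm_x4_src_colmajor_blocks(lane + 8).row ==
               ldsm_x4_src_rowmajor_blocks(lane + 16).row);
        assert(ldsm_x4_src_colmajor_blocks(lane + 8).col ==
               ldsm_x4_src_rowmajor_blocks(lane + 16).col);
    }
}
```

Under these assumptions, only the first ordering places each lane's four loaded registers where the `SM80_16x8x16_F16F16F16F16_TN` A operand expects them, which suggests the choice between the two patterns depends on the MMA's operand layout and not on the copy instruction alone.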