Skip to content

[BUG] Implicitly generate unexpected LDGSTS instructions for A100 #1231

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
cctry opened this issue Dec 4, 2023 · 3 comments
Open

[BUG] Implicitly generate unexpected LDGSTS instructions for A100 #1231

cctry opened this issue Dec 4, 2023 · 3 comments
Labels
bug Something isn't working inactive-30d inactive-90d

Comments

@cctry
Copy link

cctry commented Dec 4, 2023

Describe the bug
Using DefaultCopy on A100 implicitly generates the unexpected LDGSTS. Users are not aware of the need to commit and wait.

Steps/Code to reproduce bug

using GmemTiledCopy = decltype(make_tiled_copy(
    Copy_Atom<DefaultCopy, float>{},
    Layout<Shape<_16, _16>, Stride<_16, _1>>{}, 
    Layout<Shape<_1, _4>>{}));

__global__ void kernel(float *A) {
  __shared__ float smem[16 * 64];
  Tensor gA = make_tensor(make_gmem_ptr(A), Shape<_16, _64>{}, make_stride(64, _1{}));
  Tensor sA = make_tensor(make_smem_ptr(smem), Layout<Shape<_16, _64>, Stride<_64, _1>>{});
  GmemTiledCopy gmem_tiled_copy;
  auto gmem_thr_copy = gmem_tiled_copy.get_thread_slice(threadIdx.x);
  Tensor tAgA = gmem_thr_copy.partition_S(gA);
  Tensor tAsA = gmem_thr_copy.partition_D(sA);
  copy(gmem_tiled_copy, tAgA, tAsA);
}

This sample code generates the SASS when compiled with -arch=sm_80.

	code for sm_80
		Function : _Z6kernelPf
	.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM80 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM80)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                             /* 0x00000a0000017a02 */
                                                                                      /* 0x000fc40000000f00 */
        /*0010*/                   S2R R5, SR_TID.X ;                                 /* 0x0000000000057919 */
                                                                                      /* 0x000e220000002100 */
        /*0020*/                   HFMA2.MMA R3, -RZ, RZ, 0, 2.384185791015625e-07 ;  /* 0x00000004ff037435 */
                                                                                      /* 0x000fe200000001ff */
        /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                       /* 0x0000460000047ab9 */
                                                                                      /* 0x000fe20000000a00 */
        /*0040*/                   SHF.L.U32 R2, R5.reuse, 0x2, RZ ;                  /* 0x0000000205027819 */
                                                                                      /* 0x041fe400000006ff */
        /*0050*/                   SHF.L.U32 R5, R5, 0x4, RZ ;                        /* 0x0000000405057819 */
                                                                                      /* 0x000fcc00000006ff */
        /*0060*/                   IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x160] ;          /* 0x0000580002027625 */
                                                                                      /* 0x000fca00078e0003 */
        /*0070*/                   LDGSTS.E.LTC128B.128 [R5], [R2.64] ;               /* 0x0000000002057fae */
                                                                                      /* 0x000fe2000b921d44 */
        /*0080*/                   EXIT ;                                             /* 0x000000000000794d */
                                                                                      /* 0x000fea0003800000 */
        /*0090*/                   BRA 0x90;                                          /* 0xfffffff000007947 */
                                                                                      /* 0x000fc0000383ffff */
        /*00a0*/                   NOP;                                               /* 0x0000000000007918 */
                                                                                      /* 0x000fc00000000000 */
        /*00b0*/                   NOP;                                               /* 0x0000000000007918 */

Expected behavior
Unless the SM80_CP_ASYNC_* is explictly specified in copy_atom, it should not generate the LDGSTS instruction.

@cctry cctry added ? - Needs Triage bug Something isn't working labels Dec 4, 2023
@thakkarV
Copy link
Collaborator

thakkarV commented Dec 4, 2023

@ccecka I thought of this a year and half ago, but never brought it up. We really should not auto dispatch to LDGSTS on SM80 ...

Copy link

github-actions bot commented Jan 3, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Copy link

github-actions bot commented Apr 3, 2024

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working inactive-30d inactive-90d
Projects
None yet
Development

No branches or pull requests

2 participants