[AMDGPU] efficiently wait for direct loads to LDS at all scopes #147258

ssahasra · 2025-07-07T09:13:14Z

Currently, the memory legalizer does not generate any wait on vmcnt at workgroup
scope. This is incorrect because direct loads to LDS are tracked using vmcnt and
they need to be released properly at workgroup scope.

The memory legalizer was previously updated to always emit a soft wait
instruction even when all counts are trivially ~0. SIInsertWaitcnts now examines
pending loads to LDS at each S_WAITCNT_soft instruction. If such instructions
exist, the vmcnt (which could be ~0) is upgraded to a value that wiats for any
such pending loads to LDS. After that, any soft instruction that has only
trivial ~0 counts is automatically dropped.

Thus, common programs that do not use direct loads to LDS remain unaffected, but
programs that do use such loads see a correct and efficient vmcnt even at
workgroup scope.

llvmbot · 2025-07-07T09:13:47Z

@llvm/pr-subscribers-backend-amdgpu

Author: Sameer Sahasrabuddhe (ssahasra)

Changes

Currently, the memory legalizer does not generate any wait on vmcnt at workgroup
scope. This is incorrect because direct loads to LDS are tracked using vmcnt and
they need to be released properly at workgroup scope.

The memory legalizer was previously updated to always emit a soft wait
instruction even when all counts are trivially ~0. SIInsertWaitcnts now examines
pending loads to LDS at each S_WAITCNT_soft instruction. If such instructions
exist, the vmcnt (which could be ~0) is upgraded to a value that wiats for any
such pending loads to LDS. After that, any soft instruction that has only
trivial ~0 counts is automatically dropped.

Thus, common programs that do not use direct loads to LDS remain unaffected, but
programs that do use such loads see a correct and efficient vmcnt even at
workgroup scope.

Patch is 22.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147258.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+13)
(added) llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll (+482)

diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 7ce1359f03da6..b57cfe5d6f2c5 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -1374,6 +1374,19 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
         ScoreBrackets.simplifyWaitcnt(OldWait);
       Wait = Wait.combined(OldWait);
 
+      if (!WaitcntInstr && II.getOpcode() == AMDGPU::S_WAITCNT_soft) {
+        // Each direct load to LDS is also a store to LDS, but we do not have a
+        // separate counter for it. Instead these operations increment LOAD_CNT
+        // and need to be waited for at a release fence. So we treat a release
+        // fence as if it depends on any previous LDS DMA stores.
+        //
+        // Note that a user-specified S_WAITCNT instruction is not affected; we
+        // only check for S_WAITCNT_soft since that represents a fence.
+        //
+        // FIXME: How does one detect that a soft wait is a release???
+        ScoreBrackets.determineWait(LOAD_CNT, FIRST_LDS_VGPR, Wait);
+      }
+
       // Merge consecutive waitcnt of the same type by erasing multiples.
       if (WaitcntInstr || (!Wait.hasWaitExceptStoreCnt() && TrySimplify)) {
         II.eraseFromParent();
diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
new file mode 100644
index 0000000000000..882c43b41bac8
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
@@ -0,0 +1,482 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s --check-prefixes=GFX900
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a < %s | FileCheck %s --check-prefixes=GFX90A
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -mattr=+tgsplit < %s | FileCheck %s --check-prefixes=GFX90A-TGSPLIT
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 < %s | FileCheck %s --check-prefixes=GFX942
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -mattr=+tgsplit < %s | FileCheck %s --check-prefixes=GFX942-TGSPLIT
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s --check-prefixes=GFX1010
+
+; In each of these tests, an LDS DMA operation is followed by a release pattern
+; at workgroup scope. The fence in such a release (implicit or explicit) should
+; wait for the store component in the LDS DMA. The additional noalias metadata
+; is just meant to ensure that the wait counts are not generated due to some
+; unintended aliasing.
+
+declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)
+
+define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
+; GFX900-LABEL: barrier_release:
+; GFX900:       ; %bb.0: ; %main_body
+; GFX900-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x24
+; GFX900-NEXT:    v_mov_b32_e32 v0, 0x800
+; GFX900-NEXT:    v_mov_b32_e32 v1, 0
+; GFX900-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:    s_mov_b32 m0, s12
+; GFX900-NEXT:    s_nop 0
+; GFX900-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX900-NEXT:    v_mov_b32_e32 v0, s13
+; GFX900-NEXT:    s_waitcnt vmcnt(0)
+; GFX900-NEXT:    s_barrier
+; GFX900-NEXT:    ds_read_b32 v0, v0
+; GFX900-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:    global_store_dword v1, v0, s[14:15]
+; GFX900-NEXT:    s_endpgm
+;
+; GFX90A-LABEL: barrier_release:
+; GFX90A:       ; %bb.1:
+; GFX90A-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:    s_branch .LBB0_0
+; GFX90A-NEXT:    .p2align 8
+; GFX90A-NEXT:  ; %bb.2:
+; GFX90A-NEXT:  .LBB0_0: ; %main_body
+; GFX90A-NEXT:    s_mov_b32 m0, s12
+; GFX90A-NEXT:    v_mov_b32_e32 v0, 0x800
+; GFX90A-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX90A-NEXT:    v_mov_b32_e32 v0, s13
+; GFX90A-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x3c
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_barrier
+; GFX90A-NEXT:    ds_read_b32 v0, v0
+; GFX90A-NEXT:    v_mov_b32_e32 v1, 0
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: barrier_release:
+; GFX90A-TGSPLIT:       ; %bb.1:
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_branch .LBB0_0
+; GFX90A-TGSPLIT-NEXT:    .p2align 8
+; GFX90A-TGSPLIT-NEXT:  ; %bb.2:
+; GFX90A-TGSPLIT-NEXT:  .LBB0_0: ; %main_body
+; GFX90A-TGSPLIT-NEXT:    s_mov_b32 m0, s12
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v0, 0x800
+; GFX90A-TGSPLIT-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v0, s13
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x3c
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_barrier
+; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
+; GFX90A-TGSPLIT-NEXT:    ds_read_b32 v0, v0
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 0
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX942-LABEL: barrier_release:
+; GFX942:       ; %bb.1:
+; GFX942-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX942-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-NEXT:    s_branch .LBB0_0
+; GFX942-NEXT:    .p2align 8
+; GFX942-NEXT:  ; %bb.2:
+; GFX942-NEXT:  .LBB0_0: ; %main_body
+; GFX942-NEXT:    s_mov_b32 m0, s12
+; GFX942-NEXT:    v_mov_b32_e32 v0, 0x800
+; GFX942-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX942-NEXT:    v_mov_b32_e32 v0, s13
+; GFX942-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x3c
+; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    s_barrier
+; GFX942-NEXT:    ds_read_b32 v0, v0
+; GFX942-NEXT:    v_mov_b32_e32 v1, 0
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX942-NEXT:    s_endpgm
+;
+; GFX942-TGSPLIT-LABEL: barrier_release:
+; GFX942-TGSPLIT:       ; %bb.1:
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX942-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-TGSPLIT-NEXT:    s_branch .LBB0_0
+; GFX942-TGSPLIT-NEXT:    .p2align 8
+; GFX942-TGSPLIT-NEXT:  ; %bb.2:
+; GFX942-TGSPLIT-NEXT:  .LBB0_0: ; %main_body
+; GFX942-TGSPLIT-NEXT:    s_mov_b32 m0, s12
+; GFX942-TGSPLIT-NEXT:    v_mov_b32_e32 v0, 0x800
+; GFX942-TGSPLIT-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX942-TGSPLIT-NEXT:    v_mov_b32_e32 v0, s13
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x3c
+; GFX942-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-TGSPLIT-NEXT:    s_barrier
+; GFX942-TGSPLIT-NEXT:    buffer_inv sc0
+; GFX942-TGSPLIT-NEXT:    ds_read_b32 v0, v0
+; GFX942-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 0
+; GFX942-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-TGSPLIT-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX942-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX1010-LABEL: barrier_release:
+; GFX1010:       ; %bb.0: ; %main_body
+; GFX1010-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x24
+; GFX1010-NEXT:    v_mov_b32_e32 v0, 0x800
+; GFX1010-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1010-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1010-NEXT:    s_mov_b32 m0, s12
+; GFX1010-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX1010-NEXT:    v_mov_b32_e32 v0, s13
+; GFX1010-NEXT:    s_waitcnt vmcnt(0)
+; GFX1010-NEXT:    s_barrier
+; GFX1010-NEXT:    buffer_gl0_inv
+; GFX1010-NEXT:    ds_read_b32 v0, v0
+; GFX1010-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1010-NEXT:    global_store_dword v1, v0, s[14:15]
+; GFX1010-NEXT:    s_endpgm
+                                           ptr addrspace(3) inreg %lds1,
+                                           ptr addrspace(3) inreg %lds2,
+                                           ptr addrspace(1) %dummy2) {
+main_body:
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) %lds1, i32 4, i32 2048, i32 0, i32 0, i32 0), !alias.scope !102
+  fence syncscope("workgroup") release
+  tail call void @llvm.amdgcn.s.barrier()
+  fence syncscope("workgroup") acquire
+  %load = load i32, ptr addrspace(3) %lds2, align 4, !noalias !105
+  store i32 %load, ptr addrspace(1) %dummy2, align 4, !noalias !105
+  ret void
+}
+
+define amdgpu_kernel void @fence_fence(<4 x i32> inreg %rsrc,
+; GFX900-LABEL: fence_fence:
+; GFX900:       ; %bb.0: ; %main_body
+; GFX900-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX900-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX900-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x3c
+; GFX900-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX900-NEXT:    v_mov_b32_e32 v0, 0
+; GFX900-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:    s_mov_b32 m0, s6
+; GFX900-NEXT:    s_nop 0
+; GFX900-NEXT:    buffer_load_dword v1, s[0:3], 0 offen lds
+; GFX900-NEXT:    v_mov_b32_e32 v1, 1
+; GFX900-NEXT:    s_waitcnt vmcnt(0)
+; GFX900-NEXT:    global_store_dword v0, v1, s[8:9]
+; GFX900-NEXT:    global_load_dword v1, v0, s[8:9]
+; GFX900-NEXT:    s_waitcnt vmcnt(0)
+; GFX900-NEXT:    v_mov_b32_e32 v1, s7
+; GFX900-NEXT:    ds_read_b32 v1, v1
+; GFX900-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:    global_store_dword v0, v1, s[10:11]
+; GFX900-NEXT:    s_endpgm
+;
+; GFX90A-LABEL: fence_fence:
+; GFX90A:       ; %bb.1:
+; GFX90A-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:    s_branch .LBB1_0
+; GFX90A-NEXT:    .p2align 8
+; GFX90A-NEXT:  ; %bb.2:
+; GFX90A-NEXT:  .LBB1_0: ; %main_body
+; GFX90A-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x3c
+; GFX90A-NEXT:    s_mov_b32 m0, s12
+; GFX90A-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX90A-NEXT:    v_mov_b32_e32 v0, 0
+; GFX90A-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX90A-NEXT:    v_mov_b32_e32 v1, 1
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX90A-NEXT:    global_load_dword v1, v0, s[0:1]
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    v_mov_b32_e32 v1, s13
+; GFX90A-NEXT:    ds_read_b32 v1, v1
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: fence_fence:
+; GFX90A-TGSPLIT:       ; %bb.1:
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_branch .LBB1_0
+; GFX90A-TGSPLIT-NEXT:    .p2align 8
+; GFX90A-TGSPLIT-NEXT:  ; %bb.2:
+; GFX90A-TGSPLIT-NEXT:  .LBB1_0: ; %main_body
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x3c
+; GFX90A-TGSPLIT-NEXT:    s_mov_b32 m0, s12
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v0, 0
+; GFX90A-TGSPLIT-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 1
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX90A-TGSPLIT-NEXT:    global_load_dword v1, v0, s[0:1] glc
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, s13
+; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
+; GFX90A-TGSPLIT-NEXT:    ds_read_b32 v1, v1
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX942-LABEL: fence_fence:
+; GFX942:       ; %bb.1:
+; GFX942-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX942-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-NEXT:    s_branch .LBB1_0
+; GFX942-NEXT:    .p2align 8
+; GFX942-NEXT:  ; %bb.2:
+; GFX942-NEXT:  .LBB1_0: ; %main_body
+; GFX942-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x3c
+; GFX942-NEXT:    s_mov_b32 m0, s12
+; GFX942-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX942-NEXT:    v_mov_b32_e32 v0, 0
+; GFX942-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX942-NEXT:    v_mov_b32_e32 v1, 1
+; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    global_store_dword v0, v1, s[0:1] sc0
+; GFX942-NEXT:    global_load_dword v1, v0, s[0:1] sc0
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
+; GFX942-NEXT:    v_mov_b32_e32 v1, s13
+; GFX942-NEXT:    ds_read_b32 v1, v1
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX942-NEXT:    s_endpgm
+;
+; GFX942-TGSPLIT-LABEL: fence_fence:
+; GFX942-TGSPLIT:       ; %bb.1:
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX942-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-TGSPLIT-NEXT:    s_branch .LBB1_0
+; GFX942-TGSPLIT-NEXT:    .p2align 8
+; GFX942-TGSPLIT-NEXT:  ; %bb.2:
+; GFX942-TGSPLIT-NEXT:  .LBB1_0: ; %main_body
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x3c
+; GFX942-TGSPLIT-NEXT:    s_mov_b32 m0, s12
+; GFX942-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX942-TGSPLIT-NEXT:    v_mov_b32_e32 v0, 0
+; GFX942-TGSPLIT-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX942-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 1
+; GFX942-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-TGSPLIT-NEXT:    global_store_dword v0, v1, s[0:1] sc0
+; GFX942-TGSPLIT-NEXT:    global_load_dword v1, v0, s[0:1] sc0
+; GFX942-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX942-TGSPLIT-NEXT:    v_mov_b32_e32 v1, s13
+; GFX942-TGSPLIT-NEXT:    buffer_inv sc0
+; GFX942-TGSPLIT-NEXT:    ds_read_b32 v1, v1
+; GFX942-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-TGSPLIT-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX942-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX1010-LABEL: fence_fence:
+; GFX1010:       ; %bb.0: ; %main_body
+; GFX1010-NEXT:    s_clause 0x2
+; GFX1010-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX1010-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX1010-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x3c
+; GFX1010-NEXT:    v_mov_b32_e32 v0, 0x800
+; GFX1010-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1010-NEXT:    v_mov_b32_e32 v2, 1
+; GFX1010-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1010-NEXT:    s_mov_b32 m0, s6
+; GFX1010-NEXT:    buffer_load_dword v0, s[0:3], 0 offen lds
+; GFX1010-NEXT:    s_waitcnt vmcnt(0)
+; GFX1010-NEXT:    global_store_dword v1, v2, s[8:9]
+; GFX1010-NEXT:    global_load_dword v0, v1, s[8:9] glc
+; GFX1010-NEXT:    s_waitcnt vmcnt(0)
+; GFX1010-NEXT:    v_mov_b32_e32 v0, s7
+; GFX1010-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX1010-NEXT:    buffer_gl0_inv
+; GFX1010-NEXT:    ds_read_b32 v0, v0
+; GFX1010-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1010-NEXT:    global_store_dword v1, v0, s[10:11]
+; GFX1010-NEXT:    s_endpgm
+                                       ptr addrspace(3) inreg %lds1,
+                                       ptr addrspace(3) inreg %lds2,
+                                       ptr addrspace(1) %flag,
+                                       ptr addrspace(1) %dummy2) {
+main_body:
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) %lds1, i32 4, i32 2048, i32 0, i32 0, i32 0), !alias.scope !102
+  fence syncscope("workgroup") release
+  store atomic i32 1, ptr addrspace(1) %flag syncscope("workgroup") monotonic, align 4, !noalias !105
+  %unused_flag = load atomic i32, ptr addrspace(1) %flag syncscope("workgroup") monotonic, align 4, !noalias !105
+  fence syncscope("workgroup") acquire
+  %load = load i32, ptr addrspace(3) %lds2, align 4, !noalias !105
+  store i32 %load, ptr addrspace(1) %dummy2, align 4, !noalias !105
+  ret void
+}
+
+define amdgpu_kernel void @release_acquire(<4 x i32> inreg %rsrc,
+; GFX900-LABEL: release_acquire:
+; GFX900:       ; %bb.0: ; %main_body
+; GFX900-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX900-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX900-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x3c
+; GFX900-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX900-NEXT:    v_mov_b32_e32 v0, 0
+; GFX900-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:    s_mov_b32 m0, s6
+; GFX900-NEXT:    s_nop 0
+; GFX900-NEXT:    buffer_load_dword v1, s[0:3], 0 offen lds
+; GFX900-NEXT:    v_mov_b32_e32 v1, 1
+; GFX900-NEXT:    s_waitcnt vmcnt(0)
+; GFX900-NEXT:    global_store_dword v0, v1, s[8:9]
+; GFX900-NEXT:    global_load_dword v1, v0, s[8:9]
+; GFX900-NEXT:    s_waitcnt vmcnt(0)
+; GFX900-NEXT:    v_mov_b32_e32 v1, s7
+; GFX900-NEXT:    ds_read_b32 v1, v1
+; GFX900-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:    global_store_dword v0, v1, s[10:11]
+; GFX900-NEXT:    s_endpgm
+;
+; GFX90A-LABEL: release_acquire:
+; GFX90A:       ; %bb.1:
+; GFX90A-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:    s_branch .LBB2_0
+; GFX90A-NEXT:    .p2align 8
+; GFX90A-NEXT:  ; %bb.2:
+; GFX90A-NEXT:  .LBB2_0: ; %main_body
+; GFX90A-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x3c
+; GFX90A-NEXT:    s_mov_b32 m0, s12
+; GFX90A-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX90A-NEXT:    v_mov_b32_e32 v0, 0
+; GFX90A-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX90A-NEXT:    v_mov_b32_e32 v1, 1
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX90A-NEXT:    global_load_dword v1, v0, s[0:1]
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    v_mov_b32_e32 v1, s13
+; GFX90A-NEXT:    ds_read_b32 v1, v1
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: release_acquire:
+; GFX90A-TGSPLIT:       ; %bb.1:
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_branch .LBB2_0
+; GFX90A-TGSPLIT-NEXT:    .p2align 8
+; GFX90A-TGSPLIT-NEXT:  ; %bb.2:
+; GFX90A-TGSPLIT-NEXT:  .LBB2_0: ; %main_body
+; GFX90A-TGSPLIT-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x3c
+; GFX90A-TGSPLIT-NEXT:    s_mov_b32 m0, s12
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v0, 0
+; GFX90A-TGSPLIT-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, 1
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX90A-TGSPLIT-NEXT:    global_load_dword v1, v0, s[0:1] glc
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, s13
+; GFX90A-TGSPLIT-NEXT:    ds_read_b32 v1, v1
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX942-LABEL: release_acquire:
+; GFX942:       ; %bb.1:
+; GFX942-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX942-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-NEXT:    s_branch .LBB2_0
+; GFX942-NEXT:    .p2align 8
+; GFX942-NEXT:  ; %bb.2:
+; GFX942-NEXT:  .LBB2_0: ; %main_body
+; GFX942-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x3c
+; GFX942-NEXT:    s_mov_b32 m0, s12
+; GFX942-NEXT:    v_mov_b32_e32 v1, 0x800
+; GFX942-NEXT:    v_mov_b32_e32 v0, 0
+; GFX942-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX942-NEXT:    v_mov_b32_e32 v1, 1
+; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    global_store_dword v0, v1, s[0:1] sc0
+; GFX942-NEXT:    global_load_dword v1, v0, s[0:1] sc0
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
+; GFX942-NEXT:    v_mov_b32_e32 v1, s13
+; GFX942-NEXT:    ds_read_b32 v1, v1
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX942-NEXT:    s_endpgm
+;
+; GFX942-TGSPLIT-LABEL: release_acquire:
+; GFX942-TGSPLIT:       ; %bb.1:
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX942-TGSPLIT-NEXT:    s_load_dwordx2...
[truncated]

ssahasra · 2025-07-07T09:16:19Z

This is part of a stack:

Currently, the memory legalizer does not generate any wait on vmcnt at workgroup scope. This is incorrect because direct loads to LDS are tracked using vmcnt and they need to be released properly at workgroup scope. The memory legalizer was previously updated to always emit a soft wait instruction even when all counts are trivially ~0. SIInsertWaitcnts now examines pending loads to LDS at each S_WAITCNT_soft instruction. If such instructions exist, the vmcnt (which could be ~0) is upgraded to a value that wiats for any such pending loads to LDS. After that, any soft instruction that has only trivial ~0 counts is automatically dropped. Thus, common programs that do not use direct loads to LDS remain unaffected, but programs that do use such loads see a correct and efficient vmcnt even at workgroup scope.

ssahasra · 2025-07-09T08:26:43Z

Note that the best way to see the effect of this PR is to view only the second diff of the two in this PR. It shows how the missing vmcnt(0) shows up in the new test introduced by the first commit.

ssahasra requested review from jayfoad, arsenm, kerbowa, t-tye and Pierre-vh July 7, 2025 09:13

llvmbot added the backend:AMDGPU label Jul 7, 2025

This was referenced Jul 7, 2025

[AMDGPU] auto update some tests to prepare for future changes #147256

Merged

[AMDGPU] always emit a soft wait even if it is trivially ~0 #147257

Open

ssahasra added 2 commits July 9, 2025 12:57

[AMDGCN] pre-checkin test for LDS DMA and release operations

95ffad8

ssahasra force-pushed the users/ssahasra/waitcnt-always-emit branch from 6e30f37 to e6831c6 Compare July 9, 2025 08:18

ssahasra force-pushed the users/ssahasra/waitcnt-lds-direct branch from 9b0cdd6 to 65562b2 Compare July 9, 2025 08:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AMDGPU] efficiently wait for direct loads to LDS at all scopes #147258

[AMDGPU] efficiently wait for direct loads to LDS at all scopes #147258

ssahasra commented Jul 7, 2025

Uh oh!

llvmbot commented Jul 7, 2025

Uh oh!

ssahasra commented Jul 7, 2025

Uh oh!

ssahasra commented Jul 9, 2025

Uh oh!

Uh oh!

[AMDGPU] efficiently wait for direct loads to LDS at all scopes #147258

Are you sure you want to change the base?

[AMDGPU] efficiently wait for direct loads to LDS at all scopes #147258

Conversation

ssahasra commented Jul 7, 2025

Uh oh!

llvmbot commented Jul 7, 2025

Uh oh!

ssahasra commented Jul 7, 2025

Uh oh!

ssahasra commented Jul 9, 2025

Uh oh!

Uh oh!