Skip to content

[AMDGPU] always emit a soft wait even if it is trivially ~0 #147257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ssahasra
Copy link
Collaborator

@ssahasra ssahasra commented Jul 7, 2025

The memory legalizer is currently responsible for emitting wait instructions at ordering operations such as acquire and release. It tries to be efficient by emitting waits only when required. In particular, it does not emit a wait on vmcnt at workgroup scope since that ordering is already guaranteed by the architecture. But this is now incorrect because direct loads to LDS have an LDS component which needs explicit ordering on vmcnt. But it is inefficient to always emit a wait on vmcnt since majority of the programs do not use direct loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction even if all counts are trivially ~0. This is a placeholder that the SIInsertWaitcnts pass will either optimize away or strengthen based on its analysis of whether direct loads to LDS are pending at this point in the program.

@llvmbot
Copy link
Member

llvmbot commented Jul 7, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Sameer Sahasrabuddhe (ssahasra)

Changes

The memory legalizer is currently responsible for emitting wait instructions at ordering operations such as acquire and release. It tries to be efficient by emitting waits only when required. In particular, it does not emit a wait on vmcnt at workgroup scope since that ordering is already guaranteed by the architecture. But this is now incorrect because direct loads to LDS have an LDS component which needs explicit ordering on vmcnt. But it is inefficient to always emit a wait on vmcnt since majority of the programs do not use direct loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction even if all counts are trivially ~0. This is a placeholder that the SIInsertWaitcnts pass will either optimize away or strenghthen based on its analysis of whether direct loads to LDS are pending at this point in the program.


Patch is 4.42 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147257.diff

41 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+25-33)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+112)
  • (modified) llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+66-66)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-vgpr-spill-mubuf-with-voffset.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+64)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+168-6)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+220-4)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (+160-32)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-singlethread.ll (+1420)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll (+160-32)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+14-2)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-wavefront.ll (+1410)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (+576-68)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (+192)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-singlethread.ll (+1152-52)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll (+168)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (+14-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-wavefront.ll (+1152-52)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (+706-82)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+940-98)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-nontemporal.ll (+3-2)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-singlethread.ll (+1548)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (+940-98)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+27-7)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-wavefront.ll (+1548)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (+940-98)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local.mir (+31)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll (+12)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-region.mir (+31)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/stacksave_stackrestore.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/wait-before-stores-with-scope_sys.mir (+1)
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 3212060f303a5..f015d3ad7811e 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -1074,8 +1074,6 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                     SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                     bool IsCrossAddrSpaceOrdering, Position Pos,
                                     AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -1149,21 +1147,19 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx6CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -1966,8 +1962,6 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2057,28 +2051,25 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (VSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
         .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
         .addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -2287,8 +2278,6 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2372,23 +2361,26 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
       BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_SAMPLECNT_soft)).addImm(0);
     }
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft)).addImm(0);
-    Changed = true;
+  } else {
+    // Always emit a soft wait count, even if it is trivially ~0.
+    // SIInsertWaitcnts will later use this marker to add additional waits such
+    // as those required from direct load to LDS (formerly known as LDS DMA).
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft))
+        .addImm(getLoadcntBitMask(IV));
   }
 
   if (STORECnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_STORECNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (DSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_DSCNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx12CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..1bbbec977b714 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -880,8 +880,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -921,8 +921,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -991,8 +991,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1032,8 +1032,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; UNALIGNED_GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -1520,8 +1520,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1633,8 +1633,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
index 66037615f0ba0..ea6a5d5e74d52 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
@@ -199,26 +199,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acquire() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acquire
@@ -228,26 +234,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_release() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") release
@@ -257,26 +269,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acq_rel
@@ -286,26 +304,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") seq_cst
@@ -501,10 +525,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acquire
@@ -516,6 +542,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acquire
@@ -527,6 +554,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acquire
@@ -536,10 +564,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_release
@@ -550,6 +580,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_release
@@ -560,6 +591,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") release
@@ -569,10 +601,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -584,6 +618,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -595,6 +630,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acq_rel
@@ -604,10 +640,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -619,6 +657,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -630,6 +669,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") seq_cst
@@ -639,26 +679,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acquire() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") acquire
@@ -668,26 +714,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_release() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") release
@@ -697,26 +749,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ;...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Jul 7, 2025

@llvm/pr-subscribers-llvm-globalisel

Author: Sameer Sahasrabuddhe (ssahasra)

Changes

The memory legalizer is currently responsible for emitting wait instructions at ordering operations such as acquire and release. It tries to be efficient by emitting waits only when required. In particular, it does not emit a wait on vmcnt at workgroup scope since that ordering is already guaranteed by the architecture. But this is now incorrect because direct loads to LDS have an LDS component which needs explicit ordering on vmcnt. But it is inefficient to always emit a wait on vmcnt since majority of the programs do not use direct loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction even if all counts are trivially ~0. This is a placeholder that the SIInsertWaitcnts pass will either optimize away or strenghthen based on its analysis of whether direct loads to LDS are pending at this point in the program.


Patch is 4.42 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147257.diff

41 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+25-33)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+112)
  • (modified) llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+66-66)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-vgpr-spill-mubuf-with-voffset.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+64)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+168-6)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+220-4)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (+160-32)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-singlethread.ll (+1420)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll (+160-32)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+14-2)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-wavefront.ll (+1410)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (+576-68)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (+192)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-singlethread.ll (+1152-52)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll (+168)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (+14-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-wavefront.ll (+1152-52)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (+706-82)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+940-98)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-nontemporal.ll (+3-2)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-singlethread.ll (+1548)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (+940-98)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+27-7)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-wavefront.ll (+1548)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (+940-98)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local.mir (+31)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll (+12)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-region.mir (+31)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/stacksave_stackrestore.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/wait-before-stores-with-scope_sys.mir (+1)
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 3212060f303a5..f015d3ad7811e 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -1074,8 +1074,6 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                     SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                     bool IsCrossAddrSpaceOrdering, Position Pos,
                                     AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -1149,21 +1147,19 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx6CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -1966,8 +1962,6 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2057,28 +2051,25 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (VSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
         .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
         .addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -2287,8 +2278,6 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2372,23 +2361,26 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
       BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_SAMPLECNT_soft)).addImm(0);
     }
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft)).addImm(0);
-    Changed = true;
+  } else {
+    // Always emit a soft wait count, even if it is trivially ~0.
+    // SIInsertWaitcnts will later use this marker to add additional waits such
+    // as those required from direct load to LDS (formerly known as LDS DMA).
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft))
+        .addImm(getLoadcntBitMask(IV));
   }
 
   if (STORECnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_STORECNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (DSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_DSCNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx12CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..1bbbec977b714 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -880,8 +880,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -921,8 +921,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -991,8 +991,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1032,8 +1032,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; UNALIGNED_GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -1520,8 +1520,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1633,8 +1633,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
index 66037615f0ba0..ea6a5d5e74d52 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
@@ -199,26 +199,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acquire() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acquire
@@ -228,26 +234,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_release() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") release
@@ -257,26 +269,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acq_rel
@@ -286,26 +304,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") seq_cst
@@ -501,10 +525,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acquire
@@ -516,6 +542,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acquire
@@ -527,6 +554,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acquire
@@ -536,10 +564,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_release
@@ -550,6 +580,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_release
@@ -560,6 +591,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") release
@@ -569,10 +601,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -584,6 +618,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -595,6 +630,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acq_rel
@@ -604,10 +640,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -619,6 +657,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -630,6 +669,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") seq_cst
@@ -639,26 +679,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acquire() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") acquire
@@ -668,26 +714,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_release() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") release
@@ -697,26 +749,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ;...
[truncated]

@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1(
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS
; GFX12-WGP-NEXT: s_wait_loadcnt 0x3f
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's unexpected right ? Same for the vmcnt wait above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should always be printed with the named counter syntax

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we agree with the basic design, then these are expected. There's a whole bunch of tests that either stop at the memory legalizer, or they run llc with -O0, like this one. The "trivial" wait counts show up in all these tests because SIInsertWaitcnts did not get a chance to clean it up. In particular, see how TrySimplify in that pass controls whether or not to clean up these wait counts. They disappear in the optimized ISA output.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should always be printed with the named counter syntax

I haven't check what's different about this wait count for it to be printed like this. Will need to follow it up as a separate change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The waitcnts aren't optimized out at O0 because we want to see them in memory legalizer tests, however we're mostly interested in the waitcnt zero, not the waitcnt ~0
We could still optimize out the ~0 ones, I don't think there is a downside to that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did consider that as an option. But there is the hypothetical corner case where the memory legalizer might deliberately compute the wait count to be so large that it gets clamped at the max value (not the same as ~0, strictly speaking). If that is not an issue, it will significantly reduce the diff for tests that don't stop after the legalizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid concern, though the MemoryLegalizer currently only inserts waitcnts 0, I think?
I also don't see why the memory legalizer would insert non-zero soft waitcnts, I think those would need to be non-soft (but that's not enforced anywhere, afaik)

Copy link
Collaborator Author

@ssahasra ssahasra Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to this discussion, but this line does exist:

   1390       // Merge consecutive waitcnt of the same type by erasing multiples.
   1391       if (WaitcntInstr || (!Wait.hasWaitExceptStoreCnt() && TrySimplify)) {

It is meant to preserver S_WAITCNT_soft even if there is no actual wait required. @jayfoad , you had introduced TrySimplify ... do you think it is okay to relax its uses?

   1373       if (TrySimplify || (Opcode != II.getOpcode() && OldWait.hasValuesSetToMax())
   1374         ScoreBrackets.simplifyWaitcnt(OldWait);

Here, hasValuesSetToMax() is a hypothetical function that checks the encoding of each count separately to have all bits set to 1, and not just a ~0 in the data structure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now cleaned up where possible. SIInsertWaitcnts will clean up any wait count at its maximum value even with optimizations enabled. But this is within the limitation that we can't simply assume that the legalizer had meant it to be a ~0. So instead we rely on simplifyWaitcnts() to do the correct thing by comparing scores.

Base automatically changed from users/ssahasra/waitcnt-update-tests to main July 8, 2025 09:30
ssahasra added 2 commits July 9, 2025 12:55
The memory legalizer is currently responsible for emitting wait instructions at
ordering operations such as acquire and release. It tries to be efficient by
emitting waits only when required. In particular, it does not emit a wait on
vmcnt at workgroup scope since that ordering is already guaranteed by the
architecture. But this is now incorrect because direct loads to LDS have an LDS
component which needs explicit ordering on vmcnt. But it is inefficient to
always emit a wait on vmcnt since majority of the programs do not use direct
loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction
even if all counts are trivially ~0. This is a placeholder that the
SIInsertWaitcnts pass will either optimize away or strenghthen based on its
analysis of whether direct loads to LDS are pending at this point in the
program.
@ssahasra ssahasra force-pushed the users/ssahasra/waitcnt-always-emit branch from 6e30f37 to e6831c6 Compare July 9, 2025 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants