[LoopStrengthReduce] Mitigation of issues introduced by compilation time optimization in SolveRecurse. #147588

Open
wants to merge 3 commits into main

Conversation

SergeyShch01

This optimization rejects formulas that do not contain registers which are already part of the current solution and which also appear in other formulas of the current LSRUse. It significantly reduces compilation time, but it also has side effects, which are mitigated/fixed as follows:

  • The best solution found becomes dependent on the ordering of the LSRUses, since at any given time the registers included in the current solution (and used to filter formulas) are determined by the already-processed LSRUses, i.e. those for which formulas have already been chosen. Decisions made for LSRUses processed earlier therefore directly affect decisions made for LSRUses processed later. In the current implementation the LSRUse ordering is effectively arbitrary, as it is determined by the traversal order of the IV instructions' users. Besides making the performance of the generated code effectively random, this can lead to performance fluctuations whenever that traversal order changes for some reason (it can even differ between clang and llc runs). To minimize this effect, the patch sorts the LSRUses to establish a more-or-less stable ordering (fluctuations are still possible for LSRUses that look identical). This changes the loop-reduce output in many regression tests, which are updated accordingly.

  • The formula filtering in SolveRecurse rejects every formula that does not contain the maximum possible number of required registers, even though it can happen that no formula does (for example, with two formulas that have two registers and one required register each, all formulas are rejected because neither contains two required registers). The change fixes this by gradually decreasing the number of required registers a formula must contain whenever all formulas are rejected; both the sorting and this relaxation are illustrated in the sketch after this description.

The change also introduces an early return if an LSRUse without formulas is found, since that situation is unsolvable.
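As a rough illustration of both points, here is a minimal standalone sketch. It is plain C++ and not taken from the patch; the `Use`/`Formula` types, the integer register model, and the helper names are invented for illustration. It shows a stable sort of the uses by a descending key so that larger uses are processed first, and a filter that starts by demanding all required registers and then relaxes the requirement one register at a time instead of rejecting every formula and giving up.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <set>
#include <tuple>
#include <vector>

// Toy model: each formula is just the set of registers it reads, and a
// "use" owns its candidate formulas plus the union of their registers.
using Formula = std::set<int>;

struct Use {
  std::vector<Formula> Formulae;
  std::set<int> Regs;
};

// Idea 1: give the uses a stable, non-arbitrary processing order. Uses with
// more registers and more formulae go first, so early pruning decisions are
// made where the search space is largest.
void sortUses(std::vector<Use> &Uses) {
  std::stable_sort(Uses.begin(), Uses.end(), [](const Use &L, const Use &R) {
    auto Key = [](const Use &U) {
      return std::make_tuple(U.Regs.size(), U.Formulae.size());
    };
    return Key(L) > Key(R); // descending key, larger uses first
  });
}

// Idea 2: filter formulas by how many required registers they contain,
// starting from the strictest threshold and relaxing it one register at a
// time until at least one formula survives, instead of rejecting all of
// them and giving up.
std::vector<Formula>
filterWithRelaxation(const std::vector<Formula> &Formulae,
                     const std::set<int> &ReqRegs) {
  for (std::size_t Threshold = ReqRegs.size();; --Threshold) {
    std::vector<Formula> Kept;
    for (const Formula &F : Formulae) {
      std::size_t Found = 0;
      for (int Reg : ReqRegs)
        Found += F.count(Reg);
      if (Found >= Threshold)
        Kept.push_back(F);
    }
    if (!Kept.empty() || Threshold == 0) // Threshold 0 accepts everything
      return Kept;
  }
}

int main() {
  // The example from the description: two formulas with two registers each,
  // but only one required register apiece. Demanding both required registers
  // {1, 2} rejects everything; relaxing to one register keeps both formulas.
  std::vector<Formula> Formulae = {{1, 3}, {2, 4}};
  std::set<int> ReqRegs = {1, 2};
  for (const Formula &F : filterWithRelaxation(Formulae, ReqRegs)) {
    std::cout << "kept formula:";
    for (int Reg : F)
      std::cout << ' ' << Reg;
    std::cout << '\n';
  }

  // Ordering demo: the use with more registers and formulae sorts first.
  Use Small, Big;
  Small.Formulae = {{5}};
  Small.Regs = {5};
  Big.Formulae = Formulae;
  Big.Regs = {1, 2, 3, 4};
  std::vector<Use> Uses = {Small, Big};
  sortUses(Uses);
  std::cout << "first use after sorting has " << Uses.front().Regs.size()
            << " registers\n";
}
```

In the patch itself the relaxation is driven by the NumReqRegsToIgnore counter inside SolveRecurse, and the sort key additionally includes the use kind, offsets, AllFixupsOutsideLoop and RigidFormula (see the diff below).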


github-actions bot commented Jul 8, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Jul 8, 2025

@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-backend-webassembly
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-nvptx

@llvm/pr-subscribers-llvm-transforms

Author: Sergey Shcherbinin (SergeyShch01)

Changes


Patch is 246.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147588.diff

54 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp (+108-44)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/machine-combiner-copy.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll (+10-10)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+48-50)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+128-130)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+46-46)
  • (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+44-44)
  • (modified) llvm/test/CodeGen/ARM/2011-03-15-LdStMultipleBug.ll (+4-6)
  • (modified) llvm/test/CodeGen/ARM/dsp-loop-indexing.ll (+6-6)
  • (modified) llvm/test/CodeGen/ARM/fpclamptosat.ll (+34-26)
  • (modified) llvm/test/CodeGen/ARM/loop-indexing.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/load-with-non-coherent-cache.ll (+2-2)
  • (modified) llvm/test/CodeGen/PowerPC/lsr-profitable-chain.ll (+32-32)
  • (modified) llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll (+56-56)
  • (modified) llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll (+10-7)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-store-asm.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll (+44-44)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tail-data-types.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-blockplacement.ll (+17-17)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float16regloops.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/mve-memtp-loop.ll (+27-27)
  • (modified) llvm/test/CodeGen/WebAssembly/unrolled-mem-indices.ll (+84-77)
  • (modified) llvm/test/CodeGen/X86/apx/check-nf-in-suppress-reloc-pass.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx512vnni-combine.ll (+33-25)
  • (modified) llvm/test/CodeGen/X86/avxvnni-combine.ll (+138-106)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+37-32)
  • (modified) llvm/test/CodeGen/X86/loop-strength-reduce7.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/masked-iv-safe.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/optimize-max-0.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/pr42565.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/ragreedy-hoist-spill.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AArch64/vscale-fixups.ll (+42-112)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AMDGPU/atomics.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AMDGPU/different-addrspace-addressing-mode-loops.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AMDGPU/lsr-postinc-pos-addrspace.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/ARM/complexity.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/ARM/illegal-addr-modes.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/RISCV/many-geps.ll (+15-15)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll (+10-10)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/missing-phi-operand-update.ll (+12-12)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/nested-loop.ll (+9-9)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/normalization-during-scev-expansion.ll (+24-24)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/postinc-iv-used-by-urem-and-udiv.ll (+23-24)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/pr62660-normalization-failure.ll (+6-7)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/depth-limit-overrun.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/ivincs-hoist.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/lsr-overflow.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/lsr-term-fold-negative-testcase.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/uglygep.ll (+13-19)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/wrong-hoisting-iv.ll (+22-22)
diff --git a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
index 845afa6d4228b..7666950529583 100644
--- a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
@@ -2229,6 +2229,7 @@ class LSRInstance {
   void NarrowSearchSpaceByDeletingCostlyFormulas();
   void NarrowSearchSpaceByPickingWinnerRegs();
   void NarrowSearchSpaceUsingHeuristics();
+  bool SortLSRUses();
 
   void SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
                     Cost &SolutionCost,
@@ -5368,6 +5369,46 @@ void LSRInstance::NarrowSearchSpaceUsingHeuristics() {
     NarrowSearchSpaceByPickingWinnerRegs();
 }
 
+/// Sort LSRUses to address side effects of compile time optimization done in
+/// SolveRecurse which filters out formulae not including required registers.
+/// Such optimization makes the found best solution sensitive to the order
+/// of LSRUses processing, hence it's important to ensure that that order
+/// isn't random to avoid fluctuations and sub-optimal results.
+///
+/// Also check that all LSRUses have formulae as otherwise the situation is
+/// unsolvable.
+bool LSRInstance::SortLSRUses() {
+  SmallVector<LSRUse *, 16> NewOrder;
+  for (auto &LU : Uses) {
+    if (!LU.Formulae.size()) {
+      return false;
+    }
+    NewOrder.push_back(&LU);
+  }
+
+  std::stable_sort(
+      NewOrder.begin(), NewOrder.end(), [](const LSRUse *L, const LSRUse *R) {
+        auto CalcKey = [](const LSRUse *LU) {
+          // LSRUses w/ many registers and formulae go first avoid too big
+          // reduction of considered solutions count
+          return std::tuple(LU->Regs.size(), LU->Formulae.size(), LU->Kind,
+                            LU->MinOffset.getKnownMinValue(),
+                            LU->MaxOffset.getKnownMinValue(),
+                            LU->AllFixupsOutsideLoop, LU->RigidFormula);
+        };
+        return CalcKey(L) > CalcKey(R);
+      });
+
+  SmallVector<LSRUse, 4> NewUses;
+  for (LSRUse *LU : NewOrder)
+    NewUses.push_back(std::move(*LU));
+  Uses = std::move(NewUses);
+
+  LLVM_DEBUG(dbgs() << "\nAfter sorting:\n"; print_uses(dbgs()));
+
+  return true;
+}
+
 /// This is the recursive solver.
 void LSRInstance::SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
                                Cost &SolutionCost,
@@ -5387,6 +5428,10 @@ void LSRInstance::SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
 
   const LSRUse &LU = Uses[Workspace.size()];
 
+  assert(LU.Formulae.size() &&
+         "LSRUse w/o formulae leads to unsolvable situation so it"
+         "shouldn't be here");
+
   // If this use references any register that's already a part of the
   // in-progress solution, consider it a requirement that a formula must
   // reference that register in order to be considered. This prunes out
@@ -5398,54 +5443,69 @@ void LSRInstance::SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
 
   SmallPtrSet<const SCEV *, 16> NewRegs;
   Cost NewCost(L, SE, TTI, AMK);
-  for (const Formula &F : LU.Formulae) {
-    // Ignore formulae which may not be ideal in terms of register reuse of
-    // ReqRegs.  The formula should use all required registers before
-    // introducing new ones.
-    // This can sometimes (notably when trying to favour postinc) lead to
-    // sub-optimial decisions. There it is best left to the cost modelling to
-    // get correct.
-    if (AMK != TTI::AMK_PostIndexed || LU.Kind != LSRUse::Address) {
-      int NumReqRegsToFind = std::min(F.getNumRegs(), ReqRegs.size());
-      for (const SCEV *Reg : ReqRegs) {
-        if ((F.ScaledReg && F.ScaledReg == Reg) ||
-            is_contained(F.BaseRegs, Reg)) {
-          --NumReqRegsToFind;
-          if (NumReqRegsToFind == 0)
-            break;
+  bool FormulaeTested = false;
+  unsigned NumReqRegsToIgnore = 0;
+
+  while (!FormulaeTested) {
+    assert(
+        !NumReqRegsToIgnore ||
+        NumReqRegsToIgnore < ReqRegs.size() &&
+            "at least one formulae should have at least one required register");
+
+    for (const Formula &F : LU.Formulae) {
+      // ReqRegs. The formula should use required registers before
+      // introducing new ones. Firstly try the most aggressive option
+      // (when maximum of required registers are used) and then gradually make
+      // it weaker if all formulae don't satisfy this requirement.
+      //
+      // This can sometimes (notably when trying to favour postinc) lead to
+      // sub-optimal decisions. There it is best left to the cost modeling to
+      // get correct.
+      if (ReqRegs.size() &&
+          (AMK != TTI::AMK_PostIndexed || LU.Kind != LSRUse::Address)) {
+        unsigned NumReqRegsToFind = std::min(F.getNumRegs(), ReqRegs.size());
+        bool ReqRegsFound = false;
+        for (const SCEV *Reg : ReqRegs) {
+          if ((F.ScaledReg && F.ScaledReg == Reg) ||
+              is_contained(F.BaseRegs, Reg)) {
+            ReqRegsFound = true;
+            if (--NumReqRegsToFind == NumReqRegsToIgnore)
+              break;
+          }
+        }
+        if (!ReqRegsFound || NumReqRegsToFind != NumReqRegsToIgnore) {
+          continue;
         }
       }
-      if (NumReqRegsToFind != 0) {
-        // If none of the formulae satisfied the required registers, then we could
-        // clear ReqRegs and try again. Currently, we simply give up in this case.
-        continue;
-      }
-    }
 
-    // Evaluate the cost of the current formula. If it's already worse than
-    // the current best, prune the search at that point.
-    NewCost = CurCost;
-    NewRegs = CurRegs;
-    NewCost.RateFormula(F, NewRegs, VisitedRegs, LU);
-    if (NewCost.isLess(SolutionCost)) {
-      Workspace.push_back(&F);
-      if (Workspace.size() != Uses.size()) {
-        SolveRecurse(Solution, SolutionCost, Workspace, NewCost,
-                     NewRegs, VisitedRegs);
-        if (F.getNumRegs() == 1 && Workspace.size() == 1)
-          VisitedRegs.insert(F.ScaledReg ? F.ScaledReg : F.BaseRegs[0]);
-      } else {
-        LLVM_DEBUG(dbgs() << "New best at "; NewCost.print(dbgs());
-                   dbgs() << ".\nRegs:\n";
-                   for (const SCEV *S : NewRegs) dbgs()
-                      << "- " << *S << "\n";
-                   dbgs() << '\n');
-
-        SolutionCost = NewCost;
-        Solution = Workspace;
+      // Evaluate the cost of the current formula. If it's already worse than
+      // the current best, prune the search at that point.
+      FormulaeTested = true;
+      NewCost = CurCost;
+      NewRegs = CurRegs;
+      NewCost.RateFormula(F, NewRegs, VisitedRegs, LU);
+      if (NewCost.isLess(SolutionCost)) {
+        Workspace.push_back(&F);
+        if (Workspace.size() != Uses.size()) {
+          SolveRecurse(Solution, SolutionCost, Workspace, NewCost, NewRegs,
+                       VisitedRegs);
+          if (F.getNumRegs() == 1 && Workspace.size() == 1)
+            VisitedRegs.insert(F.ScaledReg ? F.ScaledReg : F.BaseRegs[0]);
+        } else {
+          LLVM_DEBUG(dbgs() << "New best at "; NewCost.print(dbgs());
+                     dbgs() << ".\nRegs:\n";
+                     for (const SCEV *S : NewRegs) dbgs() << "- " << *S << "\n";
+                     dbgs() << '\n');
+          SolutionCost = NewCost;
+          Solution = Workspace;
+        }
+        Workspace.pop_back();
       }
-      Workspace.pop_back();
     }
+
+    // none of formulae has necessary number of required registers - then
+    // make the requirement weaker
+    NumReqRegsToIgnore++;
   }
 }
 
@@ -6180,7 +6240,11 @@ LSRInstance::LSRInstance(Loop *L, IVUsers &IU, ScalarEvolution &SE,
   NarrowSearchSpaceUsingHeuristics();
 
   SmallVector<const Formula *, 8> Solution;
-  Solve(Solution);
+  bool LSRUsesConsistent = SortLSRUses();
+
+  if (LSRUsesConsistent) {
+    Solve(Solution);
+  }
 
   // Release memory that is no longer needed.
   Factors.clear();
diff --git a/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll b/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll
index 99e0d06bf4218..b54492b38ea79 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll
@@ -16,14 +16,14 @@ define i32 @a(i32 %x, ptr nocapture readonly %y, ptr nocapture readonly %z) {
 ; CHECK-IMPLICIT:    .p2align 5
 ; CHECK-NEXT:  .LBB0_8: // %for.body
 ; CHECK-OBJ;Disassembly of section .text:
-; CHECK-OBJ:               88: 8b0a002a      add
+; CHECK-OBJ:               88: 8b0b002b      add
 ; CHECK-OBJ-IMPLICIT-NEXT: 8c: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 90: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 94: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 98: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 9c: d503201f      nop
-; CHECK-OBJ-IMPLICIT-NEXT: a0: b840454b      ldr
-; CHECK-OBJ-EXPLICIT-NEXT: 8c: b840454b      ldr
+; CHECK-OBJ-IMPLICIT-NEXT: a0: b8404569      ldr
+; CHECK-OBJ-EXPLICIT-NEXT: 8c: b8404569      ldr
 entry:
   %cmp10 = icmp sgt i32 %x, 0
   br i1 %cmp10, label %for.body.preheader, label %for.cond.cleanup
diff --git a/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll b/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll
index 4c8e589391c3a..8ffdc1e8dcb41 100644
--- a/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll
+++ b/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll
@@ -33,17 +33,17 @@ define void @fma_dup_f16(ptr noalias nocapture noundef readonly %A, half noundef
 ; CHECK-NEXT:    cmp x9, x8
 ; CHECK-NEXT:    b.eq .LBB0_8
 ; CHECK-NEXT:  .LBB0_6: // %for.body.preheader1
-; CHECK-NEXT:    lsl x10, x9, #1
+; CHECK-NEXT:    lsl x11, x9, #1
 ; CHECK-NEXT:    sub x8, x8, x9
-; CHECK-NEXT:    add x9, x1, x10
-; CHECK-NEXT:    add x10, x0, x10
+; CHECK-NEXT:    add x10, x1, x11
+; CHECK-NEXT:    add x11, x0, x11
 ; CHECK-NEXT:  .LBB0_7: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ldr h1, [x10], #2
-; CHECK-NEXT:    ldr h2, [x9]
+; CHECK-NEXT:    ldr h1, [x11], #2
+; CHECK-NEXT:    ldr h2, [x10]
 ; CHECK-NEXT:    subs x8, x8, #1
 ; CHECK-NEXT:    fmadd h1, h1, h0, h2
-; CHECK-NEXT:    str h1, [x9], #2
+; CHECK-NEXT:    str h1, [x10], #2
 ; CHECK-NEXT:    b.ne .LBB0_7
 ; CHECK-NEXT:  .LBB0_8: // %for.cond.cleanup
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
index f6bbdf5d95d87..cb281518d1fd5 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
@@ -36,23 +36,23 @@ define void @foo(i32 noundef %limit, ptr %out, ptr %y) {
 ; CHECK-NEXT:  .LBB0_5: // %vector.ph
 ; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
 ; CHECK-NEXT:    dup v0.8h, w15
-; CHECK-NEXT:    mov x16, x14
-; CHECK-NEXT:    mov x17, x13
-; CHECK-NEXT:    mov x18, x12
+; CHECK-NEXT:    mov x16, x12
+; CHECK-NEXT:    mov x17, x14
+; CHECK-NEXT:    mov x18, x13
 ; CHECK-NEXT:  .LBB0_6: // %vector.body
 ; CHECK-NEXT:    // Parent Loop BB0_3 Depth=1
 ; CHECK-NEXT:    // => This Inner Loop Header: Depth=2
-; CHECK-NEXT:    ldp q1, q4, [x16, #-16]
-; CHECK-NEXT:    subs x18, x18, #16
-; CHECK-NEXT:    ldp q3, q2, [x17, #-32]
-; CHECK-NEXT:    add x16, x16, #32
-; CHECK-NEXT:    ldp q6, q5, [x17]
+; CHECK-NEXT:    ldp q1, q4, [x17, #-16]
+; CHECK-NEXT:    subs x16, x16, #16
+; CHECK-NEXT:    ldp q3, q2, [x18, #-32]
+; CHECK-NEXT:    add x17, x17, #32
+; CHECK-NEXT:    ldp q6, q5, [x18]
 ; CHECK-NEXT:    smlal2 v2.4s, v0.8h, v1.8h
 ; CHECK-NEXT:    smlal v3.4s, v0.4h, v1.4h
 ; CHECK-NEXT:    smlal2 v5.4s, v0.8h, v4.8h
 ; CHECK-NEXT:    smlal v6.4s, v0.4h, v4.4h
-; CHECK-NEXT:    stp q3, q2, [x17, #-32]
-; CHECK-NEXT:    stp q6, q5, [x17], #64
+; CHECK-NEXT:    stp q3, q2, [x18, #-32]
+; CHECK-NEXT:    stp q6, q5, [x18], #64
 ; CHECK-NEXT:    b.ne .LBB0_6
 ; CHECK-NEXT:  // %bb.7: // %middle.block
 ; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
diff --git a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
index 2a37183c47d51..bc78558ec398d 100644
--- a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
+++ b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
@@ -1666,15 +1666,15 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-NEXT:    ldr q0, [x8, lCPI17_0@PAGEOFF]
 ; CHECK-NEXT:  Lloh21:
 ; CHECK-NEXT:    ldr q1, [x9, lCPI17_1@PAGEOFF]
-; CHECK-NEXT:    add x8, x1, #64
-; CHECK-NEXT:    add x9, x0, #8
+; CHECK-NEXT:    add x8, x0, #8
+; CHECK-NEXT:    add x9, x1, #64
 ; CHECK-NEXT:  LBB17_1: ; %loop
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ldp d2, d3, [x9, #-8]
+; CHECK-NEXT:    ldp d2, d3, [x8, #-8]
 ; CHECK-NEXT:    subs x10, x10, #16
-; CHECK-NEXT:    ldp q7, q5, [x8, #-32]
-; CHECK-NEXT:    add x9, x9, #16
-; CHECK-NEXT:    ldp q17, q6, [x8, #-64]
+; CHECK-NEXT:    ldp q7, q5, [x9, #-32]
+; CHECK-NEXT:    add x8, x8, #16
+; CHECK-NEXT:    ldp q17, q6, [x9, #-64]
 ; CHECK-NEXT:    tbl.16b v4, { v2 }, v1
 ; CHECK-NEXT:    tbl.16b v2, { v2 }, v0
 ; CHECK-NEXT:    tbl.16b v16, { v3 }, v1
@@ -1682,17 +1682,17 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-NEXT:    uaddw2.2d v5, v5, v4
 ; CHECK-NEXT:    uaddw2.2d v6, v6, v2
 ; CHECK-NEXT:    uaddw.2d v4, v7, v4
-; CHECK-NEXT:    ldp q18, q7, [x8, #32]
+; CHECK-NEXT:    ldp q18, q7, [x9, #32]
 ; CHECK-NEXT:    uaddw.2d v2, v17, v2
-; CHECK-NEXT:    stp q4, q5, [x8, #-32]
+; CHECK-NEXT:    stp q4, q5, [x9, #-32]
 ; CHECK-NEXT:    uaddw2.2d v5, v7, v16
-; CHECK-NEXT:    stp q2, q6, [x8, #-64]
+; CHECK-NEXT:    stp q2, q6, [x9, #-64]
 ; CHECK-NEXT:    uaddw.2d v16, v18, v16
-; CHECK-NEXT:    ldp q7, q6, [x8]
-; CHECK-NEXT:    stp q16, q5, [x8, #32]
+; CHECK-NEXT:    ldp q7, q6, [x9]
+; CHECK-NEXT:    stp q16, q5, [x9, #32]
 ; CHECK-NEXT:    uaddw2.2d v4, v6, v3
 ; CHECK-NEXT:    uaddw.2d v2, v7, v3
-; CHECK-NEXT:    stp q2, q4, [x8], #128
+; CHECK-NEXT:    stp q2, q4, [x9], #128
 ; CHECK-NEXT:    b.ne LBB17_1
 ; CHECK-NEXT:  ; %bb.2: ; %exit
 ; CHECK-NEXT:    ret
@@ -1708,31 +1708,31 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-BE-NEXT:    adrp x9, .LCPI17_1
 ; CHECK-BE-NEXT:    add x9, x9, :lo12:.LCPI17_1
 ; CHECK-BE-NEXT:    ld1 { v1.16b }, [x9]
-; CHECK-BE-NEXT:    add x9, x1, #64
-; CHECK-BE-NEXT:    add x10, x0, #8
+; CHECK-BE-NEXT:    add x9, x0, #8
+; CHECK-BE-NEXT:    add x10, x1, #64
 ; CHECK-BE-NEXT:  .LBB17_1: // %loop
 ; CHECK-BE-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-BE-NEXT:    ld1 { v2.8b }, [x10]
-; CHECK-BE-NEXT:    sub x11, x10, #8
-; CHECK-BE-NEXT:    add x15, x9, #32
+; CHECK-BE-NEXT:    ld1 { v2.8b }, [x9]
+; CHECK-BE-NEXT:    sub x11, x9, #8
+; CHECK-BE-NEXT:    add x15, x10, #32
 ; CHECK-BE-NEXT:    ld1 { v3.8b }, [x11]
 ; CHECK-BE-NEXT:    ld1 { v16.2d }, [x15]
-; CHECK-BE-NEXT:    sub x11, x9, #64
-; CHECK-BE-NEXT:    sub x12, x9, #32
-; CHECK-BE-NEXT:    ld1 { v6.2d }, [x9]
+; CHECK-BE-NEXT:    sub x11, x10, #64
+; CHECK-BE-NEXT:    sub x12, x10, #32
+; CHECK-BE-NEXT:    ld1 { v6.2d }, [x10]
 ; CHECK-BE-NEXT:    ld1 { v21.2d }, [x11]
 ; CHECK-BE-NEXT:    tbl v4.16b, { v2.16b }, v1.16b
 ; CHECK-BE-NEXT:    tbl v2.16b, { v2.16b }, v0.16b
 ; CHECK-BE-NEXT:    ld1 { v19.2d }, [x12]
 ; CHECK-BE-NEXT:    tbl v5.16b, { v3.16b }, v1.16b
 ; CHECK-BE-NEXT:    tbl v3.16b, { v3.16b }, v0.16b
-; CHECK-BE-NEXT:    sub x13, x9, #16
-; CHECK-BE-NEXT:    sub x14, x9, #48
-; CHECK-BE-NEXT:    add x16, x9, #48
-; CHECK-BE-NEXT:    add x17, x9, #16
+; CHECK-BE-NEXT:    sub x13, x10, #16
+; CHECK-BE-NEXT:    sub x14, x10, #48
+; CHECK-BE-NEXT:    add x16, x10, #48
+; CHECK-BE-NEXT:    add x17, x10, #16
 ; CHECK-BE-NEXT:    ld1 { v22.2d }, [x13]
 ; CHECK-BE-NEXT:    subs x8, x8, #16
-; CHECK-BE-NEXT:    add x10, x10, #16
+; CHECK-BE-NEXT:    add x9, x9, #16
 ; CHECK-BE-NEXT:    rev32 v7.8b, v4.8b
 ; CHECK-BE-NEXT:    ext v4.16b, v4.16b, v4.16b, #8
 ; CHECK-BE-NEXT:    rev32 v17.8b, v2.8b
@@ -1753,8 +1753,8 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-BE-NEXT:    uaddw v3.2d, v21.2d, v3.2s
 ; CHECK-BE-NEXT:    st1 { v7.2d }, [x15]
 ; CHECK-BE-NEXT:    ld1 { v7.2d }, [x17]
-; CHECK-BE-NEXT:    st1 { v6.2d }, [x9]
-; CHECK-BE-NEXT:    add x9, x9, #128
+; CHECK-BE-NEXT:    st1 { v6.2d }, [x10]
+; CHECK-BE-NEXT:    add x10, x10, #128
 ; CHECK-BE-NEXT:    uaddw v4.2d, v16.2d, v4.2s
 ; CHECK-BE-NEXT:    st1 { v5.2d }, [x12]
 ; CHECK-BE-NEXT:    uaddw v5.2d, v22.2d, v17.2s
diff --git a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
index ecbf5dfeb3af1..d0a6255cd0395 100644
--- a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
+++ b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
@@ -24,27 +24,27 @@ define amdgpu_kernel void @udiv32_invariant_denom(ptr addrspace(1) nocapture %ar
 ; GFX9-NEXT:    s_mov_b64 s[4:5], 0
 ; GFX9-NEXT:  .LBB0_1: ; %bb3
 ; GFX9-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX9-NEXT:    s_not_b32 s10, s5
-; GFX9-NEXT:    s_mul_i32 s9, s6, s5
+; GFX9-NEXT:    s_not_b32 s10, s3
+; GFX9-NEXT:    s_mul_i32 s9, s6, s3
 ; GFX9-NEXT:    s_mul_i32 s10, s6, s10
-; GFX9-NEXT:    s_add_i32 s11, s5, 1
+; GFX9-NEXT:    s_add_i32 s11, s3, 1
 ; GFX9-NEXT:    s_sub_i32 s9, s7, s9
 ; GFX9-NEXT:    s_add_i32 s10, s7, s10
 ; GFX9-NEXT:    s_cmp_ge_u32 s9, s6
-; GFX9-NEXT:    s_cselect_b32 s11, s11, s5
+; GFX9-NEXT:    s_cselect_b32 s11, s11, s3
 ; GFX9-NEXT:    s_cselect_b32 s9, s10, s9
 ; GFX9-NEXT:    s_add_i32 s10, s11, 1
 ; GFX9-NEXT:    s_cmp_ge_u32 s9, s6
 ; GFX9-NEXT:    s_cselect_b32 s9, s10, s11
-; GFX9-NEXT:    s_add_u32 s10, s0, s2
-; GFX9-NEXT:    s_addc_u32 s11, s1, s3
+; GFX9-NEXT:    s_add_u32 s10, s0, s4
+; GFX9-NEXT:    s_addc_u32 s11, s1, s5
 ; GFX9-NEXT:    s_add_i32 s7, s7, 1
-; GFX9-NEXT:    s_add_u32 s4, s4, s8
+; GFX9-NEXT:    s_add_u32 s4, s4, 4
 ; GFX9-NEXT:    s_addc_u32 s5, s5, 0
-; GFX9-NEXT:    s_add_u32 s2, s2, 4
+; GFX9-NEXT:    s_add_u32 s2, s2, s8
 ; GFX9-NEXT:    s_addc_u32 s3, s3, 0
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s9
-; GFX9-NEXT:    s_cmpk_eq_i32 s2, 0x1000
+; GFX9-NEXT:    s_cmpk_eq_i32 s4, 0x1000
 ; GFX9-NEXT:    global_store_dword v0, v1, s[10:11]
 ; GFX9-NEXT:    s_cbranch_scc0 .LBB0_1
 ; GFX9-NEXT:  ; %bb.2: ; %bb2
@@ -72,27 +72,27 @@ define amdgpu_kernel void @udiv32_invariant_denom(ptr addrspace(1) nocapture %ar
 ; GFX10-NEXT:  .LBB0_1: ; %bb3
 ; GFX10-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10-NEXT:    s_waitcnt_depctr 0xffe3
-; GFX10-NEXT:    s_not_b32 s10, s5
-; GFX10-NEXT:    s_mul_i32 s9, s6, s5
+; GFX10-NEXT:    s_not_b32 s10, s3
+; GFX10-NEXT:    s_mul_i32 s9, s6, s3
 ; GFX10-NEXT:    s_mul_i32 s10, s6, s10
 ; GFX10-NEXT:    s_sub_i32 s9, s7, s9
-; GFX10-NEXT:    s_add_i32 s11, s5, 1
+; GFX10-NEXT:    s_add_i32 s11, s3, 1
 ; GFX10-NEXT:    s_add_i32 s10, s7, s10
 ; GFX10-NEXT:    s_cmp_ge_u32 s9, s6
-; GFX10-NEXT:    s_cselect_b32 s11, s11, s5
+; GFX10-NEXT:    s_cselect_b32 s11, s11, s3
 ; GFX10-NEXT:    s_cselect_b32 s9, s10, s9
 ; GFX10-NEXT:    s_add_i32 s10, s11, 1
 ; GFX10-NEXT:    s_cmp_ge_u32 s9, s6
 ; GFX10-NEXT:    s_cselect_b32 s9, s10, s11
-; GFX10-NEXT:    s_add_u32 s10, s0, s2
-; GFX10-NEXT:    s_addc_u32 s11, s1, s3
+; GFX10-NEXT:    s_add_u32 s10, s0, s4
+; GFX10-NEXT:    s_addc_u32 s11, s1, s5
 ; GFX10-NEXT:    s_add_i32 s7, s7, 1
-; GFX10-NEXT:    s_add_u32 s4, s4, s8
+; GFX10-NEXT:    s_add_u32 s4, s4, 4
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s9
 ; GFX10-NEXT:    s_addc_u32 s5, s5, 0
-; GFX10-NEXT:    s_add_u32 s2, s2, 4
+; GFX10-NEXT:    s_add_u32 s2, s2, s8
 ; GFX10-NEXT:    s_addc_u32 s3, s3, 0
-; GFX10-NEXT:    s_cmpk_eq_i32 s2, 0x1000
+; GFX10-NEXT:    s_cmpk_eq_i32 s4, 0x1000
 ; GFX10-NEXT:    global_store_dword v0, v1, s[10:11]
 ; GFX10-NEXT:    s_cbranch_scc0 .LBB0_1
 ; GFX10-NEXT:  ; %bb.2: ; %bb2
@@ -123,28 +123,27 @@ define amdgpu_kernel void @udiv32_invariant_denom(ptr addrspace(1) nocapture %ar
 ; GFX11-NEXT:    .p2align 6
 ; GFX11-NEXT:  .LBB0_1: ; %bb3
 ; GFX11-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_not_b32 s10, s5
-; GFX11-NEXT:    s_mul_i32 s9, s6, s5
+; GFX11-NEXT:    s_not_b32 s10, s3
+; GFX11-NEXT:    s_mul_i32 s9, s6, s3
 ; GFX11-NEXT:    s_mul_i32 s10, s6, s10
 ; GFX11-NEXT:    s_sub_i32 s9, s7, s9
-; GFX11-NEXT:    s_add_i32 s11, s5, 1
+; GFX11-NEXT:    s_ad...
[truncated]

@llvmbot
Member

llvmbot commented Jul 8, 2025

@llvm/pr-subscribers-backend-powerpc


@llvmbot
Member

llvmbot commented Jul 8, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Sergey Shcherbinin (SergeyShch01)

Changes

This optimization rejects formulas not containing registers already included into current solution and also included into other formulas of current LSRUse. It really helps to reduce compilation time but also has side effects which are mitigated/fixed as follows:

  • The best found solution becomes dependent on the LSRUses ordering since at any given time the registers included into the current solution (and used to filter formulas) are determined by processed LSRUses for which formulas to be included into current solution are already chosen. So the decisions made for LSRUses which are processed earlier directly affect the decisions made for LSRUses processed later. According to the current implementation the LSRUses ordering seems to be random as it's determined by the order of IV instructions' users traversal which can be assumed random. Besides directly affecting the performance of the generated code (which is randomized), this can also lead to performance fluctuations when the order of IV instructions' users traversal changes for some reason (in fact, it can even be different between clang and llc runs). In order to minimize the described effect given patch introduces LSRUses sorting which establishes more-or-less stable LSRUses ordering (fluctuations are still possible for LSRUses which look identical). This change leads to loop_reduce work changes in many regression tests which are updated accordingly.

  • The formula filtering in SolveRecurse rejects all formulas that do not have the maximum number of required registers, although it is possible that none of the formulas do (for example, it is possible to have two formulas with two registers and one required register each - then all formulas will be rejected because they do not have two required ones). The change fixes this by gradually decreasing the number of required registers if all formulas are rejected.

The change also introduces an early return if LSRUse w/o formulas is found as that is an unsolvable case.


Patch is 246.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147588.diff

54 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp (+108-44)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/machine-combiner-copy.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll (+10-10)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+48-50)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+128-130)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+46-46)
  • (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+44-44)
  • (modified) llvm/test/CodeGen/ARM/2011-03-15-LdStMultipleBug.ll (+4-6)
  • (modified) llvm/test/CodeGen/ARM/dsp-loop-indexing.ll (+6-6)
  • (modified) llvm/test/CodeGen/ARM/fpclamptosat.ll (+34-26)
  • (modified) llvm/test/CodeGen/ARM/loop-indexing.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/load-with-non-coherent-cache.ll (+2-2)
  • (modified) llvm/test/CodeGen/PowerPC/lsr-profitable-chain.ll (+32-32)
  • (modified) llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll (+56-56)
  • (modified) llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll (+10-7)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-store-asm.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll (+44-44)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tail-data-types.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-blockplacement.ll (+17-17)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float16regloops.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/mve-memtp-loop.ll (+27-27)
  • (modified) llvm/test/CodeGen/WebAssembly/unrolled-mem-indices.ll (+84-77)
  • (modified) llvm/test/CodeGen/X86/apx/check-nf-in-suppress-reloc-pass.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx512vnni-combine.ll (+33-25)
  • (modified) llvm/test/CodeGen/X86/avxvnni-combine.ll (+138-106)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+37-32)
  • (modified) llvm/test/CodeGen/X86/loop-strength-reduce7.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/masked-iv-safe.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/optimize-max-0.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/pr42565.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/ragreedy-hoist-spill.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AArch64/vscale-fixups.ll (+42-112)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AMDGPU/atomics.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AMDGPU/different-addrspace-addressing-mode-loops.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AMDGPU/lsr-postinc-pos-addrspace.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/ARM/complexity.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/ARM/illegal-addr-modes.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/RISCV/many-geps.ll (+15-15)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll (+10-10)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/missing-phi-operand-update.ll (+12-12)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/nested-loop.ll (+9-9)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/normalization-during-scev-expansion.ll (+24-24)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/postinc-iv-used-by-urem-and-udiv.ll (+23-24)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/pr62660-normalization-failure.ll (+6-7)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/depth-limit-overrun.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/ivincs-hoist.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/lsr-overflow.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/lsr-term-fold-negative-testcase.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/uglygep.ll (+13-19)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/wrong-hoisting-iv.ll (+22-22)
diff --git a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
index 845afa6d4228b..7666950529583 100644
--- a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
@@ -2229,6 +2229,7 @@ class LSRInstance {
   void NarrowSearchSpaceByDeletingCostlyFormulas();
   void NarrowSearchSpaceByPickingWinnerRegs();
   void NarrowSearchSpaceUsingHeuristics();
+  bool SortLSRUses();
 
   void SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
                     Cost &SolutionCost,
@@ -5368,6 +5369,46 @@ void LSRInstance::NarrowSearchSpaceUsingHeuristics() {
     NarrowSearchSpaceByPickingWinnerRegs();
 }
 
+/// Sort LSRUses to address side effects of compile time optimization done in
+/// SolveRecurse which filters out formulae not including required registers.
+/// Such optimization makes the found best solution sensitive to the order
+/// of LSRUses processing, hence it's important to ensure that that order
+/// isn't random to avoid fluctuations and sub-optimal results.
+///
+/// Also check that all LSRUses have formulae as otherwise the situation is
+/// unsolvable.
+bool LSRInstance::SortLSRUses() {
+  SmallVector<LSRUse *, 16> NewOrder;
+  for (auto &LU : Uses) {
+    if (!LU.Formulae.size()) {
+      return false;
+    }
+    NewOrder.push_back(&LU);
+  }
+
+  std::stable_sort(
+      NewOrder.begin(), NewOrder.end(), [](const LSRUse *L, const LSRUse *R) {
+        auto CalcKey = [](const LSRUse *LU) {
+          // LSRUses w/ many registers and formulae go first avoid too big
+          // reduction of considered solutions count
+          return std::tuple(LU->Regs.size(), LU->Formulae.size(), LU->Kind,
+                            LU->MinOffset.getKnownMinValue(),
+                            LU->MaxOffset.getKnownMinValue(),
+                            LU->AllFixupsOutsideLoop, LU->RigidFormula);
+        };
+        return CalcKey(L) > CalcKey(R);
+      });
+
+  SmallVector<LSRUse, 4> NewUses;
+  for (LSRUse *LU : NewOrder)
+    NewUses.push_back(std::move(*LU));
+  Uses = std::move(NewUses);
+
+  LLVM_DEBUG(dbgs() << "\nAfter sorting:\n"; print_uses(dbgs()));
+
+  return true;
+}
+
 /// This is the recursive solver.
 void LSRInstance::SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
                                Cost &SolutionCost,
@@ -5387,6 +5428,10 @@ void LSRInstance::SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
 
   const LSRUse &LU = Uses[Workspace.size()];
 
+  assert(LU.Formulae.size() &&
+         "LSRUse w/o formulae leads to unsolvable situation so it"
+         "shouldn't be here");
+
   // If this use references any register that's already a part of the
   // in-progress solution, consider it a requirement that a formula must
   // reference that register in order to be considered. This prunes out
@@ -5398,54 +5443,69 @@ void LSRInstance::SolveRecurse(SmallVectorImpl<const Formula *> &Solution,
 
   SmallPtrSet<const SCEV *, 16> NewRegs;
   Cost NewCost(L, SE, TTI, AMK);
-  for (const Formula &F : LU.Formulae) {
-    // Ignore formulae which may not be ideal in terms of register reuse of
-    // ReqRegs.  The formula should use all required registers before
-    // introducing new ones.
-    // This can sometimes (notably when trying to favour postinc) lead to
-    // sub-optimial decisions. There it is best left to the cost modelling to
-    // get correct.
-    if (AMK != TTI::AMK_PostIndexed || LU.Kind != LSRUse::Address) {
-      int NumReqRegsToFind = std::min(F.getNumRegs(), ReqRegs.size());
-      for (const SCEV *Reg : ReqRegs) {
-        if ((F.ScaledReg && F.ScaledReg == Reg) ||
-            is_contained(F.BaseRegs, Reg)) {
-          --NumReqRegsToFind;
-          if (NumReqRegsToFind == 0)
-            break;
+  bool FormulaeTested = false;
+  unsigned NumReqRegsToIgnore = 0;
+
+  while (!FormulaeTested) {
+    assert(
+        !NumReqRegsToIgnore ||
+        NumReqRegsToIgnore < ReqRegs.size() &&
+            "at least one formulae should have at least one required register");
+
+    for (const Formula &F : LU.Formulae) {
+      // ReqRegs. The formula should use required registers before
+      // introducing new ones. Firstly try the most aggressive option
+      // (when maximum of required registers are used) and then gradually make
+      // it weaker if all formulae don't satisfy this requirement.
+      //
+      // This can sometimes (notably when trying to favour postinc) lead to
+      // sub-optimal decisions. There it is best left to the cost modeling to
+      // get correct.
+      if (ReqRegs.size() &&
+          (AMK != TTI::AMK_PostIndexed || LU.Kind != LSRUse::Address)) {
+        unsigned NumReqRegsToFind = std::min(F.getNumRegs(), ReqRegs.size());
+        bool ReqRegsFound = false;
+        for (const SCEV *Reg : ReqRegs) {
+          if ((F.ScaledReg && F.ScaledReg == Reg) ||
+              is_contained(F.BaseRegs, Reg)) {
+            ReqRegsFound = true;
+            if (--NumReqRegsToFind == NumReqRegsToIgnore)
+              break;
+          }
+        }
+        if (!ReqRegsFound || NumReqRegsToFind != NumReqRegsToIgnore) {
+          continue;
         }
       }
-      if (NumReqRegsToFind != 0) {
-        // If none of the formulae satisfied the required registers, then we could
-        // clear ReqRegs and try again. Currently, we simply give up in this case.
-        continue;
-      }
-    }
 
-    // Evaluate the cost of the current formula. If it's already worse than
-    // the current best, prune the search at that point.
-    NewCost = CurCost;
-    NewRegs = CurRegs;
-    NewCost.RateFormula(F, NewRegs, VisitedRegs, LU);
-    if (NewCost.isLess(SolutionCost)) {
-      Workspace.push_back(&F);
-      if (Workspace.size() != Uses.size()) {
-        SolveRecurse(Solution, SolutionCost, Workspace, NewCost,
-                     NewRegs, VisitedRegs);
-        if (F.getNumRegs() == 1 && Workspace.size() == 1)
-          VisitedRegs.insert(F.ScaledReg ? F.ScaledReg : F.BaseRegs[0]);
-      } else {
-        LLVM_DEBUG(dbgs() << "New best at "; NewCost.print(dbgs());
-                   dbgs() << ".\nRegs:\n";
-                   for (const SCEV *S : NewRegs) dbgs()
-                      << "- " << *S << "\n";
-                   dbgs() << '\n');
-
-        SolutionCost = NewCost;
-        Solution = Workspace;
+      // Evaluate the cost of the current formula. If it's already worse than
+      // the current best, prune the search at that point.
+      FormulaeTested = true;
+      NewCost = CurCost;
+      NewRegs = CurRegs;
+      NewCost.RateFormula(F, NewRegs, VisitedRegs, LU);
+      if (NewCost.isLess(SolutionCost)) {
+        Workspace.push_back(&F);
+        if (Workspace.size() != Uses.size()) {
+          SolveRecurse(Solution, SolutionCost, Workspace, NewCost, NewRegs,
+                       VisitedRegs);
+          if (F.getNumRegs() == 1 && Workspace.size() == 1)
+            VisitedRegs.insert(F.ScaledReg ? F.ScaledReg : F.BaseRegs[0]);
+        } else {
+          LLVM_DEBUG(dbgs() << "New best at "; NewCost.print(dbgs());
+                     dbgs() << ".\nRegs:\n";
+                     for (const SCEV *S : NewRegs) dbgs() << "- " << *S << "\n";
+                     dbgs() << '\n');
+          SolutionCost = NewCost;
+          Solution = Workspace;
+        }
+        Workspace.pop_back();
       }
-      Workspace.pop_back();
     }
+
+    // none of formulae has necessary number of required registers - then
+    // make the requirement weaker
+    NumReqRegsToIgnore++;
   }
 }
 
@@ -6180,7 +6240,11 @@ LSRInstance::LSRInstance(Loop *L, IVUsers &IU, ScalarEvolution &SE,
   NarrowSearchSpaceUsingHeuristics();
 
   SmallVector<const Formula *, 8> Solution;
-  Solve(Solution);
+  bool LSRUsesConsistent = SortLSRUses();
+
+  if (LSRUsesConsistent) {
+    Solve(Solution);
+  }
 
   // Release memory that is no longer needed.
   Factors.clear();
diff --git a/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll b/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll
index 99e0d06bf4218..b54492b38ea79 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll
@@ -16,14 +16,14 @@ define i32 @a(i32 %x, ptr nocapture readonly %y, ptr nocapture readonly %z) {
 ; CHECK-IMPLICIT:    .p2align 5
 ; CHECK-NEXT:  .LBB0_8: // %for.body
 ; CHECK-OBJ;Disassembly of section .text:
-; CHECK-OBJ:               88: 8b0a002a      add
+; CHECK-OBJ:               88: 8b0b002b      add
 ; CHECK-OBJ-IMPLICIT-NEXT: 8c: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 90: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 94: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 98: d503201f      nop
 ; CHECK-OBJ-IMPLICIT-NEXT: 9c: d503201f      nop
-; CHECK-OBJ-IMPLICIT-NEXT: a0: b840454b      ldr
-; CHECK-OBJ-EXPLICIT-NEXT: 8c: b840454b      ldr
+; CHECK-OBJ-IMPLICIT-NEXT: a0: b8404569      ldr
+; CHECK-OBJ-EXPLICIT-NEXT: 8c: b8404569      ldr
 entry:
   %cmp10 = icmp sgt i32 %x, 0
   br i1 %cmp10, label %for.body.preheader, label %for.cond.cleanup
diff --git a/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll b/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll
index 4c8e589391c3a..8ffdc1e8dcb41 100644
--- a/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll
+++ b/llvm/test/CodeGen/AArch64/machine-combiner-copy.ll
@@ -33,17 +33,17 @@ define void @fma_dup_f16(ptr noalias nocapture noundef readonly %A, half noundef
 ; CHECK-NEXT:    cmp x9, x8
 ; CHECK-NEXT:    b.eq .LBB0_8
 ; CHECK-NEXT:  .LBB0_6: // %for.body.preheader1
-; CHECK-NEXT:    lsl x10, x9, #1
+; CHECK-NEXT:    lsl x11, x9, #1
 ; CHECK-NEXT:    sub x8, x8, x9
-; CHECK-NEXT:    add x9, x1, x10
-; CHECK-NEXT:    add x10, x0, x10
+; CHECK-NEXT:    add x10, x1, x11
+; CHECK-NEXT:    add x11, x0, x11
 ; CHECK-NEXT:  .LBB0_7: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ldr h1, [x10], #2
-; CHECK-NEXT:    ldr h2, [x9]
+; CHECK-NEXT:    ldr h1, [x11], #2
+; CHECK-NEXT:    ldr h2, [x10]
 ; CHECK-NEXT:    subs x8, x8, #1
 ; CHECK-NEXT:    fmadd h1, h1, h0, h2
-; CHECK-NEXT:    str h1, [x9], #2
+; CHECK-NEXT:    str h1, [x10], #2
 ; CHECK-NEXT:    b.ne .LBB0_7
 ; CHECK-NEXT:  .LBB0_8: // %for.cond.cleanup
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
index f6bbdf5d95d87..cb281518d1fd5 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
@@ -36,23 +36,23 @@ define void @foo(i32 noundef %limit, ptr %out, ptr %y) {
 ; CHECK-NEXT:  .LBB0_5: // %vector.ph
 ; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
 ; CHECK-NEXT:    dup v0.8h, w15
-; CHECK-NEXT:    mov x16, x14
-; CHECK-NEXT:    mov x17, x13
-; CHECK-NEXT:    mov x18, x12
+; CHECK-NEXT:    mov x16, x12
+; CHECK-NEXT:    mov x17, x14
+; CHECK-NEXT:    mov x18, x13
 ; CHECK-NEXT:  .LBB0_6: // %vector.body
 ; CHECK-NEXT:    // Parent Loop BB0_3 Depth=1
 ; CHECK-NEXT:    // => This Inner Loop Header: Depth=2
-; CHECK-NEXT:    ldp q1, q4, [x16, #-16]
-; CHECK-NEXT:    subs x18, x18, #16
-; CHECK-NEXT:    ldp q3, q2, [x17, #-32]
-; CHECK-NEXT:    add x16, x16, #32
-; CHECK-NEXT:    ldp q6, q5, [x17]
+; CHECK-NEXT:    ldp q1, q4, [x17, #-16]
+; CHECK-NEXT:    subs x16, x16, #16
+; CHECK-NEXT:    ldp q3, q2, [x18, #-32]
+; CHECK-NEXT:    add x17, x17, #32
+; CHECK-NEXT:    ldp q6, q5, [x18]
 ; CHECK-NEXT:    smlal2 v2.4s, v0.8h, v1.8h
 ; CHECK-NEXT:    smlal v3.4s, v0.4h, v1.4h
 ; CHECK-NEXT:    smlal2 v5.4s, v0.8h, v4.8h
 ; CHECK-NEXT:    smlal v6.4s, v0.4h, v4.4h
-; CHECK-NEXT:    stp q3, q2, [x17, #-32]
-; CHECK-NEXT:    stp q6, q5, [x17], #64
+; CHECK-NEXT:    stp q3, q2, [x18, #-32]
+; CHECK-NEXT:    stp q6, q5, [x18], #64
 ; CHECK-NEXT:    b.ne .LBB0_6
 ; CHECK-NEXT:  // %bb.7: // %middle.block
 ; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
diff --git a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
index 2a37183c47d51..bc78558ec398d 100644
--- a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
+++ b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
@@ -1666,15 +1666,15 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-NEXT:    ldr q0, [x8, lCPI17_0@PAGEOFF]
 ; CHECK-NEXT:  Lloh21:
 ; CHECK-NEXT:    ldr q1, [x9, lCPI17_1@PAGEOFF]
-; CHECK-NEXT:    add x8, x1, #64
-; CHECK-NEXT:    add x9, x0, #8
+; CHECK-NEXT:    add x8, x0, #8
+; CHECK-NEXT:    add x9, x1, #64
 ; CHECK-NEXT:  LBB17_1: ; %loop
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ldp d2, d3, [x9, #-8]
+; CHECK-NEXT:    ldp d2, d3, [x8, #-8]
 ; CHECK-NEXT:    subs x10, x10, #16
-; CHECK-NEXT:    ldp q7, q5, [x8, #-32]
-; CHECK-NEXT:    add x9, x9, #16
-; CHECK-NEXT:    ldp q17, q6, [x8, #-64]
+; CHECK-NEXT:    ldp q7, q5, [x9, #-32]
+; CHECK-NEXT:    add x8, x8, #16
+; CHECK-NEXT:    ldp q17, q6, [x9, #-64]
 ; CHECK-NEXT:    tbl.16b v4, { v2 }, v1
 ; CHECK-NEXT:    tbl.16b v2, { v2 }, v0
 ; CHECK-NEXT:    tbl.16b v16, { v3 }, v1
@@ -1682,17 +1682,17 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-NEXT:    uaddw2.2d v5, v5, v4
 ; CHECK-NEXT:    uaddw2.2d v6, v6, v2
 ; CHECK-NEXT:    uaddw.2d v4, v7, v4
-; CHECK-NEXT:    ldp q18, q7, [x8, #32]
+; CHECK-NEXT:    ldp q18, q7, [x9, #32]
 ; CHECK-NEXT:    uaddw.2d v2, v17, v2
-; CHECK-NEXT:    stp q4, q5, [x8, #-32]
+; CHECK-NEXT:    stp q4, q5, [x9, #-32]
 ; CHECK-NEXT:    uaddw2.2d v5, v7, v16
-; CHECK-NEXT:    stp q2, q6, [x8, #-64]
+; CHECK-NEXT:    stp q2, q6, [x9, #-64]
 ; CHECK-NEXT:    uaddw.2d v16, v18, v16
-; CHECK-NEXT:    ldp q7, q6, [x8]
-; CHECK-NEXT:    stp q16, q5, [x8, #32]
+; CHECK-NEXT:    ldp q7, q6, [x9]
+; CHECK-NEXT:    stp q16, q5, [x9, #32]
 ; CHECK-NEXT:    uaddw2.2d v4, v6, v3
 ; CHECK-NEXT:    uaddw.2d v2, v7, v3
-; CHECK-NEXT:    stp q2, q4, [x8], #128
+; CHECK-NEXT:    stp q2, q4, [x9], #128
 ; CHECK-NEXT:    b.ne LBB17_1
 ; CHECK-NEXT:  ; %bb.2: ; %exit
 ; CHECK-NEXT:    ret
@@ -1708,31 +1708,31 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-BE-NEXT:    adrp x9, .LCPI17_1
 ; CHECK-BE-NEXT:    add x9, x9, :lo12:.LCPI17_1
 ; CHECK-BE-NEXT:    ld1 { v1.16b }, [x9]
-; CHECK-BE-NEXT:    add x9, x1, #64
-; CHECK-BE-NEXT:    add x10, x0, #8
+; CHECK-BE-NEXT:    add x9, x0, #8
+; CHECK-BE-NEXT:    add x10, x1, #64
 ; CHECK-BE-NEXT:  .LBB17_1: // %loop
 ; CHECK-BE-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-BE-NEXT:    ld1 { v2.8b }, [x10]
-; CHECK-BE-NEXT:    sub x11, x10, #8
-; CHECK-BE-NEXT:    add x15, x9, #32
+; CHECK-BE-NEXT:    ld1 { v2.8b }, [x9]
+; CHECK-BE-NEXT:    sub x11, x9, #8
+; CHECK-BE-NEXT:    add x15, x10, #32
 ; CHECK-BE-NEXT:    ld1 { v3.8b }, [x11]
 ; CHECK-BE-NEXT:    ld1 { v16.2d }, [x15]
-; CHECK-BE-NEXT:    sub x11, x9, #64
-; CHECK-BE-NEXT:    sub x12, x9, #32
-; CHECK-BE-NEXT:    ld1 { v6.2d }, [x9]
+; CHECK-BE-NEXT:    sub x11, x10, #64
+; CHECK-BE-NEXT:    sub x12, x10, #32
+; CHECK-BE-NEXT:    ld1 { v6.2d }, [x10]
 ; CHECK-BE-NEXT:    ld1 { v21.2d }, [x11]
 ; CHECK-BE-NEXT:    tbl v4.16b, { v2.16b }, v1.16b
 ; CHECK-BE-NEXT:    tbl v2.16b, { v2.16b }, v0.16b
 ; CHECK-BE-NEXT:    ld1 { v19.2d }, [x12]
 ; CHECK-BE-NEXT:    tbl v5.16b, { v3.16b }, v1.16b
 ; CHECK-BE-NEXT:    tbl v3.16b, { v3.16b }, v0.16b
-; CHECK-BE-NEXT:    sub x13, x9, #16
-; CHECK-BE-NEXT:    sub x14, x9, #48
-; CHECK-BE-NEXT:    add x16, x9, #48
-; CHECK-BE-NEXT:    add x17, x9, #16
+; CHECK-BE-NEXT:    sub x13, x10, #16
+; CHECK-BE-NEXT:    sub x14, x10, #48
+; CHECK-BE-NEXT:    add x16, x10, #48
+; CHECK-BE-NEXT:    add x17, x10, #16
 ; CHECK-BE-NEXT:    ld1 { v22.2d }, [x13]
 ; CHECK-BE-NEXT:    subs x8, x8, #16
-; CHECK-BE-NEXT:    add x10, x10, #16
+; CHECK-BE-NEXT:    add x9, x9, #16
 ; CHECK-BE-NEXT:    rev32 v7.8b, v4.8b
 ; CHECK-BE-NEXT:    ext v4.16b, v4.16b, v4.16b, #8
 ; CHECK-BE-NEXT:    rev32 v17.8b, v2.8b
@@ -1753,8 +1753,8 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-BE-NEXT:    uaddw v3.2d, v21.2d, v3.2s
 ; CHECK-BE-NEXT:    st1 { v7.2d }, [x15]
 ; CHECK-BE-NEXT:    ld1 { v7.2d }, [x17]
-; CHECK-BE-NEXT:    st1 { v6.2d }, [x9]
-; CHECK-BE-NEXT:    add x9, x9, #128
+; CHECK-BE-NEXT:    st1 { v6.2d }, [x10]
+; CHECK-BE-NEXT:    add x10, x10, #128
 ; CHECK-BE-NEXT:    uaddw v4.2d, v16.2d, v4.2s
 ; CHECK-BE-NEXT:    st1 { v5.2d }, [x12]
 ; CHECK-BE-NEXT:    uaddw v5.2d, v22.2d, v17.2s
diff --git a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
index ecbf5dfeb3af1..d0a6255cd0395 100644
--- a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
+++ b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
@@ -24,27 +24,27 @@ define amdgpu_kernel void @udiv32_invariant_denom(ptr addrspace(1) nocapture %ar
 ; GFX9-NEXT:    s_mov_b64 s[4:5], 0
 ; GFX9-NEXT:  .LBB0_1: ; %bb3
 ; GFX9-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX9-NEXT:    s_not_b32 s10, s5
-; GFX9-NEXT:    s_mul_i32 s9, s6, s5
+; GFX9-NEXT:    s_not_b32 s10, s3
+; GFX9-NEXT:    s_mul_i32 s9, s6, s3
 ; GFX9-NEXT:    s_mul_i32 s10, s6, s10
-; GFX9-NEXT:    s_add_i32 s11, s5, 1
+; GFX9-NEXT:    s_add_i32 s11, s3, 1
 ; GFX9-NEXT:    s_sub_i32 s9, s7, s9
 ; GFX9-NEXT:    s_add_i32 s10, s7, s10
 ; GFX9-NEXT:    s_cmp_ge_u32 s9, s6
-; GFX9-NEXT:    s_cselect_b32 s11, s11, s5
+; GFX9-NEXT:    s_cselect_b32 s11, s11, s3
 ; GFX9-NEXT:    s_cselect_b32 s9, s10, s9
 ; GFX9-NEXT:    s_add_i32 s10, s11, 1
 ; GFX9-NEXT:    s_cmp_ge_u32 s9, s6
 ; GFX9-NEXT:    s_cselect_b32 s9, s10, s11
-; GFX9-NEXT:    s_add_u32 s10, s0, s2
-; GFX9-NEXT:    s_addc_u32 s11, s1, s3
+; GFX9-NEXT:    s_add_u32 s10, s0, s4
+; GFX9-NEXT:    s_addc_u32 s11, s1, s5
 ; GFX9-NEXT:    s_add_i32 s7, s7, 1
-; GFX9-NEXT:    s_add_u32 s4, s4, s8
+; GFX9-NEXT:    s_add_u32 s4, s4, 4
 ; GFX9-NEXT:    s_addc_u32 s5, s5, 0
-; GFX9-NEXT:    s_add_u32 s2, s2, 4
+; GFX9-NEXT:    s_add_u32 s2, s2, s8
 ; GFX9-NEXT:    s_addc_u32 s3, s3, 0
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s9
-; GFX9-NEXT:    s_cmpk_eq_i32 s2, 0x1000
+; GFX9-NEXT:    s_cmpk_eq_i32 s4, 0x1000
 ; GFX9-NEXT:    global_store_dword v0, v1, s[10:11]
 ; GFX9-NEXT:    s_cbranch_scc0 .LBB0_1
 ; GFX9-NEXT:  ; %bb.2: ; %bb2
@@ -72,27 +72,27 @@ define amdgpu_kernel void @udiv32_invariant_denom(ptr addrspace(1) nocapture %ar
 ; GFX10-NEXT:  .LBB0_1: ; %bb3
 ; GFX10-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10-NEXT:    s_waitcnt_depctr 0xffe3
-; GFX10-NEXT:    s_not_b32 s10, s5
-; GFX10-NEXT:    s_mul_i32 s9, s6, s5
+; GFX10-NEXT:    s_not_b32 s10, s3
+; GFX10-NEXT:    s_mul_i32 s9, s6, s3
 ; GFX10-NEXT:    s_mul_i32 s10, s6, s10
 ; GFX10-NEXT:    s_sub_i32 s9, s7, s9
-; GFX10-NEXT:    s_add_i32 s11, s5, 1
+; GFX10-NEXT:    s_add_i32 s11, s3, 1
 ; GFX10-NEXT:    s_add_i32 s10, s7, s10
 ; GFX10-NEXT:    s_cmp_ge_u32 s9, s6
-; GFX10-NEXT:    s_cselect_b32 s11, s11, s5
+; GFX10-NEXT:    s_cselect_b32 s11, s11, s3
 ; GFX10-NEXT:    s_cselect_b32 s9, s10, s9
 ; GFX10-NEXT:    s_add_i32 s10, s11, 1
 ; GFX10-NEXT:    s_cmp_ge_u32 s9, s6
 ; GFX10-NEXT:    s_cselect_b32 s9, s10, s11
-; GFX10-NEXT:    s_add_u32 s10, s0, s2
-; GFX10-NEXT:    s_addc_u32 s11, s1, s3
+; GFX10-NEXT:    s_add_u32 s10, s0, s4
+; GFX10-NEXT:    s_addc_u32 s11, s1, s5
 ; GFX10-NEXT:    s_add_i32 s7, s7, 1
-; GFX10-NEXT:    s_add_u32 s4, s4, s8
+; GFX10-NEXT:    s_add_u32 s4, s4, 4
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s9
 ; GFX10-NEXT:    s_addc_u32 s5, s5, 0
-; GFX10-NEXT:    s_add_u32 s2, s2, 4
+; GFX10-NEXT:    s_add_u32 s2, s2, s8
 ; GFX10-NEXT:    s_addc_u32 s3, s3, 0
-; GFX10-NEXT:    s_cmpk_eq_i32 s2, 0x1000
+; GFX10-NEXT:    s_cmpk_eq_i32 s4, 0x1000
 ; GFX10-NEXT:    global_store_dword v0, v1, s[10:11]
 ; GFX10-NEXT:    s_cbranch_scc0 .LBB0_1
 ; GFX10-NEXT:  ; %bb.2: ; %bb2
@@ -123,28 +123,27 @@ define amdgpu_kernel void @udiv32_invariant_denom(ptr addrspace(1) nocapture %ar
 ; GFX11-NEXT:    .p2align 6
 ; GFX11-NEXT:  .LBB0_1: ; %bb3
 ; GFX11-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_not_b32 s10, s5
-; GFX11-NEXT:    s_mul_i32 s9, s6, s5
+; GFX11-NEXT:    s_not_b32 s10, s3
+; GFX11-NEXT:    s_mul_i32 s9, s6, s3
 ; GFX11-NEXT:    s_mul_i32 s10, s6, s10
 ; GFX11-NEXT:    s_sub_i32 s9, s7, s9
-; GFX11-NEXT:    s_add_i32 s11, s5, 1
+; GFX11-NEXT:    s_ad...
[truncated]

bool LSRInstance::SortLSRUses() {
SmallVector<LSRUse *, 16> NewOrder;
for (auto &LU : Uses) {
if (!LU.Formulae.size()) {
Contributor

Suggested change
if (!LU.Formulae.size()) {
if (LU.Formulae.empty()) {

Author

ok, fixed by second commit

Comment on lines 5389 to 5390
std::stable_sort(
NewOrder.begin(), NewOrder.end(), [](const LSRUse *L, const LSRUse *R) {
Contributor

Suggested change
std::stable_sort(
NewOrder.begin(), NewOrder.end(), [](const LSRUse *L, const LSRUse *R) {
stable_sort(
NewOrder, NewOrder, [](const LSRUse *L, const LSRUse *R) {

Author

ok, fixed by second commit - but one NewOrder in your suggestion is redundant
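For reference, this is roughly how the call reads with LLVM's range-based helper once the duplicated range argument is dropped; it is a fragment of SortLSRUses from the patch (same comparator), not standalone code:

llvm::stable_sort(NewOrder, [](const LSRUse *L, const LSRUse *R) {
  // Same key as in the patch: bigger/busier LSRUses are processed first.
  auto CalcKey = [](const LSRUse *LU) {
    return std::tuple(LU->Regs.size(), LU->Formulae.size(), LU->Kind,
                      LU->MinOffset.getKnownMinValue(),
                      LU->MaxOffset.getKnownMinValue(),
                      LU->AllFixupsOutsideLoop, LU->RigidFormula);
  };
  return CalcKey(L) > CalcKey(R);
});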

@@ -5387,6 +5428,10 @@ void LSRInstance::SolveRecurse(SmallVectorImpl<const Formula *> &Solution,

const LSRUse &LU = Uses[Workspace.size()];

assert(LU.Formulae.size() &&
Contributor

Suggested change
assert(LU.Formulae.size() &&
assert(!LU.Formulae.empty() &&

Author

ok, fixed by second commit

Comment on lines 5495 to 5498
LLVM_DEBUG(dbgs() << "New best at "; NewCost.print(dbgs());
dbgs() << ".\nRegs:\n";
for (const SCEV *S : NewRegs) dbgs() << "- " << *S << "\n";
dbgs() << '\n');
Contributor

Suggested change
LLVM_DEBUG(dbgs() << "New best at "; NewCost.print(dbgs());
dbgs() << ".\nRegs:\n";
for (const SCEV *S : NewRegs) dbgs() << "- " << *S << "\n";
dbgs() << '\n');
LLVM_DEBUG({dbgs() << "New best at "; NewCost.print(dbgs());
dbgs() << ".\nRegs:\n";
for (const SCEV *S : NewRegs) dbgs() << "- " << *S << '\n';
dbgs() << '\n'});

Extra {} will clang-format this better (though I suppose this is just getting re-indented here)

Author

Ok, done

/// unsolvable.
bool LSRInstance::SortLSRUses() {
SmallVector<LSRUse *, 16> NewOrder;
for (auto &LU : Uses) {
Contributor

no auto

Author

Fixed - the type is specified explicitly
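Presumably the fix just spells out the element type in the range-for, along the lines of (with the .empty() change from the earlier suggestion folded in):

for (LSRUse &LU : Uses) {
  if (LU.Formulae.empty())
    return false;
  NewOrder.push_back(&LU);
}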

/// Also check that all LSRUses have formulae as otherwise the situation is
/// unsolvable.
bool LSRInstance::SortLSRUses() {
SmallVector<LSRUse *, 16> NewOrder;
Contributor

Why do you need this temporary vector? Can't you sort Uses in place?

Author

LSRUse is quite heavy (sizeof(LSRUse) = 2184), and sorting moves the objects around while it runs. It's therefore faster to sort an array of pointers and then establish the new order in the original array afterwards.
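A small self-contained sketch of that pattern outside of LSR terms (the Item struct and field names below are placeholders, not the real LSRUse): sort cheap pointers, then rebuild the owning container with a single pass of moves.

#include <algorithm>
#include <tuple>
#include <utility>
#include <vector>

// Stand-in for a heavy object such as LSRUse; only the sort key is sketched.
struct Item {
  unsigned NumRegs = 0;
  unsigned NumFormulae = 0;
  char Payload[2048]; // makes shuffling whole objects during sorting expensive
};

void sortHeavyItems(std::vector<Item> &Items) {
  // Sort pointers instead of the heavy objects themselves.
  std::vector<Item *> Order;
  Order.reserve(Items.size());
  for (Item &I : Items)
    Order.push_back(&I);

  std::stable_sort(Order.begin(), Order.end(),
                   [](const Item *L, const Item *R) {
                     return std::tuple(L->NumRegs, L->NumFormulae) >
                            std::tuple(R->NumRegs, R->NumFormulae);
                   });

  // Establish the new order in the original storage with one linear pass.
  std::vector<Item> Sorted;
  Sorted.reserve(Items.size());
  for (Item *I : Order)
    Sorted.push_back(std::move(*I));
  Items = std::move(Sorted);
}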

SergeyShch01 requested a review from arsenm July 9, 2025 10:23

github-actions bot commented Jul 9, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp llvm/test/CodeGen/AArch64/aarch64-p2align-max-bytes.ll llvm/test/CodeGen/AArch64/machine-combiner-copy.ll llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll llvm/test/CodeGen/AArch64/zext-to-tbl.ll llvm/test/CodeGen/AMDGPU/idiv-licm.ll llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll llvm/test/CodeGen/AMDGPU/memmove-var-size.ll llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll llvm/test/CodeGen/ARM/2011-03-15-LdStMultipleBug.ll llvm/test/CodeGen/ARM/dsp-loop-indexing.ll llvm/test/CodeGen/ARM/fpclamptosat.ll llvm/test/CodeGen/ARM/loop-indexing.ll llvm/test/CodeGen/NVPTX/load-with-non-coherent-cache.ll llvm/test/CodeGen/PowerPC/lsr-profitable-chain.ll llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-store-asm.ll llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tail-data-types.ll llvm/test/CodeGen/Thumb2/mve-blockplacement.ll llvm/test/CodeGen/Thumb2/mve-float16regloops.ll llvm/test/CodeGen/Thumb2/mve-float32regloops.ll llvm/test/CodeGen/Thumb2/mve-memtp-loop.ll llvm/test/CodeGen/WebAssembly/unrolled-mem-indices.ll llvm/test/CodeGen/X86/apx/check-nf-in-suppress-reloc-pass.ll llvm/test/CodeGen/X86/avx512vnni-combine.ll llvm/test/CodeGen/X86/avxvnni-combine.ll llvm/test/CodeGen/X86/dag-update-nodetomatch.ll llvm/test/CodeGen/X86/loop-strength-reduce7.ll llvm/test/CodeGen/X86/masked-iv-safe.ll llvm/test/CodeGen/X86/optimize-max-0.ll llvm/test/CodeGen/X86/pr42565.ll llvm/test/CodeGen/X86/ragreedy-hoist-spill.ll llvm/test/Transforms/LoopStrengthReduce/AArch64/vscale-fixups.ll llvm/test/Transforms/LoopStrengthReduce/AMDGPU/atomics.ll llvm/test/Transforms/LoopStrengthReduce/AMDGPU/different-addrspace-addressing-mode-loops.ll llvm/test/Transforms/LoopStrengthReduce/AMDGPU/lsr-postinc-pos-addrspace.ll llvm/test/Transforms/LoopStrengthReduce/ARM/complexity.ll llvm/test/Transforms/LoopStrengthReduce/ARM/illegal-addr-modes.ll llvm/test/Transforms/LoopStrengthReduce/RISCV/many-geps.ll llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll llvm/test/Transforms/LoopStrengthReduce/X86/missing-phi-operand-update.ll llvm/test/Transforms/LoopStrengthReduce/X86/nested-loop.ll llvm/test/Transforms/LoopStrengthReduce/X86/normalization-during-scev-expansion.ll llvm/test/Transforms/LoopStrengthReduce/X86/postinc-iv-used-by-urem-and-udiv.ll llvm/test/Transforms/LoopStrengthReduce/X86/pr62660-normalization-failure.ll llvm/test/Transforms/LoopStrengthReduce/depth-limit-overrun.ll llvm/test/Transforms/LoopStrengthReduce/ivincs-hoist.ll llvm/test/Transforms/LoopStrengthReduce/lsr-overflow.ll llvm/test/Transforms/LoopStrengthReduce/lsr-term-fold-negative-testcase.ll llvm/test/Transforms/LoopStrengthReduce/uglygep.ll llvm/test/Transforms/LoopStrengthReduce/wrong-hoisting-iv.ll

The following files introduce new uses of undef:

  • llvm/test/Transforms/LoopStrengthReduce/uglygep.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.
