Commit 7c812ea

[RISCV] Avoid vl toggles when lowering vector_splice/experimental_vp_splice and add +vl-dependent-latency tuning feature (llvm#146746)
When vectorizing a loop with a fixed-order recurrence we use a splice, which gets lowered to a vslidedown and vslideup pair.

However, with the way we lower it today we end up with extra vl toggles in the loop, especially with EVL tail folding, e.g.:

    .LBB0_5:                                # %vector.body
                                            # =>This Inner Loop Header: Depth=1
        sub             a5, a2, a3
        sh2add          a6, a3, a1
        zext.w          a7, a4
        vsetvli         a4, a5, e8, mf2, ta, ma
        vle32.v         v10, (a6)
        addi            a7, a7, -1
        vsetivli        zero, 1, e32, m2, ta, ma
        vslidedown.vx   v8, v8, a7
        sh2add          a6, a3, a0
        vsetvli         zero, a5, e32, m2, ta, ma
        vslideup.vi     v8, v10, 1
        vadd.vv         v8, v10, v8
        add             a3, a3, a4
        vse32.v         v8, (a6)
        vmv2r.v         v8, v10
        bne             a3, a2, .LBB0_5

Because the vslideup overwrites all but UpOffset elements from the vslidedown, we currently set the vslidedown's AVL to said offset. But in the vslideup we use either VLMAX or the EVL, which causes a toggle.

This patch increases the AVL of the vslidedown so that it matches the vslideup, even though the extra elements are overwritten, to avoid the toggle.

A new tuning feature, +vl-dependent-latency, has been added which keeps the old behaviour for microarchitectures that dynamically dispatch uops based on vl, e.g. sifive-x280.

+vl-dependent-latency can be reused for the recently proposed Ovlt optimization directive if/when it's ratified: https://lists.riscv.org/g/tech-privileged/message/2487

If we wanted to aggressively optimise for vl at the expense of introducing more toggles, we could probably look at doing this in RISCVVLOptimizer.
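The claim that the vslidedown's AVL can be raised without changing the result can be checked with a small model. The following is a minimal sketch in plain C++, not LLVM code: `VReg`, `slideDown`, `slideUp`, `spliceVia`, and `spliceReference` are hypothetical names invented for illustration, and the model assumes an undef passthru and tail-agnostic policy (unspecified elements are represented as `std::nullopt`). It only covers the VLMAX case of vector_splice, not the EVL case.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy model of a vector register: std::nullopt marks an element whose value
// is unspecified (undef passthru or tail-agnostic tail). Hypothetical types.
using Elem = std::optional<int>;
using VReg = std::vector<Elem>;

// Models vslidedown with an undef passthru: dest[i] = src[i + offset] for
// i < vl, reading 0 past VLMAX. Elements at or beyond vl stay unspecified.
VReg slideDown(const VReg &src, std::size_t offset, std::size_t vl) {
  const std::size_t vlmax = src.size();
  VReg dest(vlmax, std::nullopt);
  for (std::size_t i = 0; i < vl && i < vlmax; ++i)
    dest[i] = (i + offset < vlmax) ? src[i + offset] : Elem(0);
  return dest;
}

// Models vslideup: dest[i] = src[i - offset] for offset <= i < vl. Elements
// below offset keep their old value; the tail becomes unspecified.
VReg slideUp(VReg dest, const VReg &src, std::size_t offset, std::size_t vl) {
  const std::size_t vlmax = dest.size();
  for (std::size_t i = offset; i < vl && i < vlmax; ++i)
    dest[i] = src[i - offset];
  for (std::size_t i = vl; i < vlmax; ++i)
    dest[i] = std::nullopt;
  return dest;
}

// Splice as a slidedown/slideup pair. slideDownAVL is the knob the patch
// changes: UpOffset (old behaviour) versus VLMAX (new behaviour).
VReg spliceVia(const VReg &v1, const VReg &v2, std::size_t downOffset,
               std::size_t slideDownAVL) {
  const std::size_t vlmax = v1.size();
  const std::size_t upOffset = vlmax - downOffset;
  return slideUp(slideDown(v1, downOffset, slideDownAVL), v2, upOffset, vlmax);
}

// Reference semantics: VLMAX elements of concat(v1, v2) starting at downOffset.
VReg spliceReference(const VReg &v1, const VReg &v2, std::size_t downOffset) {
  VReg cat(v1);
  cat.insert(cat.end(), v2.begin(), v2.end());
  const auto first = cat.begin() + static_cast<std::ptrdiff_t>(downOffset);
  return VReg(first, first + static_cast<std::ptrdiff_t>(v1.size()));
}
```

With `v1 = {1, 2, 3, 4}`, `v2 = {5, 6, 7, 8}`, and `downOffset = 3` (so UpOffset is 1, as in the loop above), both `spliceVia(v1, v2, 3, 1)` and `spliceVia(v1, v2, 3, 4)` agree with `spliceReference(v1, v2, 3)`: the extra elements written by the longer vslidedown are all overwritten by the vslideup.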
1 parent a8280c4 commit 7c812ea

9 files changed: +4571 -2276 lines changed


llvm/lib/Target/RISCV/RISCVFeatures.td

Lines changed: 5 additions & 0 deletions
@@ -1709,6 +1709,11 @@ foreach nf = {2-8} in
                       "true", "vlseg"#nf#"eN.v and vsseg"#nf#"eN.v are "
                       "implemented as a wide memory op and shuffle">;

+def TuneVLDependentLatency
+    : SubtargetFeature<"vl-dependent-latency", "HasVLDependentLatency", "true",
+                       "Latency of vector instructions is dependent on the "
+                       "dynamic value of vl">;
+
 def Experimental
    : SubtargetFeature<"experimental", "HasExperimental",
                       "true", "Experimental intrinsics">;

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

Lines changed: 5 additions & 4 deletions
@@ -12341,9 +12341,10 @@ SDValue RISCVTargetLowering::lowerVECTOR_SPLICE(SDValue Op,

   SDValue TrueMask = getAllOnesMask(VecVT, VLMax, DL, DAG);

-  SDValue SlideDown =
-      getVSlidedown(DAG, Subtarget, DL, VecVT, DAG.getUNDEF(VecVT), V1,
-                    DownOffset, TrueMask, UpOffset);
+  SDValue SlideDown = getVSlidedown(
+      DAG, Subtarget, DL, VecVT, DAG.getUNDEF(VecVT), V1, DownOffset, TrueMask,
+      Subtarget.hasVLDependentLatency() ? UpOffset
+                                        : DAG.getRegister(RISCV::X0, XLenVT));
   return getVSlideup(DAG, Subtarget, DL, VecVT, SlideDown, V2, UpOffset,
                      TrueMask, DAG.getRegister(RISCV::X0, XLenVT),
                      RISCVVType::TAIL_AGNOSTIC);
@@ -13367,7 +13368,7 @@ RISCVTargetLowering::lowerVPSpliceExperimental(SDValue Op,
   if (ImmValue != 0)
     Op1 = getVSlidedown(DAG, Subtarget, DL, ContainerVT,
                         DAG.getUNDEF(ContainerVT), Op1, DownOffset, Mask,
-                        UpOffset);
+                        Subtarget.hasVLDependentLatency() ? UpOffset : EVL2);
   SDValue Result = getVSlideup(DAG, Subtarget, DL, ContainerVT, Op1, Op2,
                                UpOffset, Mask, EVL2, RISCVVType::TAIL_AGNOSTIC);
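To illustrate why matching the two AVLs removes vl toggles, here is a rough sketch in standalone C++. It is a toy model under stated assumptions, not how RISCVInsertVSETVLI actually works: it tracks only a symbolic AVL per instruction (ignoring vtype) and charges one vset{i}vli whenever the AVL differs from the previous instruction's. The names `slideDownAVL`, `loopBodyAVLs`, and `countToggles` are hypothetical; `slideDownAVL` mirrors the shape of the ternary in the hunks above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Mirrors the ternary in the patch: keep the short AVL (UpOffset) only when
// the subtarget's latency depends on vl, otherwise reuse the vslideup's AVL.
std::string slideDownAVL(bool hasVLDependentLatency,
                         const std::string &upOffset,
                         const std::string &slideUpAVL) {
  return hasVLDependentLatency ? upOffset : slideUpAVL;
}

// Symbolic AVLs for the vector instructions in the loop from the commit
// message: vle32.v, vslidedown.vx, vslideup.vi, vadd.vv, vse32.v.
// "a5" stands for the EVL register and "1" for UpOffset.
std::vector<std::string> loopBodyAVLs(bool hasVLDependentLatency) {
  const std::string evl = "a5";
  return {evl, slideDownAVL(hasVLDependentLatency, "1", evl), evl, evl, evl};
}

// Toy toggle counter (an assumption, far simpler than the real pass): one
// vset{i}vli is needed whenever the required AVL changes.
std::size_t countToggles(const std::vector<std::string> &avls) {
  std::size_t toggles = 0;
  const std::string *current = nullptr;
  for (const std::string &avl : avls) {
    if (current == nullptr || *current != avl)
      ++toggles;
    current = &avl;
  }
  return toggles;
}
```

In this model, `countToggles(loopBodyAVLs(true))` is 3, matching the three vset{i}vli instructions in the loop shown in the commit message, while `countToggles(loopBodyAVLs(false))` is 1.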

llvm/lib/Target/RISCV/RISCVProcessors.td

Lines changed: 2 additions & 1 deletion
@@ -275,7 +275,8 @@ def SIFIVE_U74 : RISCVProcessorModel<"sifive-u74",
 defvar SiFiveIntelligenceTuneFeatures = !listconcat(SiFive7TuneFeatures,
                                                     [TuneDLenFactor2,
                                                      TuneOptimizedZeroStrideLoad,
-                                                     TuneOptimizedNF2SegmentLoadStore]);
+                                                     TuneOptimizedNF2SegmentLoadStore,
+                                                     TuneVLDependentLatency]);
 def SIFIVE_X280 : RISCVProcessorModel<"sifive-x280", SiFive7Model,
                                       [Feature64Bit,
                                        FeatureStdExtI,

llvm/test/CodeGen/RISCV/features-info.ll

Lines changed: 1 addition & 0 deletions
@@ -171,6 +171,7 @@
 ; CHECK-NEXT: use-postra-scheduler - Schedule again after register allocation.
 ; CHECK-NEXT: v - 'V' (Vector Extension for Application Processors).
 ; CHECK-NEXT: ventana-veyron - Ventana Veyron-Series processors.
+; CHECK-NEXT: vl-dependent-latency - Latency of vector instructions is dependent on the dynamic value of vl.
 ; CHECK-NEXT: vxrm-pipeline-flush - VXRM writes causes pipeline flush.
 ; CHECK-NEXT: xandesperf - 'XAndesPerf' (Andes Performance Extension).
 ; CHECK-NEXT: xandesvbfhcvt - 'XAndesVBFHCvt' (Andes Vector BFLOAT16 Conversion Extension).
