Skip to content

Commit 8fac9e9

Browse files
committed
[X86] X86TTIImpl::getInterleavedMemoryOpCost(): scale interleaving cost by the fraction of live members
By definition, interleaving load of stride N means: load N*VF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)*stride + 0`, and 1'th vector containing elements `[0, VF)*stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors is demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) Right now, for such not-fully-interleaved loads we just use the costs for fully-interleaved loads. But at least **in general**, that is obviously overly pessimistic, because **in general**, not all the shuffles needed to perform the full interleaving will end up being live. So what this does, is naively scales the interleaving cost by the fraction of the live members. I believe this should still result in the right ballpark cost estimate, although it may be over/under -estimate. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112307
1 parent fd5e3f3 commit 8fac9e9

7 files changed

+45
-39
lines changed

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

Lines changed: 11 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -5205,6 +5205,7 @@ InstructionCost X86TTIImpl::getInterleavedMemoryOpCost(
52055205
Type::getIntNTy(ScalarTy->getContext(), DL.getTypeSizeInBits(ScalarTy));
52065206

52075207
// Get the cost of all the memory operations.
5208+
// FIXME: discount dead loads.
52085209
InstructionCost MemOpCosts = getMemoryOpCost(
52095210
Opcode, VecTy, MaybeAlign(Alignment), AddressSpace, CostKind);
52105211

@@ -5424,22 +5425,27 @@ InstructionCost X86TTIImpl::getInterleavedMemoryOpCost(
54245425
};
54255426

54265427
if (Opcode == Instruction::Load) {
5427-
// FIXME: if we have a partially-interleaved groups, with gaps,
5428-
// should we discount the not-demanded indicies?
5428+
auto GetDiscountedCost = [Factor, NumMembers = Indices.size(),
5429+
MemOpCosts](const CostTblEntry *Entry) {
5430+
// NOTE: this is just an approximation!
5431+
// It can over/under -estimate the cost!
5432+
return MemOpCosts + divideCeil(NumMembers * Entry->Cost, Factor);
5433+
};
5434+
54295435
if (ST->hasAVX2())
54305436
if (const auto *Entry = CostTableLookup(AVX2InterleavedLoadTbl, Factor,
54315437
ETy.getSimpleVT()))
5432-
return MemOpCosts + Entry->Cost;
5438+
return GetDiscountedCost(Entry);
54335439

54345440
if (ST->hasSSSE3())
54355441
if (const auto *Entry = CostTableLookup(SSSE3InterleavedLoadTbl, Factor,
54365442
ETy.getSimpleVT()))
5437-
return MemOpCosts + Entry->Cost;
5443+
return GetDiscountedCost(Entry);
54385444

54395445
if (ST->hasSSE2())
54405446
if (const auto *Entry = CostTableLookup(SSE2InterleavedLoadTbl, Factor,
54415447
ETy.getSimpleVT()))
5442-
return MemOpCosts + Entry->Cost;
5448+
return GetDiscountedCost(Entry);
54435449
} else {
54445450
assert(Opcode == Instruction::Store &&
54455451
"Expected Store Instruction at this point");

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -13,24 +13,24 @@ target triple = "x86_64-unknown-linux-gnu"
1313
; CHECK: LV: Checking a loop in "test"
1414
;
1515
; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
16-
; SSE2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
17-
; SSE2: LV: Found an estimated cost of 4 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
16+
; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
17+
; SSE2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
1818
; SSE2: LV: Found an estimated cost of 30 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
1919
; SSE2: LV: Found an estimated cost of 60 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
2020
;
2121
; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
22-
; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
23-
; AVX1: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
22+
; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
23+
; AVX1: LV: Found an estimated cost of 2 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
2424
; AVX1: LV: Found an estimated cost of 24 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
2525
; AVX1: LV: Found an estimated cost of 48 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
2626
; AVX1: LV: Found an estimated cost of 96 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
2727
;
2828
; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
29-
; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30-
; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31-
; AVX2: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32-
; AVX2: LV: Found an estimated cost of 12 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33-
; AVX2: LV: Found an estimated cost of 24 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
29+
; AVX2: LV: Found an estimated cost of 2 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30+
; AVX2: LV: Found an estimated cost of 2 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31+
; AVX2: LV: Found an estimated cost of 4 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32+
; AVX2: LV: Found an estimated cost of 8 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33+
; AVX2: LV: Found an estimated cost of 16 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
3434
;
3535
; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
3636
; AVX512: LV: Found an estimated cost of 1 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-01u.ll

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,11 @@ target triple = "x86_64-unknown-linux-gnu"
2626
; AVX1: LV: Found an estimated cost of 188 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
2727
;
2828
; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
29-
; AVX2: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30-
; AVX2: LV: Found an estimated cost of 5 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31-
; AVX2: LV: Found an estimated cost of 10 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32-
; AVX2: LV: Found an estimated cost of 20 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33-
; AVX2: LV: Found an estimated cost of 44 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
29+
; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30+
; AVX2: LV: Found an estimated cost of 4 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31+
; AVX2: LV: Found an estimated cost of 8 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32+
; AVX2: LV: Found an estimated cost of 16 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33+
; AVX2: LV: Found an estimated cost of 34 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
3434
;
3535
; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
3636
; AVX512: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-0uu.ll

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,11 @@ target triple = "x86_64-unknown-linux-gnu"
2626
; AVX1: LV: Found an estimated cost of 100 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
2727
;
2828
; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
29-
; AVX2: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30-
; AVX2: LV: Found an estimated cost of 5 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31-
; AVX2: LV: Found an estimated cost of 10 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32-
; AVX2: LV: Found an estimated cost of 20 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33-
; AVX2: LV: Found an estimated cost of 44 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
29+
; AVX2: LV: Found an estimated cost of 4 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30+
; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31+
; AVX2: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32+
; AVX2: LV: Found an estimated cost of 11 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33+
; AVX2: LV: Found an estimated cost of 23 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
3434
;
3535
; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
3636
; AVX512: LV: Found an estimated cost of 1 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-012u.ll

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,11 @@ target triple = "x86_64-unknown-linux-gnu"
2626
; AVX1: LV: Found an estimated cost of 280 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
2727
;
2828
; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
29-
; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30-
; AVX2: LV: Found an estimated cost of 10 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31-
; AVX2: LV: Found an estimated cost of 20 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32-
; AVX2: LV: Found an estimated cost of 40 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33-
; AVX2: LV: Found an estimated cost of 84 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
29+
; AVX2: LV: Found an estimated cost of 4 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30+
; AVX2: LV: Found an estimated cost of 8 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31+
; AVX2: LV: Found an estimated cost of 16 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32+
; AVX2: LV: Found an estimated cost of 32 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33+
; AVX2: LV: Found an estimated cost of 67 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
3434
;
3535
; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
3636
; AVX512: LV: Found an estimated cost of 4 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-01uu.ll

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,11 @@ target triple = "x86_64-unknown-linux-gnu"
2626
; AVX1: LV: Found an estimated cost of 192 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
2727
;
2828
; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
29-
; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30-
; AVX2: LV: Found an estimated cost of 10 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31-
; AVX2: LV: Found an estimated cost of 20 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32-
; AVX2: LV: Found an estimated cost of 40 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33-
; AVX2: LV: Found an estimated cost of 84 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
29+
; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30+
; AVX2: LV: Found an estimated cost of 6 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31+
; AVX2: LV: Found an estimated cost of 12 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32+
; AVX2: LV: Found an estimated cost of 24 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33+
; AVX2: LV: Found an estimated cost of 50 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
3434
;
3535
; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
3636
; AVX512: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-0uuu.ll

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,11 @@ target triple = "x86_64-unknown-linux-gnu"
2626
; AVX1: LV: Found an estimated cost of 104 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
2727
;
2828
; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
29-
; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30-
; AVX2: LV: Found an estimated cost of 10 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31-
; AVX2: LV: Found an estimated cost of 20 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32-
; AVX2: LV: Found an estimated cost of 40 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33-
; AVX2: LV: Found an estimated cost of 84 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
29+
; AVX2: LV: Found an estimated cost of 2 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
30+
; AVX2: LV: Found an estimated cost of 4 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
31+
; AVX2: LV: Found an estimated cost of 8 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
32+
; AVX2: LV: Found an estimated cost of 16 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
33+
; AVX2: LV: Found an estimated cost of 33 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
3434
;
3535
; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
3636
; AVX512: LV: Found an estimated cost of 1 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4

0 commit comments

Comments
 (0)