Commit 264d113

AMDGPU/GISel: Introduce custom legalization of G_MUL
The generic legalizer framework is still used to reduce the problem to
scalar multiplication with the bit size a multiple of 32.

Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:

- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
  factors and add a 64-bit accumulator. Most partial products should use
  this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
  product multiply-adds can feed the accumulator into each other
  directly. (The register allocator's support for that is somewhat
  limited, but that only matters for 128-bit integers and larger.)
- On the other hand, on some hardware, V_MAD_U64_U32 requires the
  accumulator to be stored in an even-aligned pair of GPRs. To avoid
  excessive register copies, it makes sense to compute odd partial
  products separately from even partial products (where a partial
  product src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
  halves together as a final step.
- We can combine G_MUL + G_ADD into a single cascade of multiply-adds.
- The target can keep many carry bits in flight simultaneously, so
  combining carries using G_UADDE is preferable to G_ZEXT + G_ADD.
- Not addressed by this patch: when the factors are sign-extended, the
  V_MAD_I64_I32 instruction (the signed version) can be used.

It is difficult to address these points generically:

1. Finding matching pairs of G_MUL and G_UMULH to recover a wide
   multiply is expensive. We could add a G_UMUL_LOHI generic instruction
   and conditionally use it in the generic legalizer, but by itself this
   wouldn't allow us to use the accumulation capability of
   V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
   post-legalization, but this is also expensive.
2. Similarly, making sense of the legalization outcome of a wide
   pre-legalization G_MUL + G_ADD pair is extremely expensive.
3. The generic legalizer has no good way to deal with the particular
   idiosyncrasy of "odd" vs. "even" partial products.

All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not be
burdened with such overly target-specific concerns. Hence, a custom
legalization.

Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in general.
SelectionDAG iteratively cuts wide scalars into low and high halves
until a legal size is reached. By contrast, GlobalISel does the
narrowing in a single shot, which should be better for compile time and
for the quality of the generated code.

This patch leaves three gaps open:

1. When the factors are uniform, we should execute the multiplication on
   the SALU. Register bank mapping already ensures this. However, the
   resulting code sequence is not optimal because it doesn't fully use
   the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32 doesn't have
   a carry-in.) It is very difficult to fix this after the fact, so we
   should really use a different legalization sequence in this case.
   Unfortunately, we don't have a divergence analysis and so cannot make
   that choice. (This only matters for 128-bit integers and larger.)
2. Avoid unnecessary multiplies when sources are known to be zero- or
   sign-extended. The challenge is that the legalizer does not currently
   have access to GISelKnownBits.
3. When the G_MUL is followed by a G_ADD, we should consider combining
   the two instructions into a single multiply-add sequence, to fully
   utilize the accumulator of V_MAD_U64_U32 (unless the multiply has
   multiple uses and the implied duplication of the multiply is an
   overall negative). However, this is not advisable when the factors
   are uniform: in that case, it is generally better to *not* combine
   the two operations, so that the multiply can be done on the SALU.
   Again, we don't have a divergence analysis available and so cannot
   make an informed choice.

Differential Revision: https://reviews.llvm.org/D124844
1 parent 1a02db9 commit 264d113

13 files changed: +13173 / -14274 lines

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp

Lines changed: 328 additions & 5 deletions
```diff
@@ -530,13 +530,22 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,

   if (ST.hasVOP3PInsts() && ST.hasAddNoCarry() && ST.hasIntClamp()) {
     // Full set of gfx9 features.
-    getActionDefinitionsBuilder({G_ADD, G_SUB, G_MUL})
+    getActionDefinitionsBuilder({G_ADD, G_SUB})
       .legalFor({S32, S16, V2S16})
+      .clampMaxNumElementsStrict(0, S16, 2)
+      .scalarize(0)
       .minScalar(0, S16)
+      .widenScalarToNextMultipleOf(0, 32)
+      .maxScalar(0, S32);
+
+    getActionDefinitionsBuilder(G_MUL)
+      .legalFor({S32, S16, V2S16})
       .clampMaxNumElementsStrict(0, S16, 2)
+      .scalarize(0)
+      .minScalar(0, S16)
       .widenScalarToNextMultipleOf(0, 32)
-      .maxScalar(0, S32)
-      .scalarize(0);
+      .custom();
+    assert(ST.hasMad64_32());

     getActionDefinitionsBuilder({G_UADDSAT, G_USUBSAT, G_SADDSAT, G_SSUBSAT})
       .legalFor({S32, S16, V2S16}) // Clamp modifier
```
```diff
@@ -546,13 +555,21 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
       .widenScalarToNextPow2(0, 32)
       .lower();
   } else if (ST.has16BitInsts()) {
-    getActionDefinitionsBuilder({G_ADD, G_SUB, G_MUL})
+    getActionDefinitionsBuilder({G_ADD, G_SUB})
       .legalFor({S32, S16})
       .minScalar(0, S16)
       .widenScalarToNextMultipleOf(0, 32)
       .maxScalar(0, S32)
       .scalarize(0);

+    getActionDefinitionsBuilder(G_MUL)
+      .legalFor({S32, S16})
+      .scalarize(0)
+      .minScalar(0, S16)
+      .widenScalarToNextMultipleOf(0, 32)
+      .custom();
+    assert(ST.hasMad64_32());
+
     // Technically the saturating operations require clamp bit support, but this
     // was introduced at the same time as 16-bit operations.
     getActionDefinitionsBuilder({G_UADDSAT, G_USUBSAT})
```
```diff
@@ -569,12 +586,23 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
       .scalarize(0)
       .lower();
   } else {
-    getActionDefinitionsBuilder({G_ADD, G_SUB, G_MUL})
+    getActionDefinitionsBuilder({G_ADD, G_SUB})
       .legalFor({S32})
       .widenScalarToNextMultipleOf(0, 32)
       .clampScalar(0, S32, S32)
       .scalarize(0);

+    auto &Mul = getActionDefinitionsBuilder(G_MUL)
+      .legalFor({S32})
+      .scalarize(0)
+      .minScalar(0, S32)
+      .widenScalarToNextMultipleOf(0, 32);
+
+    if (ST.hasMad64_32())
+      Mul.custom();
+    else
+      Mul.maxScalar(0, S32);
+
     if (ST.hasIntClamp()) {
       getActionDefinitionsBuilder({G_UADDSAT, G_USUBSAT})
         .legalFor({S32}) // Clamp modifier.
```
```diff
@@ -1763,6 +1791,8 @@ bool AMDGPULegalizerInfo::legalizeCustom(LegalizerHelper &Helper,
     return legalizeFFloor(MI, MRI, B);
   case TargetOpcode::G_BUILD_VECTOR:
     return legalizeBuildVector(MI, MRI, B);
+  case TargetOpcode::G_MUL:
+    return legalizeMul(Helper, MI);
   case TargetOpcode::G_CTLZ:
   case TargetOpcode::G_CTTZ:
     return legalizeCTLZ_CTTZ(MI, MRI, B);
```
```diff
@@ -2861,6 +2891,299 @@ bool AMDGPULegalizerInfo::legalizeBuildVector(
   return true;
 }

+// Build a big integer multiply or multiply-add using MAD_64_32 instructions.
+//
+// Source and accumulation registers must all be 32-bits.
+//
+// TODO: When the multiply is uniform, we should produce a code sequence
+// that is better suited to instruction selection on the SALU. Instead of
+// the outer loop going over parts of the result, the outer loop should go
+// over parts of one of the factors. This should result in instruction
+// selection that makes full use of S_ADDC_U32 instructions.
+void AMDGPULegalizerInfo::buildMultiply(
+    LegalizerHelper &Helper, MutableArrayRef<Register> Accum,
+    ArrayRef<Register> Src0, ArrayRef<Register> Src1,
+    bool UsePartialMad64_32, bool SeparateOddAlignedProducts) const {
+  // Use (possibly empty) vectors of S1 registers to represent the set of
+  // carries from one pair of positions to the next.
+  using Carry = SmallVector<Register, 2>;
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+
+  const LLT S1 = LLT::scalar(1);
+  const LLT S32 = LLT::scalar(32);
+  const LLT S64 = LLT::scalar(64);
+
+  Register Zero32;
+  Register Zero64;
+
+  auto getZero32 = [&]() -> Register {
+    if (!Zero32)
+      Zero32 = B.buildConstant(S32, 0).getReg(0);
+    return Zero32;
+  };
+  auto getZero64 = [&]() -> Register {
+    if (!Zero64)
+      Zero64 = B.buildConstant(S64, 0).getReg(0);
+    return Zero64;
+  };
+
+  // Merge the given carries into the 32-bit LocalAccum, which is modified
+  // in-place.
+  //
+  // Returns the carry-out, which is a single S1 register or null.
+  auto mergeCarry =
+      [&](Register &LocalAccum, const Carry &CarryIn) -> Register {
+        if (CarryIn.empty())
+          return Register();
+
+        bool HaveCarryOut = true;
+        Register CarryAccum;
+        if (CarryIn.size() == 1) {
+          if (!LocalAccum) {
+            LocalAccum = B.buildZExt(S32, CarryIn[0]).getReg(0);
+            return Register();
+          }
+
+          CarryAccum = getZero32();
+        } else {
+          CarryAccum = B.buildZExt(S32, CarryIn[0]).getReg(0);
+          for (unsigned i = 1; i + 1 < CarryIn.size(); ++i) {
+            CarryAccum =
+                B.buildUAdde(S32, S1, CarryAccum, getZero32(), CarryIn[i])
+                    .getReg(0);
+          }
+
+          if (!LocalAccum) {
+            LocalAccum = getZero32();
+            HaveCarryOut = false;
+          }
+        }
+
+        auto Add =
+            B.buildUAdde(S32, S1, CarryAccum, LocalAccum, CarryIn.back());
+        LocalAccum = Add.getReg(0);
+        return HaveCarryOut ? Add.getReg(1) : Register();
+      };
+
+  // Build a multiply-add chain to compute
+  //
+  //   LocalAccum + (partial products at DstIndex)
+  //       + (opportunistic subset of CarryIn)
+  //
+  // LocalAccum is an array of one or two 32-bit registers that are updated
+  // in-place. The incoming registers may be null.
+  //
+  // In some edge cases, carry-ins can be consumed "for free". In that case,
+  // the consumed carry bits are removed from CarryIn in-place.
+  auto buildMadChain =
+      [&](MutableArrayRef<Register> LocalAccum, unsigned DstIndex,
+          Carry &CarryIn) -> Carry {
+        assert((DstIndex + 1 < Accum.size() && LocalAccum.size() == 2) ||
+               (DstIndex + 1 >= Accum.size() && LocalAccum.size() == 1));
+
+        Carry CarryOut;
+        unsigned j0 = 0;
+
+        // Use plain 32-bit multiplication for the most significant part of
+        // the result by default.
+        if (LocalAccum.size() == 1 &&
+            (!UsePartialMad64_32 || !CarryIn.empty())) {
+          do {
+            unsigned j1 = DstIndex - j0;
+            auto Mul = B.buildMul(S32, Src0[j0], Src1[j1]);
+            if (!LocalAccum[0]) {
+              LocalAccum[0] = Mul.getReg(0);
+            } else {
+              if (CarryIn.empty()) {
+                LocalAccum[0] = B.buildAdd(S32, LocalAccum[0], Mul).getReg(0);
+              } else {
+                LocalAccum[0] =
+                    B.buildUAdde(S32, S1, LocalAccum[0], Mul, CarryIn.back())
+                        .getReg(0);
+                CarryIn.pop_back();
+              }
+            }
+            ++j0;
+          } while (j0 <= DstIndex &&
+                   (!UsePartialMad64_32 || !CarryIn.empty()));
+        }
+
+        // Build full 64-bit multiplies.
+        if (j0 <= DstIndex) {
+          bool HaveSmallAccum = false;
+          Register Tmp;
+
+          if (LocalAccum[0]) {
+            if (LocalAccum.size() == 1) {
+              Tmp = B.buildAnyExt(S64, LocalAccum[0]).getReg(0);
+              HaveSmallAccum = true;
+            } else if (LocalAccum[1]) {
+              Tmp = B.buildMerge(S64, LocalAccum).getReg(0);
+              HaveSmallAccum = false;
+            } else {
+              Tmp = B.buildZExt(S64, LocalAccum[0]).getReg(0);
+              HaveSmallAccum = true;
+            }
+          } else {
+            assert(LocalAccum.size() == 1 || !LocalAccum[1]);
+            Tmp = getZero64();
+            HaveSmallAccum = true;
+          }
+
+          do {
+            unsigned j1 = DstIndex - j0;
+            auto Mad = B.buildInstr(AMDGPU::G_AMDGPU_MAD_U64_U32, {S64, S1},
+                                    {Src0[j0], Src1[j1], Tmp});
+            Tmp = Mad.getReg(0);
+            if (!HaveSmallAccum)
+              CarryOut.push_back(Mad.getReg(1));
+            HaveSmallAccum = false;
+            ++j0;
+          } while (j0 <= DstIndex);
+
+          auto Unmerge = B.buildUnmerge(S32, Tmp);
+          LocalAccum[0] = Unmerge.getReg(0);
+          if (LocalAccum.size() > 1)
+            LocalAccum[1] = Unmerge.getReg(1);
+        }
+
+        return CarryOut;
+      };
+
+  // Outer multiply loop, iterating over destination parts from least
+  // significant to most significant parts.
+  //
+  // The columns of the following diagram correspond to the destination parts
+  // affected by one iteration of the outer loop (ignoring boundary
+  // conditions).
+  //
+  //   Dest index relative to 2 * i:      1 0 -1
+  //                                      ------
+  //   Carries from previous iteration:     e o
+  //   Even-aligned partial product sum:  E E .
+  //   Odd-aligned partial product sum:    O O
+  //
+  // 'o' is OddCarry, 'e' is EvenCarry.
+  // EE and OO are computed from partial products via buildMadChain and use
+  // accumulation where possible and appropriate.
+  //
+  Register SeparateOddCarry;
+  Carry EvenCarry;
+  Carry OddCarry;
+
+  for (unsigned i = 0; i <= Accum.size() / 2; ++i) {
+    Carry OddCarryIn = std::move(OddCarry);
+    Carry EvenCarryIn = std::move(EvenCarry);
+    OddCarry.clear();
+    EvenCarry.clear();
+
+    // Partial products at offset 2 * i.
+    if (2 * i < Accum.size()) {
+      auto LocalAccum = Accum.drop_front(2 * i).take_front(2);
+      EvenCarry = buildMadChain(LocalAccum, 2 * i, EvenCarryIn);
+    }
+
+    // Partial products at offset 2 * i - 1.
+    if (i > 0) {
+      if (!SeparateOddAlignedProducts) {
+        auto LocalAccum = Accum.drop_front(2 * i - 1).take_front(2);
+        OddCarry = buildMadChain(LocalAccum, 2 * i - 1, OddCarryIn);
+      } else {
+        bool IsHighest = 2 * i >= Accum.size();
+        Register SeparateOddOut[2];
+        auto LocalAccum = makeMutableArrayRef(SeparateOddOut)
+                              .take_front(IsHighest ? 1 : 2);
+        OddCarry = buildMadChain(LocalAccum, 2 * i - 1, OddCarryIn);
+
+        MachineInstr *Lo;
+
+        if (i == 1) {
+          if (!IsHighest)
+            Lo = B.buildUAddo(S32, S1, Accum[2 * i - 1], SeparateOddOut[0]);
+          else
+            Lo = B.buildAdd(S32, Accum[2 * i - 1], SeparateOddOut[0]);
+        } else {
+          Lo = B.buildUAdde(S32, S1, Accum[2 * i - 1], SeparateOddOut[0],
+                            SeparateOddCarry);
+        }
+        Accum[2 * i - 1] = Lo->getOperand(0).getReg();
+
+        if (!IsHighest) {
+          auto Hi = B.buildUAdde(S32, S1, Accum[2 * i], SeparateOddOut[1],
+                                 Lo->getOperand(1).getReg());
+          Accum[2 * i] = Hi.getReg(0);
+          SeparateOddCarry = Hi.getReg(1);
+        }
+      }
+    }
+
+    // Add in the carries from the previous iteration
+    if (i > 0) {
+      if (Register CarryOut = mergeCarry(Accum[2 * i - 1], OddCarryIn))
+        EvenCarryIn.push_back(CarryOut);
+
+      if (2 * i < Accum.size()) {
+        if (Register CarryOut = mergeCarry(Accum[2 * i], EvenCarryIn))
+          OddCarry.push_back(CarryOut);
+      }
+    }
+  }
+}
+
+// Custom narrowing of wide multiplies using wide multiply-add instructions.
+//
+// TODO: If the multiply is followed by an addition, we should attempt to
+// integrate it to make better use of V_MAD_U64_U32's multiply-add capabilities.
+bool AMDGPULegalizerInfo::legalizeMul(LegalizerHelper &Helper,
+                                      MachineInstr &MI) const {
+  assert(ST.hasMad64_32());
+  assert(MI.getOpcode() == TargetOpcode::G_MUL);
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(1).getReg();
+  Register Src1 = MI.getOperand(2).getReg();
+
+  LLT Ty = MRI.getType(DstReg);
+  assert(Ty.isScalar());
+
+  unsigned Size = Ty.getSizeInBits();
+  unsigned NumParts = Size / 32;
+  assert((Size % 32) == 0);
+  assert(NumParts >= 2);
+
+  // Whether to use MAD_64_32 for partial products whose high half is
+  // discarded. This avoids some ADD instructions but risks false dependency
+  // stalls on some subtargets in some cases.
+  const bool UsePartialMad64_32 = ST.getGeneration() < AMDGPUSubtarget::GFX10;
+
+  // Whether to compute odd-aligned partial products separately. This is
+  // advisable on subtargets where the accumulator of MAD_64_32 must be placed
+  // in an even-aligned VGPR.
+  const bool SeparateOddAlignedProducts = ST.hasFullRate64Ops();
+
+  LLT S32 = LLT::scalar(32);
+  SmallVector<Register, 2> Src0Parts, Src1Parts;
+  for (unsigned i = 0; i < NumParts; ++i) {
+    Src0Parts.push_back(MRI.createGenericVirtualRegister(S32));
+    Src1Parts.push_back(MRI.createGenericVirtualRegister(S32));
+  }
+  B.buildUnmerge(Src0Parts, Src0);
+  B.buildUnmerge(Src1Parts, Src1);
+
+  SmallVector<Register, 2> AccumRegs(NumParts);
+  buildMultiply(Helper, AccumRegs, Src0Parts, Src1Parts, UsePartialMad64_32,
+                SeparateOddAlignedProducts);
+
+  B.buildMerge(DstReg, AccumRegs);
+  MI.eraseFromParent();
+  return true;
+}
+
 // Legalize ctlz/cttz to ffbh/ffbl instead of the default legalization to
 // ctlz/cttz_zero_undef. This allows us to fix up the result for the zero input
 // case with a single min instruction instead of a compare+select.
```

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h

Lines changed: 6 additions & 0 deletions
```diff
@@ -88,6 +88,12 @@ class AMDGPULegalizerInfo final : public LegalizerInfo {

   bool legalizeBuildVector(MachineInstr &MI, MachineRegisterInfo &MRI,
                            MachineIRBuilder &B) const;
+
+  void buildMultiply(LegalizerHelper &Helper, MutableArrayRef<Register> Accum,
+                     ArrayRef<Register> Src0, ArrayRef<Register> Src1,
+                     bool UsePartialMad64_32,
+                     bool SeparateOddAlignedProducts) const;
+  bool legalizeMul(LegalizerHelper &Helper, MachineInstr &MI) const;
   bool legalizeCTLZ_CTTZ(MachineInstr &MI, MachineRegisterInfo &MRI,
                          MachineIRBuilder &B) const;
```