
[Scalar] Dedicated pass for identifying redundant operations on packed bytes #146364


Open · wants to merge 2 commits into main

Conversation

zGoldthorpe
Contributor

This pass addresses a number of cases where integers are treated as packed vectors of bytes that are needlessly unpacked and repacked.

Although some of these patterns can be handled by instcombine, they are easily obscured by intervening cast instructions. For example, consider the function below, which concatenates two 32-bit integers into a single 64-bit result.

; see llvm/test/Transforms/PackedIntegerCombine/int2int.ll
define i64 @combine(i32 %bot, i32 %top) {
  %base = zext i32 %bot to i64

  %mask.0 = and i32 %top, 255
  %zext.0 = zext i32 %mask.0 to i64
  %get.0 = shl i64 %zext.0, 32
  %out.0 = or i64 %base, %get.0

  %shr.1 = lshr i32 %top, 8
  %mask.1 = and i32 %shr.1, 255
  %zext.1 = zext i32 %mask.1 to i64
  %get.1 = shl i64 %zext.1, 40
  %out.1 = or i64 %out.0, %get.1

  %shr.2 = lshr i32 %top, 16
  %mask.2 = and i32 %shr.2, 255
  %zext.2 = zext i32 %mask.2 to i64
  %get.2 = shl i64 %zext.2, 48
  %out.2 = or i64 %out.1, %get.2

  %shr.3 = lshr i32 %top, 24
  %mask.3 = and i32 %shr.3, 255
  %zext.3 = zext i32 %mask.3 to i64
  %get.3 = shl i64 %zext.3, 56
  %out.3 = or i64 %out.2, %get.3

  ret i64 %out.3
}

These instructions may be produced when %top is treated as a packed integer representing a <4 x i8> whose elements are inserted into a larger packed integer, after the lowest four bytes have already been taken from %bot.
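
For reference, here is a hand-written sketch of the fold one would hope for (the function name @combine.reduced is made up for illustration; this is not the PR's actual test output): the four byte-wise inserts from %top collapse into a single zext, shl, and or.

define i64 @combine.reduced(i32 %bot, i32 %top) {
  ; Placing bytes 0-3 of %top at bit offsets 32, 40, 48, and 56 is the same
  ; as placing all of %top into the upper 32 bits of the result.
  %base = zext i32 %bot to i64
  %wide = zext i32 %top to i64
  %high = shl i64 %wide, 32
  %out  = or i64 %base, %high
  ret i64 %out
}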

There are also similar cases that cannot currently be reduced because of the tension between actual fixed-width vector types and integer types. For example, in the function below, a <4 x i8> is recast into a packed i32 representation.

; see llvm/test/Transforms/PackedIntegerCombine/vec2int.ll
define i32 @extract_i32(<4 x i8> %from) {
  %mask.0 = extractelement <4 x i8> %from, i64 0
  %get.0 = zext i8 %mask.0 to i32

  %mask.1 = extractelement <4 x i8> %from, i64 1
  %zext.1 = zext i8 %mask.1 to i32
  %get.1 = shl i32 %zext.1, 8
  %out.1 = or i32 %get.0, %get.1

  %mask.2 = extractelement <4 x i8> %from, i64 2
  %zext.2 = zext i8 %mask.2 to i32
  %get.2 = shl i32 %zext.2, 16

  %mask.3 = extractelement <4 x i8> %from, i64 3
  %zext.3 = zext i8 %mask.3 to i32
  %get.3 = shl i32 %zext.3, 24
  %out.2 = or i32 %get.2, %get.3

  %out = or i32 %out.1, %out.2
  ret i32 %out
}
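
For comparison, a hand-written sketch of the reduction one would expect on a little-endian target (the function name @extract_i32.reduced is made up for illustration; this is not the PR's actual test output): the entire extraction sequence amounts to a single bitcast.

define i32 @extract_i32.reduced(<4 x i8> %from) {
  ; On little-endian targets, element 0 of the vector occupies the least
  ; significant byte of the i32, matching the shift amounts used above.
  %out = bitcast <4 x i8> %from to i32
  ret i32 %out
}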

This pass addresses these patterns (and others) uniformly by tracking individual bytes of integer and vector-of-integer values through various instructions, identifying opportunities to eliminate unnecessary unpacking and repacking, and rewriting the code accordingly.

@llvmbot
Member

llvmbot commented Jun 30, 2025

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: None (zGoldthorpe)


Patch is 215.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/146364.diff

26 Files Affected:

  • (modified) llvm/include/llvm/InitializePasses.h (+1)
  • (modified) llvm/include/llvm/Transforms/Scalar.h (+7)
  • (added) llvm/include/llvm/Transforms/Scalar/PackedIntegerCombinePass.h (+32)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+7)
  • (modified) llvm/lib/Passes/PassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+9-2)
  • (modified) llvm/lib/Transforms/Scalar/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/Scalar/PackedIntegerCombinePass.cpp (+1936)
  • (modified) llvm/lib/Transforms/Scalar/Scalar.cpp (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+11-94)
  • (modified) llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll (+83-85)
  • (modified) llvm/test/Other/new-pm-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll (+1)
  • (added) llvm/test/Transforms/PackedIntegerCombine/instructions.ll (+601)
  • (added) llvm/test/Transforms/PackedIntegerCombine/int2int.ll (+302)
  • (added) llvm/test/Transforms/PackedIntegerCombine/int2vec.ll (+393)
  • (added) llvm/test/Transforms/PackedIntegerCombine/vec2int.ll (+480)
  • (added) llvm/test/Transforms/PackedIntegerCombine/vec2vec.ll (+294)
diff --git a/llvm/include/llvm/InitializePasses.h b/llvm/include/llvm/InitializePasses.h
index 1c4ed3843b390..7e934e635c063 100644
--- a/llvm/include/llvm/InitializePasses.h
+++ b/llvm/include/llvm/InitializePasses.h
@@ -237,6 +237,7 @@ initializeOptimizationRemarkEmitterWrapperPassPass(PassRegistry &);
 LLVM_ABI void initializeOptimizePHIsLegacyPass(PassRegistry &);
 LLVM_ABI void initializePEILegacyPass(PassRegistry &);
 LLVM_ABI void initializePHIEliminationPass(PassRegistry &);
+LLVM_ABI void initializePackedIntegerCombineLegacyPassPass(PassRegistry &);
 LLVM_ABI void initializePartiallyInlineLibCallsLegacyPassPass(PassRegistry &);
 LLVM_ABI void initializePatchableFunctionLegacyPass(PassRegistry &);
 LLVM_ABI void initializePeepholeOptimizerLegacyPass(PassRegistry &);
diff --git a/llvm/include/llvm/Transforms/Scalar.h b/llvm/include/llvm/Transforms/Scalar.h
index 1398f171b0f78..ec9d89507c375 100644
--- a/llvm/include/llvm/Transforms/Scalar.h
+++ b/llvm/include/llvm/Transforms/Scalar.h
@@ -154,6 +154,13 @@ LLVM_ABI FunctionPass *
 createInferAddressSpacesPass(unsigned AddressSpace = ~0u);
 LLVM_ABI extern char &InferAddressSpacesID;
 
+//===----------------------------------------------------------------------===//
+//
+// PackedIntegerCombinePass - Tracks individual bytes through instructions to
+// systematically identify redundant byte packing or unpacking operations.
+//
+LLVM_ABI FunctionPass *createPackedIntegerCombinePass();
+
 //===----------------------------------------------------------------------===//
 //
 // PartiallyInlineLibCalls - Tries to inline the fast path of library
diff --git a/llvm/include/llvm/Transforms/Scalar/PackedIntegerCombinePass.h b/llvm/include/llvm/Transforms/Scalar/PackedIntegerCombinePass.h
new file mode 100644
index 0000000000000..a5916e2e611cf
--- /dev/null
+++ b/llvm/include/llvm/Transforms/Scalar/PackedIntegerCombinePass.h
@@ -0,0 +1,32 @@
+//===- PackedIntegerCombinePass.h -------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+/// \file
+///
+/// This file provides the interface for LLVM's Packed Integer Combine pass.
+/// This pass tries to treat integers as packed chunks of individual bytes,
+/// and leverage this to coalesce needlessly fragmented
+/// computations.
+///
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_SCALAR_PACKEDINTCOMBINE_H
+#define LLVM_TRANSFORMS_SCALAR_PACKEDINTCOMBINE_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class PackedIntegerCombinePass
+    : public PassInfoMixin<PackedIntegerCombinePass> {
+public:
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
+} // end namespace llvm
+
+#endif // LLVM_TRANSFORMS_SCALAR_PACKEDINTCOMBINE_H
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 0697a0a6b4c74..7a382ace34dbc 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -313,6 +313,7 @@
 #include "llvm/Transforms/Scalar/MergedLoadStoreMotion.h"
 #include "llvm/Transforms/Scalar/NaryReassociate.h"
 #include "llvm/Transforms/Scalar/NewGVN.h"
+#include "llvm/Transforms/Scalar/PackedIntegerCombinePass.h"
 #include "llvm/Transforms/Scalar/PartiallyInlineLibCalls.h"
 #include "llvm/Transforms/Scalar/PlaceSafepoints.h"
 #include "llvm/Transforms/Scalar/Reassociate.h"
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index c83d2dc1f1514..2da72606bc47a 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -121,6 +121,7 @@
 #include "llvm/Transforms/Scalar/MemCpyOptimizer.h"
 #include "llvm/Transforms/Scalar/MergedLoadStoreMotion.h"
 #include "llvm/Transforms/Scalar/NewGVN.h"
+#include "llvm/Transforms/Scalar/PackedIntegerCombinePass.h"
 #include "llvm/Transforms/Scalar/Reassociate.h"
 #include "llvm/Transforms/Scalar/SCCP.h"
 #include "llvm/Transforms/Scalar/SROA.h"
@@ -542,6 +543,9 @@ PassBuilder::buildO1FunctionSimplificationPipeline(OptimizationLevel Level,
   // opportunities that creates).
   FPM.addPass(BDCEPass());
 
+  // Simplify bit-packed operations before cleaning up with instcombine.
+  FPM.addPass(PackedIntegerCombinePass());
+
   // Run instcombine after redundancy and dead bit elimination to exploit
   // opportunities opened up by them.
   FPM.addPass(InstCombinePass());
@@ -743,6 +747,9 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
   // opportunities that creates).
   FPM.addPass(BDCEPass());
 
+  // Simplify bit-packed operations before cleaning up with instcombine.
+  FPM.addPass(PackedIntegerCombinePass());
+
   // Run instcombine after redundancy and dead bit elimination to exploit
   // opportunities opened up by them.
   FPM.addPass(InstCombinePass());
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 65276489e6f02..6f1c405a5efa7 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -476,6 +476,7 @@ FUNCTION_PASS("objc-arc", ObjCARCOptPass())
 FUNCTION_PASS("objc-arc-contract", ObjCARCContractPass())
 FUNCTION_PASS("objc-arc-expand", ObjCARCExpandPass())
 FUNCTION_PASS("pa-eval", PAEvalPass())
+FUNCTION_PASS("packedintcombine", PackedIntegerCombinePass())
 FUNCTION_PASS("partially-inline-libcalls", PartiallyInlineLibCallsPass())
 FUNCTION_PASS("pgo-memop-opt", PGOMemOPSizeOpt())
 FUNCTION_PASS("place-safepoints", PlaceSafepointsPass())
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index c3536113e9bef..8f0ef348fe778 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -104,6 +104,7 @@
 #include "llvm/Transforms/Scalar/InferAddressSpaces.h"
 #include "llvm/Transforms/Scalar/LoopDataPrefetch.h"
 #include "llvm/Transforms/Scalar/NaryReassociate.h"
+#include "llvm/Transforms/Scalar/PackedIntegerCombinePass.h"
 #include "llvm/Transforms/Scalar/SeparateConstOffsetFromGEP.h"
 #include "llvm/Transforms/Scalar/Sink.h"
 #include "llvm/Transforms/Scalar/StraightLineStrengthReduce.h"
@@ -1378,8 +1379,11 @@ void AMDGPUPassConfig::addCodeGenPrepare() {
 
   TargetPassConfig::addCodeGenPrepare();
 
-  if (isPassEnabled(EnableLoadStoreVectorizer))
+  if (isPassEnabled(EnableLoadStoreVectorizer)) {
     addPass(createLoadStoreVectorizerPass());
+    // LSV pass opens up more opportunities for packed integer combining.
+    addPass(createPackedIntegerCombinePass());
+  }
 
   // LowerSwitch pass may introduce unreachable blocks that can
   // cause unexpected behavior for subsequent passes. Placing it
@@ -2101,8 +2105,11 @@ void AMDGPUCodeGenPassBuilder::addCodeGenPrepare(AddIRPass &addPass) const {
 
   Base::addCodeGenPrepare(addPass);
 
-  if (isPassEnabled(EnableLoadStoreVectorizer))
+  if (isPassEnabled(EnableLoadStoreVectorizer)) {
     addPass(LoadStoreVectorizerPass());
+    // LSV pass opens up more opportunities for packed integer combining.
+    addPass(PackedIntegerCombinePass());
+  }
 
   // LowerSwitch pass may introduce unreachable blocks that can cause unexpected
   // behavior for subsequent passes. Placing it here seems better that these
diff --git a/llvm/lib/Transforms/Scalar/CMakeLists.txt b/llvm/lib/Transforms/Scalar/CMakeLists.txt
index 84a5b02043d01..d45a3785f9f8f 100644
--- a/llvm/lib/Transforms/Scalar/CMakeLists.txt
+++ b/llvm/lib/Transforms/Scalar/CMakeLists.txt
@@ -61,6 +61,7 @@ add_llvm_component_library(LLVMScalarOpts
   NaryReassociate.cpp
   NewGVN.cpp
   PartiallyInlineLibCalls.cpp
+  PackedIntegerCombinePass.cpp
   PlaceSafepoints.cpp
   Reassociate.cpp
   Reg2Mem.cpp
diff --git a/llvm/lib/Transforms/Scalar/PackedIntegerCombinePass.cpp b/llvm/lib/Transforms/Scalar/PackedIntegerCombinePass.cpp
new file mode 100644
index 0000000000000..31edd28069a2b
--- /dev/null
+++ b/llvm/lib/Transforms/Scalar/PackedIntegerCombinePass.cpp
@@ -0,0 +1,1936 @@
+//===- PackedIntegerCombinePass.h -------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+///
+/// This file provides the interface for LLVM's Packed Integer Combine pass.
+/// This pass tries to treat integers as packed chunks of individual bytes,
+/// and leverage this to coalesce needlessly fragmented
+/// computations.
+///
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Transforms/Scalar/PackedIntegerCombinePass.h"
+#include "llvm/ADT/PostOrderIterator.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallBitVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstVisitor.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Transforms/Scalar.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "packedintcombine"
+
+static cl::opt<unsigned> MaxCollectionIterations(
+    "packedint-max-iterations",
+    cl::desc("Maximum number of iterations to isolate final packed "
+             "instructions. Set to 0 to iterate until convergence."),
+    cl::init(2), cl::Hidden);
+
+static cl::opt<bool>
+    AggressiveRewriting("packedint-aggressive-rewriter",
+                        cl::desc("Aggressively rewrite packed instructions."),
+                        cl::init(false), cl::Hidden);
+
+namespace {
+
+/// Reference to either a constant byte, or a byte extracted from an IR value.
+class Byte {
+  /// The base value from which the byte is obtained.
+  Value *Base;
+
+  /// If the base value is not null, then this holds the index of the byte
+  /// being used, where 0 is the least significant byte.
+  /// Otherwise, this is treated as a constant byte.
+  unsigned Integer;
+
+public:
+  static constexpr unsigned BitWidth = 8;
+  static constexpr unsigned AllOnes = 0xff;
+
+  /// Construct a byte from a well-defined IR value.
+  explicit Byte(Value &Base, unsigned Index) : Base(&Base), Integer(Index) {}
+
+  /// Construct a constant byte.
+  explicit Byte(unsigned Constant) : Base(nullptr), Integer(Constant) {
+    assert(Constant <= AllOnes && "Constant is too large to fit in a byte.");
+  }
+
+  /// Construct a constant byte that is fully set.
+  static Byte ones() { return Byte(Byte::AllOnes); }
+  /// Construct the zero byte.
+  static Byte zeroes() { return Byte(0); }
+
+  /// Indicate whether the byte is a known integer constant.
+  /// Note that poison or undef base values are not recognised as constant.
+  bool isConstant() const { return !Base; }
+
+  /// Get the constant byte value.
+  unsigned getConstant() const {
+    assert(isConstant() && "Expected a constant byte.");
+    return Integer;
+  }
+
+  /// Get the base IR value from which this byte is obtained.
+  Value *getBase() const {
+    assert(!isConstant() && "Byte constants do not have a base value.");
+    return Base;
+  }
+
+  /// Get the byte offset of the IR value referenced by the byte.
+  unsigned getIndex() const {
+    assert(!isConstant() && "Byte constants are not indexed.");
+    return Integer;
+  }
+
+  bool operator==(const Byte &Other) const {
+    return Base == Other.Base && Integer == Other.Integer;
+  }
+
+  void print(raw_ostream &ROS, bool NewLine = true) const {
+    if (isConstant())
+      ROS << "const";
+    else
+      Base->printAsOperand(ROS, false);
+
+    ROS << "[" << Integer << "]";
+
+    if (NewLine)
+      ROS << "\n";
+  }
+
+  LLVM_DUMP_METHOD void dump() const { print(errs(), true); }
+};
+
+inline raw_ostream &operator<<(raw_ostream &ROS, const Byte &B) {
+  B.print(ROS, false);
+  return ROS;
+}
+
+/// Convenience data structure for describing the layout of bytes for vector and
+/// integer types, treating integer types as singleton vectors.
+struct ByteLayout {
+  /// The number of bytes that fit in a single element.
+  unsigned NumBytesPerElement;
+  /// The number of vector elements (or 1, if the type is an integer type).
+  unsigned NumVecElements;
+
+  /// Get the total number of bytes held by the vector or integer type.
+  unsigned getNumBytes() const { return NumBytesPerElement * NumVecElements; }
+};
+
+/// Interpret the given type as a number of packed bytes, if possible.
+static std::optional<ByteLayout> tryGetByteLayout(const Type *Ty) {
+  unsigned IntBitWidth, NumElts;
+  if (const auto *IntTy = dyn_cast<IntegerType>(Ty)) {
+    IntBitWidth = IntTy->getBitWidth();
+    NumElts = 1;
+  } else if (const auto *VecTy = dyn_cast<FixedVectorType>(Ty)) {
+    const auto *IntTy = dyn_cast<IntegerType>(VecTy->getElementType());
+    if (!IntTy)
+      return std::nullopt;
+    IntBitWidth = IntTy->getBitWidth();
+    NumElts = VecTy->getNumElements();
+  } else
+    return std::nullopt;
+
+  if (IntBitWidth % Byte::BitWidth != 0)
+    return std::nullopt;
+
+  return ByteLayout{IntBitWidth / Byte::BitWidth, NumElts};
+}
+
+/// Interpret the given type as a number of packed bytes (aborts if impossible).
+static ByteLayout getByteLayout(const Type *Ty) {
+  const std::optional<ByteLayout> Layout = tryGetByteLayout(Ty);
+  assert(Layout);
+  return *Layout;
+}
+
+/// A convenience class for combining Byte instances obtained from the same base
+/// value, and with a common relative offset, which can hence be obtained
+/// simultaneously.
+struct CoalescedBytes {
+  /// The value from which the coalesced bytes are all derived. This pointer is
+  /// never null.
+  Value *Base;
+  /// The number of bytes to shift right to align the coalesced bytes with the
+  /// target value.
+  ///
+  /// For instance, if bytes 3, 4, 5 of some value %val are coalesced to provide
+  /// bytes 0, 1, 2 of the target %tgt, then ShrByteOffset = 3.
+  signed SignedShrByteOffset;
+  /// The bitmask identifying which bytes of the target value are covered by
+  /// these coalesced bytes.
+  ///
+  /// For instance, if bytes 3, 4, 5 of some value %val are coalesced to provide
+  /// bytes 0, 1, 2 of the target %tgt, then this mask's first three bits will
+  /// be set, corresponding to the first three bits of %tgt.
+  SmallBitVector Mask;
+
+  explicit CoalescedBytes(Value &Base, signed Offset, SmallBitVector Mask)
+      : Base(&Base), SignedShrByteOffset(Offset), Mask(Mask) {}
+  explicit CoalescedBytes(Value &Base, signed Offset, unsigned NumBytes)
+      : Base(&Base), SignedShrByteOffset(Offset), Mask(NumBytes) {}
+
+  bool alignsWith(Value *V, signed VOffset) const {
+    return Base == V && SignedShrByteOffset == VOffset;
+  }
+
+  /// Get the number of bytes to shift the base value right to align with the
+  /// target value.
+  unsigned getShrBytes() const { return std::max(0, SignedShrByteOffset); }
+
+  /// Get the number of bytes to shift the base value left to align with the
+  /// target value.
+  unsigned getShlBytes() const { return std::max(0, -SignedShrByteOffset); }
+
+  /// Get the number of bits to shift the base value right to align with the
+  /// target value.
+  unsigned getShrBits() const { return getShrBytes() * Byte::BitWidth; }
+
+  /// Get the number of bits to shift the base value left to align with the
+  /// target value.
+  unsigned getShlBits() const { return getShlBytes() * Byte::BitWidth; }
+
+  void print(raw_ostream &ROS, bool NewLine = true) const {
+    ROS << "{ ";
+    for (unsigned Idx = 0; Idx < Mask.size(); ++Idx) {
+      if (Mask.test(Idx)) {
+        Base->printAsOperand(ROS, false);
+        ROS << "[" << (static_cast<int>(Idx) + SignedShrByteOffset) << "]";
+      } else
+        ROS << 0;
+
+      ROS << "; ";
+    }
+    ROS << "}";
+
+    if (NewLine)
+      ROS << "\n";
+  }
+
+  LLVM_DUMP_METHOD void dump() const { print(errs(), true); }
+};
+
+inline raw_ostream &operator<<(raw_ostream &ROS, const CoalescedBytes &CB) {
+  CB.print(ROS, false);
+  return ROS;
+}
+
+/// Association of a Byte (constant or byte extracted from an LLVM Value) to the
+/// operand(s) responsible for producing it. A value of ByteUse::AllOperands
+/// (-1) indicates that all operands are responsible for producing the given
+/// byte.
+class ByteUse {
+
+  Byte B;
+  int OpIdx;
+
+public:
+  /// Sentinel value representing that all operands are responsible for the
+  /// given Byte.
+  static constexpr int AllOperands = -1;
+
+  ByteUse(Byte B, int OpIdx) : B(B), OpIdx(OpIdx) {}
+
+  const Byte &getByte() const { return B; }
+  int getOperandIndex() const { return OpIdx; }
+
+  bool operator==(const ByteUse &BU) const {
+    return BU.B == B && BU.OpIdx == OpIdx;
+  }
+};
+
+using ByteVector = SmallVector<ByteUse, 8>;
+
+/// The decomposition of an IR value into its individual bytes, tracking where
+/// each byte is obtained.
+class ByteDefinition {
+  /// Enum classifying what Ptr points to.
+  enum ByteType : uint8_t {
+    /// Ptr's value is undefined.
+    INVALID,
+    /// The byte definition is given by a ByteVector, which is referenced (but
+    /// not captured) by Ptr.
+    VECTOR,
+    /// The bytes are obtained from a (currently opaque) IR value, held by Ptr.
+    VALUE,
+    /// The bytes are obtained from a constant integer, held by Ptr.
+    CONST_INT,
+    /// The bytes are obtained from a constant vector of integers, held by Ptr.
+    CONST_VEC,
+  };
+
+  ByteType DefType;
+  void *Ptr;
+  ByteLayout Layout;
+  ByteDefinition(ByteType DefType, void *Ptr, ByteLayout Layout)
+      : DefType(DefType), Ptr(Ptr), Layout(Layout) {}
+
+public:
+  /// Indicate that a value cannot be decomposed into bytes in a known way.
+  static ByteDefinition invalid() { return {INVALID, nullptr, {0, 0}}; }
+  /// Indicate that a value's bytes are known, and track their producers.
+  static ByteDefinition vector(ByteVector &Ref, ByteLayout Layout) {
+    return {VECTOR, &Ref, Layout};
+  }
+  /// Indicate that a value's bytes are opaque.
+  static ByteDefinition value(Value &V) {
+    return {VALUE, &V, getByteLayout(V.getType())};
+  }
+  /// Indicate that the bytes come from a constant integer.
+  static ByteDefinition constInt(ConstantInt &Int) {
+    return {CONST_INT, &Int, getByteLayout(Int.getType())};
+  }
+  /// Indicate that the bytes come from a constant vector of integers.
+  static ByteDefinition constVec(Constant &Vec) {
+    assert(Vec.getType()->isVectorTy());
+    return {CONST_VEC, &Vec, getByteLayout(Vec.getType())};
+  }
+
+  ByteVector &getVector() const {
+    assert(DefType == VECTOR);
+    return *static_cast<ByteVector *>(Ptr);
+  }
+  Value &getValue() const {
+    assert(DefType == VALUE);
+    return *static_cast<Value *>(Ptr);
+  }
+  ConstantInt &getConstInt() const {
+    assert(DefType == CONST_INT);
+    return *static_cast<ConstantInt *>(Ptr);
+  }
+  Constant &getConstVec() const {
+    assert(DefType == CONST_VEC);
+    return *static_cast<Constant *>(Ptr);
+  }
+
+  bool isValid() const { return DefType != INVALID; }
+
+  /// Return true iff the byte definition is valid.
+  operator bool() const { return isValid(); }
+
+  /// Get the definition of the byte at the specified byte offset, where 0 is
+  /// the least significant byte.
+  Byte getByte(unsigned Idx) const {
+    switch (DefType) {
+    default:
+      llvm_unreachable("Invalid byte definition");
+    case VECTOR:
+      return getVector()[Idx].getByte();
+    case VALUE:
+      return Byte(getValue(), Idx);
+    case CONST_INT:
+      return Byte(getConstInt().getValue().extractBitsAsZExtValue(
+          Byte::BitWidth, Idx * Byte::BitWidth));
+    case CONST_VEC: {
+      const auto &Vec = getConstVec();
+      const ByteLayout Layout = getByteLayout(Vec.getType());
+      const unsigned VecIdx = Idx / Layout.NumBytesPerElement;
+      const unsigned EltIdx = Idx % Layout.NumBytesPerElement;
+
+      Constant *Elt = Vec.getAggregateElement(VecIdx);
+      if (const auto *Int = dyn_cast<ConstantInt>(Elt))
+        return Byte(Int->getValue().extractBitsAsZExtValue(
+            Byte::BitWidth, EltIdx * Byte::BitWidth));
+
+      return Byte(*Elt,...
[truncated]


⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/include/llvm/Transforms/Scalar/PackedIntegerCombinePass.h llvm/lib/Transforms/Scalar/PackedIntegerCombinePass.cpp llvm/test/Transforms/PackedIntegerCombine/instructions.ll llvm/test/Transforms/PackedIntegerCombine/int2int.ll llvm/test/Transforms/PackedIntegerCombine/int2vec.ll llvm/test/Transforms/PackedIntegerCombine/vec2int.ll llvm/test/Transforms/PackedIntegerCombine/vec2vec.ll llvm/include/llvm/InitializePasses.h llvm/include/llvm/Transforms/Scalar.h llvm/lib/Passes/PassBuilder.cpp llvm/lib/Passes/PassBuilderPipelines.cpp llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp llvm/lib/Transforms/Scalar/Scalar.cpp llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll llvm/test/CodeGen/AMDGPU/llc-pipeline.ll llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll llvm/test/Other/new-pm-defaults.ll llvm/test/Other/new-pm-thinlto-postlink-defaults.ll llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll llvm/test/Other/new-pm-thinlto-prelink-defaults.ll llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll

The following files introduce new uses of undef:

  • llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

@zGoldthorpe
Contributor Author

The undef flagged by the deprecator comes from an old test, and even there the undefs appear only in the FileCheck comments. The closest the actual tests get are %undef vregs, which are defined as frozen poison values.

@arsenm left a comment
Contributor

This needs a lot more phase-ordering justification to introduce a new pass, especially in the middle-end runs. The comment suggests it's there to address a pattern introduced by the LoadStoreVectorizer. This needs some phase-ordering examples (particularly for the new middle-end runs, which I'm assuming we really do not need), and some for the codegen pipeline. There's only one codegen test change that's a regression, and the others likely break the intent of the tests.

Most of the new tests look non-canonical and have the same output as instcombine. One of the examples is an interesting (for AMDGPU) missed vectorization. The extract_i32 example is interesting, but I don't see why VectorCombine wouldn't be the place to handle it.

; NOSDWA-NEXT: v_mov_b32_e32 v2, s2
; NOSDWA-NEXT: v_mov_b32_e32 v3, s3
Contributor

All these changes look like the point of the test was lost

ret i64 %out
}

define i64 @combine(i32 %bot, i32 %top) {
Contributor

This case is a little interesting because other targets vectorize, but we do not. We could equivalently allow vectorization of 2 x i32 bit operations. Should look into fixing the cost model after #140694

Contributor Author

Thank you for your detailed feedback.
Is vectorisation really the right move in this case, as opposed to zext + shl? (This is a case of particular interest.)

Contributor

From the target perspective, it doesn't really matter. In the IR, the shift and zext form is more canonical. In the backend, we end up turning all of these into 32-bit vector operations anyway.
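
For concreteness, the vector form alluded to above might look like the following hand-written illustration (assuming a little-endian target; the function name is made up and this code is not part of the PR):

define i64 @concat_as_vector(i32 %bot, i32 %top) {
  ; Build a <2 x i32> and reinterpret it as an i64; on little-endian targets
  ; this is equivalent to the zext + shl + or form.
  %v0  = insertelement <2 x i32> poison, i32 %bot, i64 0
  %v1  = insertelement <2 x i32> %v0, i32 %top, i64 1
  %out = bitcast <2 x i32> %v1 to i64
  ret i64 %out
}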

Contributor Author

@zGoldthorpe Jul 9, 2025

I've made a new PR to canonicalise this particular pattern in #147737

; RUN: opt -S -passes=packedintcombine %s | FileCheck %s --check-prefix=LAZY
; RUN: opt -S -passes=packedintcombine -packedint-aggressive-rewriter %s | FileCheck %s --check-prefix=AGGRESSIVE

define i32 @extract_i32(<4 x i8> %from) {
Contributor

Why can't VectorCombine deal with this case?

Contributor Author

Taking this suggestion in #147414
