HereThereBeDragons

Motivation

Commit ROCm/TheRock@267c4d9 enabled building hostexec, but hostexec does not build with cmake 4. However, TheRock has promised to build with cmake 4 since ROCm/TheRock#1440, so this needs to be fixed.

Technical Details

Bump the minimum required cmake version for hostexec from 3.0 to 3.20.0 to enable building with cmake 4. This is the same minimum required version that the parent directory "offload" uses.

jmmartinez and others added 30 commits March 20, 2025 11:12
The underlying issue was that I forgot to clean the cache directory
before running the test. So the test ended up running sometimes on a
dirty cache yielding bad fails.

Since the code is only running a single comgr action that only converts
spirv->bc, the contents of the cache should be 2 files:
* the bitcode
* the cache timestamp
In this patch, we add a new action:

AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE

That accepts a set of .spv files, translates them to .bc files,
extracts any embedded @llvm.cmdline flags, and then compiles
to a set of relocatable .o files.
…lvm#1365)

from row #19 in "Mainline for 6.5 Cherry-pick List"
amd-staging commits:
[Comgr][Cache] Fix broken test: spirv-translator-cached.cl · 51fa25b
[Cache][SPIRV] Fix flacky test... again · 56cf45a
Note that this is not an NFC change because the test case
`llvm/test/CodeGen/AMDGPU/amdgpu-spill-cfi-saved-regs.ll` has been
updated due to the recent SGPR layout change. The 32 CSR SGPRs in
`callee_need_to_spill_fp_exec_to_memory` have been adjusted to reflect
this update.

Change-Id: I332a721e7e8feaa5491c63228ecb42759e4d979d
This PR updates the SGPR layout to a striped caller/callee-saved design,
similar to the VGPR layout.

To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame
pointer), and s34 (base pointer) remain callee-saved, the striped layout
starts from s40, with a stripe width of 8. The last stripe is 10 wide
instead of 8 to avoid ending with a 2-wide stripe.

Fixes llvm#113782.

Change-Id: I6fe8fca8b70985a8775ec04d93b460333533d2bb
For the hipBinNVPtr_ and hipBinAMDPtr_ members: the destructor of the base
class was not marked as virtual, but the destructors of the derived
classes are. We delete the object through a pointer to the base class,
so the base class destructor is called but not those of the derived
classes. This results in strange memory behaviour detected by ASAN.

Solves SWDEV-516418
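
The destructor issue described above can be illustrated with a minimal sketch. The `Base` and `Derived` types below are hypothetical stand-ins, not the actual hipBin classes; they only show that marking the base destructor `virtual` makes deletion through a base pointer run the derived destructor as well.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the base class; not the real hipBin type.
struct Base {
  // Virtual, so deleting through a Base* also runs the derived destructor.
  virtual ~Base() = default;
};

// Hypothetical derived class that records when its destructor runs.
struct Derived : Base {
  explicit Derived(int *counter) : counter_(counter) {
    assert(counter_ != nullptr);
  }
  ~Derived() override { ++*counter_; }
  int *counter_;
};

// Destroys a Derived through a pointer to Base and returns how many
// derived destructors ran.
int destroy_through_base() {
  int destroyed = 0;
  {
    std::unique_ptr<Base> p(new Derived(&destroyed));
  }  // p is destroyed here via ~Base()
  return destroyed;
}
```

With the `virtual` keyword in place, `destroy_through_base()` returns 1. Without it, deleting a derived object through a base pointer is undefined behaviour in C++, which is consistent with the ASAN reports mentioned in the commit message.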
Also archive the Comgr V3 Release notes, and start a new document
for Comgr V4 changes.

Change-Id: I25137c174bd70caafe9b3c26d3a956331e0e9dfc
choikwa and others added 23 commits August 5, 2025 16:08
…126058) (llvm#3162)

GlobalISel already handles undefined workitem.id.{x,y,z} intrinsics,
SelDAG failed in AMDGPUISelLowering.cpp due to a failed assertion in
`AMDGPUTargetLowering::loadInputValue`: `Arg && "Attempting to load
missing argument"`. This commit changes the behavior of SelDAG to
instead use a zero constant.

This LLVM defect was identified via the AMD Fuzzing project.

Cherry-picked from bcba311

Fixes the `Arg && "Attempting to load missing argument"` assert in Numba from
SWDEV-543227

Co-authored-by: Robert Imschweiler <50044286+ro-i@users.noreply.github.com>
HIP runtime support for compressed bundle format v3 is in place,
therefore switch the default compressed bundle format to v3
in compiler.

This allows both compressed and decompressed fat binary size
to exceed 4GB by default.

Environment variable COMPRESSED_BUNDLE_FORMAT_VERSION=2 can
be used for backward compatibility for older HIP runtimes
not supporting v3.

Fixes: SWDEV-548879
…t_fail() (llvm#144886) (llvm#3189)

Modifications to reapply the commit:
* Add noexcept only after C++11 on __glibcxx_assert_fail
* Remove vararg version of __glibcxx_assert_fail

And doc CP.

Issue
[SWDEV-518041](https://ontrack-internal.amd.com/browse/SWDEV-518041)
& doc task
[SWDEV-538485](https://ontrack-internal.amd.com/browse/SWDEV-538485)

---------

Co-authored-by: Juan Manuel Martinez Caamaño <jmartinezcaamao@gmail.com>
llvm#3457)…llvm#129037)

When a read(first)lane is used on a binary operator and the intrinsic is
the only user of the operator, we can move the read(first)lane into the
operand if the other operand is uniform.

Unfortunately IC doesn't let us access UniformityAnalysis, so we can't
truly check uniformity. We have to make do with a basic uniformity
check that only allows constants or trivially uniform intrinsic calls.

We can also do the same for unary and cast operators.

Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
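
As a rough illustration of why this rewrite is sound, the sketch below models a wave as a vector of per-lane values. This is a host-side simulation under simplifying assumptions (all lanes active, addition as the binary operator); the names `Wave`, `read_first_lane`, `add_then_read`, and `read_then_add` are hypothetical and not part of LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One value per lane; all lanes are assumed active in this simulation.
using Wave = std::vector<std::uint32_t>;

// Models readfirstlane: returns the value held by the first lane.
std::uint32_t read_first_lane(const Wave &v) {
  assert(!v.empty());
  return v.front();
}

// Before the combine: per-lane add, then read the first lane of the result.
std::uint32_t add_then_read(const Wave &x, std::uint32_t uniform) {
  Wave sum(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    sum[i] = x[i] + uniform;
  return read_first_lane(sum);
}

// After the combine: read the first lane of the operand, then do a
// single scalar add with the uniform operand.
std::uint32_t read_then_add(const Wave &x, std::uint32_t uniform) {
  return read_first_lane(x) + uniform;
}
```

Both functions return the same value for any non-empty wave because the second operand is identical in every lane, which is the condition the commit message requires.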
…#3749)

The workaround will be active only if the system doesn't have PCIe
atomics.

Co-authored-by: Andryeyev, German <German.Andryeyev@amd.com>
…tributor run (llvm#155246) (llvm#3772)

We do not need this in the attributor, because `ST.getWavesPerEU`
accounts for both the waves-per-eu and flat-workgroup-size attributes.
If the waves-per-eu values are not valid, it drops them. In the
attributor, we only need to propagate the values without using
intermediate flat workgroup size values.

Fixes SWDEV-550257.

(cherry picked from commit ca03045)
…d integers. (llvm#3581)

This patch extends the instruction combiner to simplify the construction
of a packed scalar integer from a vector type, such as:
```llvm
target datalayout = "e"

define i32 @src(<4 x i8> %v) {
  %v.0 = extractelement <4 x i8> %v, i32 0
  %z.0 = zext i8 %v.0 to i32

  %v.1 = extractelement <4 x i8> %v, i32 1
  %z.1 = zext i8 %v.1 to i32
  %s.1 = shl i32 %z.1, 8
  %x.1 = or i32 %z.0, %s.1

  %v.2 = extractelement <4 x i8> %v, i32 2
  %z.2 = zext i8 %v.2 to i32
  %s.2 = shl i32 %z.2, 16
  %x.2 = or i32 %x.1, %s.2

  %v.3 = extractelement <4 x i8> %v, i32 3
  %z.3 = zext i8 %v.3 to i32
  %s.3 = shl i32 %z.3, 24
  %x.3 = or i32 %x.2, %s.3

  ret i32 %x.3
}

; ===============

define i32 @tgt(<4 x i8> %v) {
  %x.3 = bitcast <4 x i8> %v to i32
  ret i32 %x.3
}
```

Alive2 proofs (little-endian): [YKdMeg](https://alive2.llvm.org/ce/z/YKdMeg)
Alive2 proofs (big-endian): [vU6iKc](https://alive2.llvm.org/ce/z/vU6iKc)
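
For readers less familiar with LLVM IR, the same equivalence can be sketched in C++ for the little-endian case. This is only an analogy to the `@src` and `@tgt` functions above, not part of the patch; the function names are hypothetical, and `std::memcpy` stands in for the IR `bitcast`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Mirrors @src: zero-extend each byte, shift it by 8*i, and OR the parts.
std::uint32_t pack_with_shifts(const std::array<std::uint8_t, 4> &v) {
  std::uint32_t x = 0;
  for (unsigned i = 0; i < 4; ++i)
    x |= static_cast<std::uint32_t>(v[i]) << (8 * i);
  return x;
}

// Mirrors @tgt: reinterpret the four bytes as one 32-bit integer.
std::uint32_t pack_with_bitcast(const std::array<std::uint8_t, 4> &v) {
  std::uint32_t x;
  std::memcpy(&x, v.data(), sizeof x);
  return x;
}

// Detects the byte order of the host running this sketch.
bool is_little_endian() {
  const std::uint32_t probe = 1;
  std::uint8_t first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Checks that both forms agree; only meaningful on a little-endian host.
void check_bitcast_matches(const std::array<std::uint8_t, 4> &v) {
  if (is_little_endian())
    assert(pack_with_bitcast(v) == pack_with_shifts(v));
}
```

For the bytes `{0x11, 0x22, 0x33, 0x44}`, `pack_with_shifts` yields `0x44332211` on any host, and `pack_with_bitcast` yields the same value on a little-endian host.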
Co-authored-by: Amit Kumar Pandey <137622562+ampandey-1995@users.noreply.github.com>
Co-authored-by: Hans Wennborg <hans@chromium.org>
Co-authored-by: Amit Pandey <pandey.kumaramit2023@gmail.com>
llvm#3870)

…(llvm#3208)

'hsa_vmem_address_free'.

Implement interception of 'hsa_amd_vmem_address_reserve_align' and
'hsa_vmem_address_free' so as to support ASan overflow errors for memory
allocated via 'hipMallocManaged'.

[Ticket: SWDEV-483895]

---------

Co-authored-by: Amit Pandey <pandey.kumaramit2023@gmail.com>
Due to a botched merge, we currently emit volatile loads from feature
predicate globals. These are never foldable, which breaks things. This
does not apply to the upstream patch currently under review.

Committing on behalf of github user @AlexVlx
llvm#3577) ...(llvm#131167)

Fixes SWDEV-514946

Co-authored-by: Emma Pilkington <emma.pilkington95@gmail.com>
…lvm#3748)

This along with IntrReadMem means that the Intrinsic only reads memory
through the given argument ptr and its derivatives. This allows passes
like Inliner to attach alias.scope to the call instruction as it sees
that no other memory is accessed.

Discovered via SWDEV-543741

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>

Cherry-picked from 1d30f71

---------

Co-authored-by: choikwa <5455710+choikwa@users.noreply.github.com>
…lvm#4011)

Restrict mfma scale operands to VGPR only (VRegSrc_32) to work around
a hardware design defect: for all Inline/SGPR constants, the SP HW uses
bits [30:23] as the scale.

TODO: We may still be able to allow Inline Constants/SGPR, with a proper
shift, to obtain a potentially better performance.

Fixes: SWDEV-548629
Co-authored-by: Thao, Vang <Vang.Thao@amd.com>
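
To make the bit range concrete, here is a small sketch of what reading bits [30:23] of a 32-bit operand means, and what the "proper shift" from the TODO might look like. The helper names are hypothetical, and the idea that the intended scale would otherwise sit in the low 8 bits is an assumption made for illustration; the commit message does not state it.

```cpp
#include <cassert>
#include <cstdint>

// Returns bits [30:23] of a 32-bit operand, the field the commit message
// says the hardware reads as the scale for Inline/SGPR constants.
std::uint32_t scale_field(std::uint32_t operand) {
  return (operand >> 23) & 0xFFu;
}

// Hypothetical pre-shift: writes an 8-bit scale into bits [30:23] of the
// operand and leaves the remaining bits untouched.
std::uint32_t with_scale(std::uint32_t operand, std::uint8_t scale) {
  const std::uint32_t mask = 0xFFu << 23;
  const std::uint32_t result =
      (operand & ~mask) | (static_cast<std::uint32_t>(scale) << 23);
  assert(scale_field(result) == scale);
  return result;
}
```

Under that assumption, a small constant such as `0x7F` held in the low bits is read as scale 0, whereas `with_scale(0, 0x7F)` places the value where bits [30:23] are read.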
Add reference to ROCm compiler reference, remove unused test file update
link in ENV topic
Bump the minimum required cmake version from 3.0 to 3.20.0 to enable
building with cmake 4. This is the same minimum required version that the
parent directory "offload" uses.

lamikr commented Sep 25, 2025

I tested with the amd-llvm version. Without this patch, the build with cmake 4.1.0 produces the following error:

0.8     -- Building the llvm-omp-kernel-replay tool
0.8     CMake Error at /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/compiler/amd-llvm/offload/hostexec/CMakeLists.txt:13 (cmake_minimum_required):
0.8       Compatibility with CMake < 3.5 has been removed from CMake.
0.8
0.8       Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
0.8       to tell CMake that the project requires at least <min> but has been updated
0.8       to work with policies introduced by <max> or earlier.
0.8
0.8       Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
0.8
0.8
0.8     -- Configuring incomplete, errors occurred!
0.9     FAILED: runtimes/runtimes-stamps/runtimes-configure /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/build/compiler/amd-llvm/build/runtimes/runtimes-stamps/runtimes-configure

With this patch applied, the llvm build worked with both cmake 3.28.3 and cmake 4.1.0.

226.9   -- Building the llvm-omp-kernel-replay tool
226.9   -- Building hostexec for AMDGCN linked against libhsa
226.9   -- HSA Runtime found: /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/build/compiler/amd-llvm/build/runtimes/rocr-runtime-prefix/src/rocr-runtime-build/rocr/lib/libhsa-runtime64.so
226.9   -- HSA Runtime include: /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/inc
226.9   -- Not building hostexec for NVPTX because cuda not found
226.9      -- Building hostexec with LLVM 20.0.0git found with CLANG_TOOL /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/build/compiler/amd-llvm/build/bin/clang
226.9   -- Building DeviceRTL. Using clang: /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/build/compiler/amd-llvm/build/bin/clang, llvm-link: /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/build/compiler/amd-llvm/build/bin/llvm-link and opt: /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/build/compiler/amd-llvm/build/bin/opt
226.9   -- Building offloading runtime library libomptarget.
226.9   -- Configuring done (0.5s)
227.2   -- Generating done (0.3s)
227.2   -- Build files have been written to: /home/lamikr/own/rock/src/sdk/therock_gfx1100_v2/build/compiler/amd-llvm/build/runtimes/runtimes-bins

After that I used the llvm build, with both cmake 3.28.3 and 4.1.0, to build the rest of the rocm stack and pytorch. Kernel loading to AMD GPUs worked fine with my pytorch and triton test apps.

The cmake version probably does not affect the lit tests, so they are probably not relevant here. They passed anyway for the command:

lit amd-llvm/openmp/runtime/test/ompt

Not sure how to do more testing for this one.


marbre commented Sep 25, 2025

> Not sure how to do more testing for this one.

There is no need for further testing on our side. The LLVM team has picked this up and is on it, but this will go to an internal repo first.
