Skip to content

Conversation

tomflinda
Copy link
Contributor

@tomflinda tomflinda commented Feb 10, 2025

The pulldown includes four PRs below to fix IP Leak issues
[AVX10.2] Update convert chapter intrinsic and mnemonics names (#123656)
[X86][AVX10.2-SATCVT][NFC] Remove NE from intrinsic and instruction name (#123275)
[X86][AVX10.2-MINMAX][NFC] Remove NE[P] from intrinsic and instruction (#123272)
[X86][AMX-AVX512][NFC] Remove P from intrinsic and instruction name (#123270)

Signed-off-by: chenwei.sun chenwei.sun@intel.com

dmpolukhin and others added 30 commits January 23, 2025 10:35
Summary:
This PR fixes bugreport
llvm/llvm-project#122493 The root problem is
the same as before lambda function and DeclRefExpr references a variable
that does not belong to the same module as the enclosing function body.
Therefore iteration over the function body doesn’t visit the VarDecl.
Before this change RelatedDeclsMap was created only for canonical decl
but in reality it has to be done for the definition of the function that
does not always match the canonical decl.

Test Plan: check-clang
…730)

oneapi-src/unified-runtime#2193

---------

Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
…… (#16737)

…ions.

These functions were wrongly defined to take addrspace(1). They take
generic pointers which are not annotated with any explicit address
space.
Intrinsics are available for the 'cpSize'
variants also. So, this patch migrates the Op
to lower to the intrinsics for all cases.

* Update the existing tests to check the lowering to intrinsics.
* Add newer cp_async_zfill tests to verify the lowering for the 'cpSize'
   variants.
* Tidy-up CHECK lines in cp_async() function in nvvmir.mlir (NFC)

PTX spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
We use variable locations such as DBG_VALUE $xmm0 as shorthand to refer
to "the low lane of $xmm0", and this is reflected in how DWARF is
interpreted too. However InstrRefBasedLDV tries to be smart and
interprets such a DBG_VALUE as a 128-bit reference. We then issue a
DW_OP_deref_size of 128 bits to the stack, which isn't permitted by
DWARF (it's larger than a pointer).

Solve this for now by not using DW_OP_deref_size if it would be illegal.
Instead we'll use DW_OP_deref, and the consumer will load the variable
type from the stack, which should be correct.

There's still a risk of imprecision when LLVM decides to use smaller or
larger value types than the source-variable type, which manifests as
too-little or too-much memory being read from the stack. However we
can't solve that without putting more type information in debug-info.

fixes #64093
… object parameters (#124096)

LLDB deduces the CV-qualifiers and storage class of a C++ method from
the object parameter. Currently it assumes that parameter is implicit
(and is a pointer type with the name "this"). This isn't true anymore in
C++23 with explicit object parameters. To support those we can simply
check the `DW_AT_object_pointer` of the subprogram DIE (works for both
declarations and definitions) when searching for the object parameter.

We can also remove the check for `eEncodingIsPointerUID`, because in C++
an artificial parameter called `this` is only ever the implicit object
parameter (at least for all the major compilers).
This patch adds NVVM intrinsics and NVPTX codegen for:

- cp.async.bulk.prefetch.L2.* variants 
- These intrinsics optionally support cache_hints as indicated by the
   boolean flag argument.
- Lit tests are added for all combinations of these intrinsics in
   cp-async-bulk.ll.
- The generated PTX is verified with a 12.3 ptxas executable.
- Added docs for these intrinsics in NVPTXUsage.rst file.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch


Co-authored-by: abmajumder <abmajumder@nvidia.com>
…explicit object parameters" (#124100)

Reverts llvm/llvm-project#124096

Broke linux CI:
```
Note: This is test shard 7 of 42.
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DWARFASTParserClangTests
[ RUN      ] DWARFASTParserClangTests.TestParseSubroutine_ExplicitObjectParameter
Expected<T> must be checked before access or destruction.
Expected<T> value was in success state. (Note: Expected<T> values in success mode must still be checked prior to being destroyed).
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  SymbolFileDWARFTests 0x0000560271ee5ba7
1  SymbolFileDWARFTests 0x0000560271ee3a2c
2  SymbolFileDWARFTests 0x0000560271ee63ea
3  libc.so.6            0x00007f3e54e5b050
4  libc.so.6            0x00007f3e54ea9e2c
5  libc.so.6            0x00007f3e54e5afb2 gsignal + 18
6  libc.so.6            0x00007f3e54e45472 abort + 211
7  SymbolFileDWARFTests 0x0000560271e79d51
8  SymbolFileDWARFTests 0x0000560271e724f7
9  SymbolFileDWARFTests 0x0000560271f39e2c
10 SymbolFileDWARFTests 0x0000560271f3b368
11 SymbolFileDWARFTests 0x0000560271f3c053
12 SymbolFileDWARFTests 0x0000560271f4cf67
13 SymbolFileDWARFTests 0x0000560271f4c18a
14 SymbolFileDWARFTests 0x0000560271f2561c
15 libc.so.6            0x00007f3e54e4624a
16 libc.so.6            0x00007f3e54e46305 __libc_start_main + 133
17 SymbolFileDWARFTests 0x0000560271e65161
```
Nothing in VPlan.h directly uses VPBlockUtils.h. Move it out to the more
appropriate VPlanUtils.h to reduce the size of the widely included VPlan.h.
…24101)

Reverts llvm/llvm-project#123393

This is causing `TestVectorOfVectorsFromStdModule.py` to fail on the the
macOS clang-15 matrix bot.
Prevents avoidable memory leaks.

Looks like exchange added in aa1333a
didn't take "continue" into account.

```
==llc==2150782==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 10 byte(s) in 1 object(s) allocated from:
    #0 0x5f1b0f9ac14a in strdup llvm-project/compiler-rt/lib/asan/asan_interceptors.cpp:593:3
    oneapi-src#1 0x5f1b1768428d in FileToRemoveList llvm-project/llvm/lib/Support/Unix/Signals.inc:105:55
```
…explicit object parameters" (#124100)"

This reverts commit a802093.

Relands original commit but fixing the unit-test to consume the
`llvm::Expected` error object.
Using a "random" name for an "anonymous" pipe seems to be the state of
the art on windows (according to stack overflow, new windows versions
may have something better, but it involves calling kernel APIs directly
and generally a lot of dark magic).

The problem with the current method was that is does not produce unique
names if one has two copies of the pipe code in the same process, which
is what happened with #120457 (because liblldb only exposes the public
api, and we've started using the pipe code in lldb-dap as well).

This patch works around the problem by adding the address of the counter
variable to the pipe name.

Replicating the multiple-copies setup in a test would be very difficult,
which is why I'm not adding a test for this scenario.
.. by changing the signal stop reason format 🤦

The reason this did not work is because the code in
`StopInfo::GetCrashingDereference` was looking for the string "address="
to extract the address of the crash. Macos stop reason strings have the
form
```
  EXC_BAD_ACCESS (code=1, address=0xdead)
```
while on linux they look like:
```
  signal SIGSEGV: address not mapped to object (fault address: 0xdead)
```

Extracting the address from a string sounds like a bad idea, but I
suppose there's some value in using a consistent format across
platforms, so this patch changes the signal format to use the equals
sign as well. All of the diagnose tests pass except one, which appears
to fail due to something similar #115453 (disassembler reports
unrelocated call targets).

I've left the tests disabled on windows, as the stop reason reporting
code works very differently there, and I suspect it won't work out of
the box. If I'm wrong -- the XFAIL will let us know.
UR PR: oneapi-src/unified-runtime#2605

---------

Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
…pecializations (#118167)

Some templates in the standard library are illegal to specialize for users
(even if the specialization contains user-defined types). The [[clang::no_specializations]]
attribute allows marking such base templates so that the compiler will
diagnose if users try adding a specialization.
There were two implementations of this - one that implemented nextafter
in software, and another that called a clang builtin. No in-tree targets
called the builtin, so all targets build the software version. The
builtin version has been removed, and the software version has been
renamed to be the "default".

This commit also optimizes nextafter, to avoid scalarization as much as
possible. Note however that the (CLC) relational builtins still
scalarize; those will be optimized in a separate commit.

Since nextafter is used by some convert_type builtins, the diff to IR
codegen is not limited to the builtin itself.
The code that checks a predicate against a swapped predicate in
isImpliedCondBalancedTypes is not covered by any existing test, within
any Analysis or Transform. Fix this by adding a test to SCEV.
Support reductions in SCFToGPU: `scf.parallel` and `scf.reduce` op
combination is now converted to a `gpu.all_reduce` op.
Intel docs have been updated to be similar to AMD and now describe
BSF/BSR as not changing the destination register if the input value was
zero, which allows us to support CTTZ/CTLZ zero-input cases by setting
the destination to support a NumBits result (BSR is a bit messy as it
has to be XOR'd to create a CTLZ result). VIA/Zhaoxin x86_64 CPUs have also
been confirmed to match this behaviour.

This patch adjusts the X86ISD::BSF/BSR nodes to take a "pass through"
argument for zero-input cases, by default this is set to UNDEF to match
existing behaviour, but it can be set to a suitable value if supported.

There are still some limits to this - its only supported for x86_64
capable processors (and I've only enabled it for x86_64 codegen), and
Intel CPUs sometimes zero the upper 32-bits of a pass through register
when used for BSR32/BSF32 with a zero source value (i.e. the whole
64bits may not get passed through).

Fixes #122004
Split target under LLVMIR/Transforms to avoid deps loop.
This code was using a pre-move-semantics trick of using std::swap to
avoid expensive vector copies.
This is meant as a short-term workaround for an invalid conversion in
this pass that occurs because existing SDWA selections are not correctly
taken into account during the conversion.

See the draft PR #123221 for an attempt to fix the actual issue.

---------

Co-authored-by: Frederik Harwath <fharwath@amd.com>
This header has been removed in C++20 and causes a large amount of
deprecation spam when building against libstdc++ 15 in C++17 mode.

As far as I understand, we just need to include *some* STL header to get
access to the version macros, and as this header also includes
<optional> nowadays we can just drop the <cstd646> include entirely.
…124089)

Fixes #123800

Extends LDS lowering by allowing it to discover transitive
indirect/escpaing references to LDS GVs.

For example, given the following input:
```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

```

With the above input, prior to this patch, LDS lowering failed to lower
the reference to `@lds_item_to_indirectly_load` because:
1. it is indirectly called by a function whose address is taken in the
kernel.
2. we did not check if the kernel indirectly makes any calls to unknown
functions (we only checked the direct calls).

Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>
ayylol and others added 22 commits February 6, 2025 09:54
Matching PR: oneapi-src/unified-runtime#2655

---------

Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
…s (#16792)

SYCL_ENABLE_PCI is deprecated here
intel/llvm@2d12863.
Remove it as requirement for device info query.
Remove statement that some device info is supported on L0 only if there
is at least one another backend that could return it. (mostly it is L0
+hip/cuda).

---------

Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
Add autodetection of the CUDA SDK using CMake's builtin `find_package`.

I didn't remove the LIT params or handling in lit.py as to not break
anyone who may want to use a custom install.

Confirmed working
[here](https://github.com/intel/llvm/actions/runs/13166519697/job/36748743079?pr=16896)
as `Adapters/cuda_queue_priority.cpp` passed.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
…904)

Main reason is to avoid having partial specializations for swizzles in
the implementation of the traits as we're going to have big changes in
that area. Between this change and having to update the traits this
approach seemed better.
Also, make kernel_bundle_impl member variable naming consistent.
Follow-up from intel/llvm#16896

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
…erties (#16728)

As the title says, added deprecation messages for such overloads in the
`sycl_ext_oneapi_kernel_properties extension`, suggesting users to use
`single_task/parallel_for` overloads provided in the
`sycl_ext_oneapi_enqueue_functions` extension instead. (As these
overloads are to be removed later as mentioned in
intel/llvm#14785) Also fixed an affected test
case.

---------

Signed-off-by: Hu, Peisen <peisen.hu@intel.com>
Add `XFAIL: accelerator` to `OCL_interop_test.cpp` because of
[this](https://github.com/intel/llvm/actions/runs/13187415776/job/36814982904)
and remove a rogue print statement meant for debugging purposes during
development which is most likely unrelated to the failure.
An issue for the failure has also been created
intel/llvm#16914
Avoid using partial specialization for swizzles as it depends on the
swizzles implementation and we'll be making significant changes there to
align with the proposed specification changes.
Simplify handling of multiple address spaces and alignment checks.
Additional improvement regarding alignment checks is being done here (to
perform compile-time alignment check instead of expensive dynamic
check): intel/llvm#16882
Also this PR fixes alignment requirement for local address space:
16-byte alignment is required for both load and store.
…913)

That would match with pre-existing `check_type_in_v`.
fixes intel/llvm#16693

---------

Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
…some Matrix tests (#16855)

Adds `UNSUPPORTED: target-nvidia, target-amd` to matrix tests that also
require `aspect-ext_intel_matrix` since these tests are not yet
supported for those triples.
1. not treating not having two GPU devices as an error, but skip the
test instead.
2. If the two GPU devices are not the same model, skip the test(not our
targeted setup).
@tomflinda tomflinda requested a review from a team as a code owner February 10, 2025 08:59
Signed-off-by: chenwei.sun <chenwei.sun@intel.com>
Signed-off-by: chenwei.sun <chenwei.sun@intel.com>
Signed-off-by: chenwei.sun <chenwei.sun@intel.com>
@tomflinda
Copy link
Contributor Author

This pulldown PR is out of date, so close it.

@tomflinda tomflinda closed this Feb 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.