[XeGPU] uArch definition (PR 1/N) #2

Open · wants to merge 11 commits into base: main
Conversation


@mshahneo mshahneo commented Mar 28, 2025

Update:

After the first iteration of review comments, the current PR is updated in the following way:

  • The main pivot for this iteration is the utilities. These utilities are exposed via a range of Interfaces (e.g., TileOpInterface, MatrixOpInterface). A specific instruction for a specific architecture needs to implement these utilities.
  • The uArch information is now directly embedded in the utility implementation. We are not using .yaml files. However, I haven't removed the .yaml file yet, since it can still be utilized if we choose to do so in the future.
  • Also, last but not least, the current version provides an end-to-end prototype of how uArch can be used:
    -- Full uArch definition, utilities and all (at least for one instruction for now, DPAS).
    -- Definition for 2 uArchs (PVC, BMG).
    -- A pass to attach the uArch name to the module attribute using DLTI.
    -- Shows the usage of the uArch interface to verify XeGPU ops.
    -- A few test cases.

P.S. I still kept some old code intentionally, if we want to go back or use some of it. I'll clean up once we agree on the process.
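The interface-based pivot described above could be sketched roughly as follows. This is only an illustration: the names `MatrixOpInterface`, `PVCDPAS`, and the returned values are assumptions for the sketch, not the PR's actual code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the utility-interface pivot: a common interface
// that a specific instruction on a specific uArch implements.
struct MatrixOpInterface {
  virtual ~MatrixOpInterface() = default;
  // Supported M dimensions of the matrix op for a given element type.
  virtual std::vector<unsigned> getSupportedM(const std::string &type) = 0;
};

// PVC's DPAS implementing the interface; the values are placeholders.
struct PVCDPAS : MatrixOpInterface {
  std::vector<unsigned> getSupportedM(const std::string &type) override {
    return {1, 2, 3, 4, 5, 6, 7, 8}; // illustrative repeat counts
  }
};
```

A verifier pass could then query the interface for whatever uArch is attached to the module, instead of hard-coding per-target checks.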

================================

Hi All,

This is the initial PR for uArch definition. The primary purpose for this PR is to have a discussion and finalize the design. So please think of it as an RFC as well.

First target we want to achieve here is to finalize the skeleton:

  • Skeleton of the YAML file, what information we want in there
  • Skeleton of Data structures we want to have (e.g., necessary to create utilities to support the need)

Due to this purpose, the main focus points of this PR are the following files:

  • mlir/lib/Dialect/XeGPU/Utils/intel_gpu_pvc.yaml : holds the YAML definition of uArch (PVC is used for this use case)
  • mlir/include/mlir/Dialect/XeGPU/Utils/uArch.h : holds the base uArch struct that can be re-used by specialized uArchs such as PVC.
  • mlir/include/mlir/Dialect/XeGPU/Utils/IntelGpuPVC.h : holds the PVC-specific structs.

mlir/lib/Dialect/XeGPU/Utils/IntelGpuPVC.cpp is added as a placeholder for YAML mapping. This file is intentionally incomplete. We want to reach a consensus on the YAML and C++ structures before filling this one in.

Please provide your valuable feedback as we try to finalize this together.

@mshahneo mshahneo requested review from nbpatel and chencha3 March 28, 2025 13:27
} // namespace mlir

#endif // MLIR_DIALECT_XEGPU_UTILS_INTEL_GPU_PVC_H
//===--- IntelGpuPVC.h ---------------------------------------*- C++ -*-===//
Collaborator

this one should be on the top I guess

Owner Author

Ah, yes, I think it was auto-added by a Copilot suggestion. Will remove it :)

@mshahneo mshahneo requested a review from silee2 March 31, 2025 16:07
@adam-smnk
Collaborator

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions?
Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

Range systolic_depth;
Range repeat_count;
Range execution_size;
std::map<std::string, uint> ops_per_channel;
Collaborator

Opaque string mapping doesn't feel user friendly. Especially as it seems to be wrapping numerical bit width that could be a numerical key directly - I assume it's some limitation due to using (conversion from) yaml.

matrix_size is an even worse offender in that regard. These wrappers are hard to use without yaml definitions, which again makes me wonder if we really need to split it into two separate files.

Owner Author

Opaque string mapping doesn't feel user friendly. Especially as it seems to be wrapping numerical bit width that could be a numerical key directly - I assume it's some limitation due to using (conversion from) yaml.

You are right, the limitation is due to the yaml mapping.

matrix_size is an even worse offender in that regard. These wrappers are hard to use without yaml definitions, which again makes me wonder if we really need to split it into two separate files.

Sorry, the matrix size is a mistake. The value should be a vector of uint. But yes, in general you are right, the structs and yaml have to be used in conjunction to some extent.
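The reviewer's suggestion of keying by the numeric bit width directly, rather than an opaque string, could look roughly like this. The widths and values below are illustrative placeholders, not PVC's actual ops_per_channel table.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical alternative: ops_per_channel keyed by numeric bit width
// instead of an opaque string key. Values are placeholders only.
std::map<uint32_t, uint32_t> opsPerChannel = {
    {32, 1}, {16, 2}, {8, 4}, {4, 8}, {2, 8}};

// Returns 0 when the bit width is not supported by the instruction.
uint32_t lookupOpsPerChannel(uint32_t bitWidth) {
  auto it = opsPerChannel.find(bitWidth);
  return it == opsPerChannel.end() ? 0 : it->second;
}
```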

@mshahneo
Owner Author

mshahneo commented Apr 2, 2025

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions? Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

The C++ structure actually comes out of necessity. To use the yaml mapping utility in LLVM, we have to have a C++ object mapped to it, hence the structures. And since we need the C++ structures anyway, I wanted to make some of them re-usable across uArchs.

Do we expect sth to reuse yaml separately from C++

Not really, not in this context anyway. But we wanted to keep the base structs in such a way that they can be used in non-LLVM cases, but that's more of a hope than a necessity.

Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

I thought about this, but it takes me back to tablegen. Should we use tablegen?

@adam-smnk
Collaborator

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions? Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

The C++ structure actually comes out of necessity. To use the yaml mapping utility in LLVM, we have to have a C++ object mapped to it, hence the structures. And since we need the C++ structures anyway, I wanted to make some of them re-usable across uArchs.

Do we expect sth to reuse yaml separately from C++

Not really, not in this context anyway. But we wanted to keep the base structs in such a way that they can be used in non-LLVM cases, but that's more of a hope than a necessity.

It feels like the overhead of this approach in its complexity and possible maintenance cost is not really justified. On its own it seems fine but these C++ bindings are not great. 😅
Overall, I'd say yaml is not really a first-class citizen in MLIR so it'd face extra scrutiny during upstreaming.

Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

I thought about this, but it takes me back to tablegen. Should we use tablegen?

It's worth a try if what we need easily translates to tablegen entries and structure. That is, if you find yourself at any point fighting against tablegen or in need to introduce custom "hacks", I would not necessarily double down on it.

I would suggest starting with user interfaces to flesh out what's really needed. Then it should become clearer where to store all the uarch details e.g., embed shared memory size directly in getShmemSize() or create a separate struct or pull it from a constexpr dictionary that contains everything etc.

Next quick test would be to see how well it all holds up for another target like battlemage.
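The storage options Adam mentions (a getter that embeds the value, versus a separate per-uArch record) might look roughly like this. The names `getShmemSize` and `UArchInfo`, and every numeric value, are assumptions for the sketch, not actual hardware data.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Option 1 (hypothetical): embed the uArch detail directly in the getter.
constexpr uint64_t getShmemSize() { return 128 * 1024; } // placeholder size

// Option 2 (hypothetical): a constexpr per-uArch record holding everything,
// so new targets only add one entry instead of new getters.
struct UArchInfo {
  std::string_view name;
  uint64_t shmemSize;
  uint32_t numEUsPerXeCore;
};

constexpr UArchInfo kPVC{"intel_gpu_pvc", 128 * 1024, 8};
constexpr UArchInfo kBMG{"intel_gpu_bmg", 128 * 1024, 8};
```

A quick way to compare the options, as suggested, is to fill in the record for a second target like battlemage and see which form stays maintainable.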

@mshahneo
Owner Author

mshahneo commented Apr 4, 2025

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions? Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

The C++ structure actually comes out of necessity. To use the yaml mapping utility in LLVM, we have to have a C++ object mapped to it, hence the structures. And since we need the C++ structures anyway, I wanted to make some of them re-usable across uArchs.

Do we expect sth to reuse yaml separately from C++

Not really, not in this context anyway. But we wanted to keep the base structs in such a way that they can be used in non-LLVM cases, but that's more of a hope than a necessity.

It feels like the overhead of this approach in its complexity and possible maintenance cost is not really justified. On its own it seems fine but these C++ bindings are not great. 😅 Overall, I'd say yaml is not really a first-class citizen in MLIR so it'd face extra scrutiny during upstreaming.

Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

I thought about this, but it takes me back to tablegen. Should we use tablegen?

It's worth a try if what we need easily translates to tablegen entries and structure. That is, if you find yourself at any point fighting against tablegen or in need to introduce custom "hacks", I would not necessarily double down on it.

I would suggest starting with user interfaces to flesh out what's really needed. Then it should become clearer where to store all the uarch details e.g., embed shared memory size directly in getShmemSize() or create a separate struct or pull it from a constexpr dictionary that contains everything etc.

Next quick test would be to see how well it all holds up for another target like battlemage.

Thanks, Adam. I think that's a good idea. Let me try to collect the uArch info for battlemage. I looked into it during the design, but more details may help us find the approach that would be easy to maintain.

mshahneo pushed a commit that referenced this pull request May 6, 2025
mshahneo pushed a commit that referenced this pull request May 6, 2025
mshahneo pushed a commit that referenced this pull request May 8, 2025
mshahneo pushed a commit that referenced this pull request Jun 13, 2025
@mshahneo
Owner Author

Hi Adam (@adam-smnk), Igor (@Garra1980 ),

Modified the uarch definition based on our discussion.
In this version the Pivot is the utilities which necessary instructions have to implement.
I also moved the information to C++ (as part of get functions directly instead of in yaml).

Please let me know what you think. I only added implementation for DPAS. Will add Load/store/prefetch if we agree on the process.

Thanks a lot in advance :).

P.S. I still kept some old code intentionally, if we want to go back or use some of it. I'll clean up once we agree on the process.

@mshahneo mshahneo changed the title uArch definition (PR 1/N) [XeGPU] uArch definition (PR 1/N) Jun 26, 2025
@mshahneo
Owner Author

mshahneo commented Jun 26, 2025

@silee2 , @nbpatel , @chencha3 , @charithaintc , @Jianhui-Li, @akroviakov
please let me know if you have any comments/suggestions.
Thank you so much in advance :)

@Garra1980
Collaborator

also cc @tkarna

@tkarna

tkarna commented Jun 30, 2025

Do you think xegpu could at some point support the Data Layout and Target Information (DLTI) dialect for querying the uarch info?
FYI @rolfmorel

@rolfmorel

rolfmorel commented Jun 30, 2025

Thanks for adding me, @tkarna ! Agreed, it would be great to collaborate on ways to expose this target info through DLTI.

I think @tkarna's work on performant schedules for Xe - which have plentiful magic values derivable from hw info - should be a good exemplar of a user of Xe's DLTI. Showing how to automatically derive those values through DLTI would be a great success indicator.

I had a quick attempt at what exposing this data through DLTI might look like (here using the transform dialect though you can also do the same queries from C++):

```mlir
module attributes { #xevm.target<arch = "intel_gpu_pvc", ... = ?any other flags that vary across hardware?> } {
    ...
}
module attributes {transform.with_named_sequence} {
    transform.named_sequence @__main__(%mod: !transform.any_op) {
        %number_of_eus_per_xe_core = transform.dlti.query ["xe_core","#eus"] at %mod  // #xevm.target answers 8
        %number_of_threads_per_eu = transform.dlti.query ["xe_core","eu","#threads"] at %mod  // #xevm.target answers [8, 4]

        %shared_memory_size = transform.dlti.query ["xe","shared_memory","size"] at %mod  // #xevm.target answers 524288
        %shared_memory_alignment = transform.dlti.query ["xe","shared_memory","alignment"] at %mod  // #xevm.target answers 64

        %grf_bitwidth = transform.dlti.query ["xe","grf","bitwidth"] at %mod  // #xevm.target answers 512
        %grf_modes = transform.dlti.query ["xe","grf","modes"] at %mod  // #xevm.target answers ["small", "large"]
        %grf_count_per_thread = transform.dlti.query ["xe","grf","mode","small","per_thread"] at %mod  // #xevm.target answers 128

        %dpas_opcode = transform.dlti.query ["xe","inst","dpas","opcode"] at %mod  // #xevm.target answers 0x83
        %dpas_systolic_depth = transform.dlti.query ["xe","inst","dpas","systolic_depth"] at %mod  // #xevm.target answers [8]
        %dpas_repeat_count = transform.dlti.query ["xe","inst","dpas","repeat_count"] at %mod  // #xevm.target answers 1,2,3,4,5,6,7,8 as multiple associations for single result handle OR [1, 8] OR #dlti.map<"min" = 1, "max" = 8> OR #dlti.range<1,2,...,8>/#tune.range<1,2,...,8> which would also respond to "min" and "max" queries,
        %dpas_ops_per_channel = transform.dlti.query ["xe","inst","dpas","ops_per_channel"] at %mod  // #xevm.target answers #dlti.map<19: 1, 16: 2, 8: 4, 4: 8, 2:8>

        %dpas_supported_types = transform.dlti.query ["xe","inst","dpas","supported_types"] at %mod  // #xevm.target answers #xevm.dpas_types so that ...
        %dpas_supported_types_B_wildcard = transform.dlti.query ["xe","inst","dpas","supported_types", f16, f16, f16, #dlti._] at %mod  // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16
        %dpas_supported_types_Dst_wildcard = transform.dlti.query ["xe","inst","dpas","supported_types", #dlti._, f16, f16, f16] at %mod  // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16,f32 (i.e. two type params associated to the handle)
        %dpas_supported_types_Dst_and_Acc_wildcard:2 = transform.dlti.query ["xe","inst","dpas","supported_types", #dlti._, #dlti._, f16, f16] at %mod  // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16,f32,f16,f32 for first result and f16,f16,f32,f32 for second result

        ... // transform ops would take the above Values as parameters
    }
}
```

That is, there's an #xevm.target attribute (or #xegpu.target) that handles most queries directly, though it can also delegate to attributes like #xevm.dpas_types (in this case constructed by #xevm.target) if so desired. This means there should be an XeTargetAttr implementing the DLTIQueryInterface's single query(..) method. We can discuss the best mechanism for implementing this method.

Not every query above is (easily) implemented with current upstream DLTI. I am working on a PR (oriented towards DLTI attributes that respond to queries by querying for target info and applying cost models) that will enable all of the above; I will push it as a draft this week.

@mshahneo
Owner Author

mshahneo commented Jun 30, 2025

@tkarna , yes, we would like to.
Our current plan was to utilize DLTI only to query/expose the target triple.

As @rolfmorel pointed out, the current DLTI is too verbose and simple. @adam-smnk & I spoke about using DLTI directly; the issue was that mapping all the hardware info into the DLTI attribute made the IR way too verbose and big (e.g., think of the .yaml file in the attribute list :().

Thanks @rolfmorel for the detailed example. I remember you said that you envision that a query can essentially take a C++ function as a key. That's why in the current iteration I tried to make the utilities first-class citizens; this way, any ops that implement these interfaces can be queried via DLTI:

%dpas_supported_M_sizes = transform.dlti.query ["xe","inst","dpas","get_supported_M", "bf16"] at %mod // #xevm.target answers the possible M sizes for type bf16...

This would essentially call that specific uArch's implementation of getSupportedM() function.

I'd like to discuss more with you all on this.
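The idea of routing a string query key to a uArch utility function could be sketched like this. The key name, `getSupportedM`, and the returned values are illustrative assumptions only, not the actual query mechanism.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical uArch utility the query would ultimately call.
std::vector<unsigned> getSupportedM(const std::string &type) {
  return {1, 2, 4, 8}; // placeholder values
}

// A query key such as "get_supported_M" dispatches to the C++ utility,
// so a DLTI-style query can be answered by the uArch implementation.
using QueryFn = std::function<std::vector<unsigned>(const std::string &)>;
std::map<std::string, QueryFn> queryTable = {
    {"get_supported_M", getSupportedM}};

std::vector<unsigned> runQuery(const std::string &key,
                               const std::string &type) {
  auto it = queryTable.find(key);
  return it == queryTable.end() ? std::vector<unsigned>{} : it->second(type);
}
```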

Collaborator

@adam-smnk adam-smnk left a comment

I like the general direction 👍
The overall concept to have common descriptors (the uArch.h) which can have platform specific implementations (PVC, battlemage variants etc.) is great.

I haven't paid too close attention to exact implementation yet. However, it is a bit hard to really judge these helper APIs and their exact abstraction right now. I think it might be best to just start with a small prototype as soon as possible and just iterate on the go.

@chencha3 I think the on-going XeGPU verifier relaxation could be a great opportunity to run more hands-on experiments with this infrastructure. Perhaps a small uArch-aware validation pass for XeGPU or XeVM ops?

@tkarna If you could share what key hardware properties you already use for tuning or which further info could help you in that, it'd be super valuable feedback to help steer this design.

Comment on lines 37 to 38
uint no_of_dims;
std::vector<uint> dims;
Collaborator

What exactly is the meaning of no_of_dims?
In this case, if dims.size() is different than no_of_dims, then what happens?

Owner Author

You are right, we could just remove no_of_dims and get it from dims.size().
Maybe remove the class itself. Let me see.

@tkarna

tkarna commented Jul 1, 2025

@tkarna If you could share what key hardware properties you already use for tuning or which further info could help you in that, it'd be super valuable feedback to help steer this design.

In my preliminary tests I've used the following parameters for a 4k matmul benchmark.

Fixed parameters:

  • DPAS tile sizes (n, m, k)

Tunable parameters:

  • WG tile (n, m); SG tile (n, m); K tile size
  • Prefetch tile sizes for A and B
  • Block (load) tile sizes for A and B

Implemented constraints: a tile size must not exceed the parent tile size and must divide the parent tile evenly. SG tiling must not exceed the max number of threads (32). The prefetch tile size must be compatible with the SG-level thread configuration (coop prefetching).

There are other hardware constraints I have not explicitly expressed yet, for example maximum/supported load tile size and VNNI packing constraints for B. Some configs do spill registers and run slower.
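The tiling constraints listed above could be expressed as a simple validity check. This is a sketch under assumed shapes and the stated 32-thread limit; it is not the actual tuning code.

```cpp
#include <cassert>

// A child tile must be nonzero, fit in its parent, and divide it evenly.
bool dividesEvenly(unsigned parent, unsigned child) {
  return child != 0 && child <= parent && parent % child == 0;
}

// SG tiling of a WG tile must also respect the thread budget:
// the number of subgroups covering the WG tile cannot exceed maxThreads.
bool validSGTiling(unsigned wgM, unsigned wgN, unsigned sgM, unsigned sgN,
                   unsigned maxThreads = 32) {
  if (!dividesEvenly(wgM, sgM) || !dividesEvenly(wgN, sgN))
    return false;
  return (wgM / sgM) * (wgN / sgN) <= maxThreads;
}
```

Further constraints like maximum load tile size or VNNI packing for B would slot in as additional predicates of the same shape.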

@mshahneo
Owner Author

mshahneo commented Jul 1, 2025

I like the general direction 👍 The overall concept to have common descriptors (the uArch.h) which can have platform specific implementations (PVC, battlemage variants etc.) is great.

I haven't paid too close attention to exact implementation yet. However, it is a bit hard to really judge these helper APIs and their exact abstraction right now. I think it might be best to just start with a small prototype as soon as possible and just iterate on the go.

@chencha3 I think the on-going XeGPU verifier relaxation could be a great opportunity to run more hands-on experiments with this infrastructure. Perhaps a small uArch-aware validation pass for XeGPU or XeVM ops?

@tkarna If you could share what key hardware properties you already use for tuning or which further info could help you in that, it'd be super valuable feedback to help steer this design.

Thank you so much, @adam-smnk, for the feedback :-)
I am working on the prototype; I should be able to push the changes today.

@mshahneo mshahneo force-pushed the uarch_definition branch 2 times, most recently from 042311e to 94ae51e Compare July 2, 2025 05:05
mshahneo pushed a commit that referenced this pull request Jul 2, 2025
@mshahneo mshahneo changed the base branch from main to dummy_main July 2, 2025 05:08
mshahneo added 4 commits July 2, 2025 05:10
This version focuses on the utilities to be the pivot.
It also saves info directly in C++ files as part of get functions.
Don't use the yamls anymore.

Adds support for DPAS instruction.
Remove compiler warnings and errors.
@mshahneo mshahneo force-pushed the uarch_definition branch from 94ae51e to af7098b Compare July 2, 2025 05:11
@mshahneo mshahneo deleted the branch main July 2, 2025 05:12
@mshahneo mshahneo closed this Jul 2, 2025
@mshahneo mshahneo reopened this Jul 2, 2025
@mshahneo mshahneo changed the base branch from dummy_main to main July 2, 2025 05:17

mshahneo commented Jul 2, 2025

Hi @adam-smnk , @rolfmorel , @tkarna , @Jianhui-Li, @silee2, @chencha3 , @charithaintc,

I modified the PR to add a prototype. Please take a look at your convenience.

Owner Author:

Potentially, yes.

@@ -577,6 +579,45 @@ LogicalResult DpasOp::verify() {
if (getAcc() && getAcc().getType() != getResultType())
return emitOpError("Expecting the acc type to be the same as result.");

// @uArch: Check if the types are supported for DPAS.
Operation *op = getOperation();
Collaborator:

I guess this piece can be hidden inside the uArch utilities so the client can just say something like
auto res = verify(op, "dpas");
?

Owner Author:

The issue with this is that it requires the utility to have all the parameters of the op.
Alternatively, the op would have to implement the same interface so that the uArch utility can get the parameters.
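The interface-based alternative can be sketched in standalone C++ (all names here — MatrixOpInterface, verifyMatrixOp, ToyDpasOp — are hypothetical illustrations, not the actual XeGPU classes): the op exposes its parameters through a small interface, and a generic uArch utility verifies them against the supported sizes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical op-side interface: the op exposes the shape parameters
// the uArch utility needs, instead of the utility taking them all as args.
struct MatrixOpInterface {
  virtual ~MatrixOpInterface() = default;
  virtual unsigned getM() const = 0;
  virtual unsigned getN() const = 0;
  virtual unsigned getK() const = 0;
};

// uArch-side utility: checks the op's shape against the supported sizes.
inline bool verifyMatrixOp(const MatrixOpInterface &op,
                           const std::vector<unsigned> &supportedM,
                           const std::vector<unsigned> &supportedN,
                           const std::vector<unsigned> &supportedK) {
  auto contains = [](const std::vector<unsigned> &v, unsigned x) {
    return std::find(v.begin(), v.end(), x) != v.end();
  };
  return contains(supportedM, op.getM()) && contains(supportedN, op.getN()) &&
         contains(supportedK, op.getK());
}

// A toy op implementing the interface.
struct ToyDpasOp : MatrixOpInterface {
  unsigned m, n, k;
  ToyDpasOp(unsigned m, unsigned n, unsigned k) : m(m), n(n), k(k) {}
  unsigned getM() const override { return m; }
  unsigned getN() const override { return n; }
  unsigned getK() const override { return k; }
};
```

With this shape, the verifier call collapses to one line per op, at the cost of each op implementing the interface.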

@@ -0,0 +1,14 @@
module @eltwise_add attributes {gpu.container_module} {
Collaborator:

It seems the test checks are missing here.

Owner Author:

Ah, sorry Chao, this test is supposed to be removed. The other 2 tests cover the mechanism. I pushed it initially and forgot to remove it.

"xegpu::XeGPUDialect", "mlir::DLTIDialect"
];
}

Collaborator:

I want to raise the following discussion points:

  1. NV and AMD are putting nvvm-attach-target and rocdl-attach-target under the GPU dialect. Not sure whether it is better for us to follow their way too.
  2. They are attached to GPUModuleOp, a finer granularity than ours. It may be a requirement of the GPU dialect. A potential benefit is that they may be able to support the case of a ModuleOp containing multiple GPUModuleOps, with each GPUModuleOp mapping to a different GPU from a different vendor.
  3. They used NVVMTargetAttr and ROCDLTargetAttr to store the info. Now, we have XeVMTargetAttr, can we reuse it here too? Considering we only capture the device name here, which is also captured by XeVMTargetAttr?

Collaborator:

Sang Ik created this PR: llvm#147372. Is it similar to the pass here?

Collaborator:

I agree, this appears similar to XeVM's attach-target pass that is yet to be merged. After the merge, the chip string of the XeVM target should be appropriate for querying uArch info:

    auto targetDeviceArch =
        mlir::xegpu::uArch::uArchMap::instance().get(targetDeviceNameStr);

@rolfmorel (Jul 8, 2025):

Do the XeVM targets have a sensible notion of an LLVM triple alongside that of chip? How about (LLVM) features? If so, it would probably make sense to have a #xevm.target attribute, like the one proposed above, which implements the LLVMTargetAttrInterface from this WIP upstream PR:
https://github.com/llvm/llvm-project/pull/145899/files#diff-6c2503d165a7390c955c3c4fa4a76fd1991633b5ec597a1b1fd92731e1f3684dR572

I am in the process of making #nvvm.target and #rocdl.target attributes implement this interface (in addition to the generic #llvm.target attr in that PR). If sensible, the XeVM dialect should probably (try to) mirror this.


Given this in https://github.com/llvm/llvm-project/pull/147372/files#diff-3da7b8747032bb581f119fd49037d06706443c3c5db56d4d28cc74053a9b754dR284 in the Add xevm-attach-target transform pass PR, I take it there's a sensible triple.

I will have a look at making xevm.target also implement my interface.

Owner Author:

Thanks Chao, Rolf, Artem.

You are right once XeVM target attribute gets upstreamed, we should use it. This pass is here just to show the use case.


👍

On that note, I found that the #xevm.target attr itself already got merged: https://github.com/llvm/llvm-project/blob/b9d7513bf134febe72c05a04ff20f87191d7213a/mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td#L521

We can get cracking!

dpas:
instruction_name: dpas
instruction_description: "Dot Product Accumulate Systolic (DPAS) is a matrix multiply-add operation"
instruction_opcode: "0x83"


Why opcode?

Owner Author:

I added some extra fields in the structs in case they become useful in the future; they are not used in our current use-case. I can remove them if we don't think they are necessary.

// no_of_dims: 2
// dim: [2, 2]
// This represents a 2x2 tile
struct Tile {


Tile is an overloaded term. It looks to me like just a shape. Is there any pre-defined data type you can reuse?

Owner Author:

Sorry, Jianhui. I am not using it in the current version of the implementation. I kept it in case we want to use it in some fashion later.

The current implementation uses std::vector.

Collaborator:

@mshahneo Looks like we should remove older version of the interface and keep the one we want to proceed with

// - [2, 16]
// This represents a 2x2 RangeTile where the first dimension can have values
// from 1 to 32 and the second dimension can have values from 2 to 16
struct RangeTile {


Why not just use DiscreteTile, if this is not used that often?

Owner Author:

I kept two different ways to specify the range.

// dims:
// - [1, 2, 4, 8, 16, 32]
// - [2, 4, 8, 16]
// This represents a 2x2 DiscreteTile where the first dimension can have values
@Jianhui-Li (Jul 8, 2025):

It seems that DiscreteTile represents shapes combined from lists of sizes. How about naming it ShapeCombination or something more specific to our problem and less general?

Owner Author:

That makes sense. The current implementation actually uses std::vector.
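The "lists of sizes per dimension" idea can be sketched as a standalone helper (the name isSupportedShape and the exact signature are illustrative, not the PR's actual code): a shape is supported when each dimension's size appears in the corresponding allowed-size list.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative check for a discrete shape combination: one list of allowed
// sizes per dimension; a shape is supported when every dimension's size
// appears in the corresponding list.
inline bool isSupportedShape(
    const std::vector<std::vector<int64_t>> &allowedSizes,
    const std::vector<int64_t> &shape) {
  if (shape.size() != allowedSizes.size())
    return false;
  for (std::size_t d = 0; d < shape.size(); ++d) {
    const auto &dim = allowedSizes[d];
    if (std::find(dim.begin(), dim.end(), shape[d]) == dim.end())
      return false;
  }
  return true;
}
```

For the example from the diff — dims [1, 2, 4, 8, 16, 32] and [2, 4, 8, 16] — a shape like 8x16 would pass and 3x16 would not.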

std::tuple<Args...> data;
std::function<bool(Args...)> func;

Restriction(Args... args, std::function<bool(Args...)> f)


The term "restriction" is also very generic. Please consider making it specific, e.g., validating that an instance belongs to a combination?

Owner Author:

Restriction is supposed to model all restrictions.
A common validate() could then just apply all the restrictions. However, we can always go back to specific validations. I'll think about this.
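A minimal standalone sketch of that idea, assuming the std::tuple/std::function layout shown in the diff above (the validate() method and the example predicate are illustrative additions, not the PR's exact code): the arguments are captured once, and validate() applies the stored predicate to them via std::apply.

```cpp
#include <functional>
#include <tuple>
#include <utility>

// Sketch of a generic Restriction: stores the arguments and a predicate,
// so a common validate() can apply every restriction uniformly.
template <typename... Args>
struct Restriction {
  std::tuple<Args...> data;
  std::function<bool(Args...)> func;

  Restriction(Args... args, std::function<bool(Args...)> f)
      : data(args...), func(std::move(f)) {}

  // Apply the stored predicate to the stored arguments.
  bool validate() const { return std::apply(func, data); }
};
```

A hypothetical DPAS-style restriction could then be written as, say, `Restriction<int, int>(m, k, [](int m, int k) { return m <= 8 && k == 16; })`, and a verifier would simply loop over a list of such restrictions calling validate().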

return {1, 2, 3, 4, 5, 6, 7, 8};
}

std::vector<uint32_t> DPASInstruction::getSupportedK(mlir::Type type) {


I am a bit confused. I thought you would read the information from DLTI, not hard-code it inside the code.

Collaborator:

These hardware properties are static, so I'd say they should be hardcoded somewhere.
Whether you extract them from LLVM, inside a DLTI query function, or via a separate helper like this (or IMEX's target info) is a matter of what's most convenient.

I don't think we should be dumping the whole hardware specification into IR as DLTI attributes.
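As a sketch of the "hardcoded helper" approach, the supported K can be derived from the element type's bit width (the formula below assumes a systolic depth of 8 and 32-bit channels; the numbers are illustrative, not authoritative hardware data, and the function name mirrors but is not the PR's DPASInstruction::getSupportedK):

```cpp
#include <cstdint>
#include <vector>

// Illustrative: K = systolicDepth * opsPerChannel, where opsPerChannel is
// how many elements of the given bit width fit in a 32-bit channel.
// Assumed values: systolic depth 8, 32-bit channels.
inline std::vector<uint32_t> getSupportedK(unsigned elemBitWidth) {
  const uint32_t systolicDepth = 8;
  const uint32_t opsPerChannel = 32 / elemBitWidth; // e.g. 2 for 16-bit types
  return {systolicDepth * opsPerChannel};
}
```

Under these assumptions a 16-bit element type yields K = 16 and an 8-bit type yields K = 32; the real implementation would key off the mlir::Type rather than a raw bit width.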

llvm::errs() << "No parent module op.\n";

// If target device info is not attached, skip the target-specific checks
auto targetDeviceNameAttr = dlti::query(moduleOp, {"GPU", "name"});


We should discuss where to check the uArch info. The uArch info could be checked in the lowering passes, and the op definition should check general constraints (like a.m = c.m).

Owner Author:

Actually, the way the uArch info is set up, once the target is attached it can be queried from anywhere (pass and dialect).
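The query-from-anywhere setup can be sketched as a process-wide registry, along the lines of the uArchMap::instance().get(...) call quoted earlier in the thread (the member names and insert API below are hypothetical; only the singleton-lookup shape is taken from the discussion):

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical uArch description; the real one carries per-architecture
// properties and instruction utilities.
struct uArch {
  std::string name;
};

// Process-wide registry mapping a target name (e.g. attached via DLTI)
// to its uArch description, queryable from any pass or dialect.
class uArchMap {
public:
  static uArchMap &instance() {
    static uArchMap map; // constructed once on first use
    return map;
  }
  void insert(const std::string &name, uArch arch) {
    archs[name] = std::move(arch);
  }
  // Returns nullptr when the target name is unknown.
  const uArch *get(const std::string &name) const {
    auto it = archs.find(name);
    return it == archs.end() ? nullptr : &it->second;
  }

private:
  uArchMap() = default;
  std::map<std::string, uArch> archs;
};
```

A verifier would first query the module's DLTI attribute for the target name and only run the uArch-specific checks when `get(name)` returns a non-null entry, which matches the "skip if no target attached" behavior in the diff above.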


---
# The target architecture name.
arch: intel_gpu_pvc


How is this file being loaded?

Collaborator:

That's an older artifact, we shouldn't have yaml at all.

instruction_description: "2D Block Load instruction reads a rectangular block of memory in to GRF."
instruction_opcode: nan
instruction_functional_unit: load
instruction_type: simd


General: I don't think the yaml should be written following the terminology in the internal uArch spec for PVC (like simd, systolic_depth, repeat_count, vnni ...). Instead, it should be written following the SPIR-V extension (or the OpenCL C function): no mention of simd, using m/n/k and packing.

Owner Author:

Hi Jianhui,

We are not using the YAML in the current version.

8 participants