[XeGPU] uArch definition (PR 1/N) #2

Open · wants to merge 11 commits into base: main
Conversation


@mshahneo mshahneo commented Mar 28, 2025

Update:

After the first iteration of review comments, the current PR is updated in the following way:

  • The main pivot for this iteration is the utilities. These utilities are exposed via a range of Interfaces (e.g., TileOpInterface, MatrixOpInterface). A specific instruction for a specific architecture needs to implement these utilities.
  • The uArch information is now directly embedded in the utility implementation. We are not using .yaml files. However, I haven't removed the .yaml file yet, since it can still be utilized if we choose to do so in the future.
  • Also, last but not least, the current version provides an end-to-end prototype of how uArch can be used:
    -- Full uArch definition, utilities and all (at least for one instruction for now, DPAS).
    -- Definition for 2 uArchs (PVC, BMG).
    -- A pass to attach the uArch name to the module attribute using DLTI.
    -- Shows the usage of the uArch interface to verify XeGPU ops.
    -- A few test cases.

P.S. I still kept some old code intentionally, if we want to go back or use some of it. I'll clean up once we agree on the process.
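The interface-based pivot described above could be sketched roughly as follows. This is only an illustration: the names `MatrixOpInterface`, `PVCDPAS`, and the returned values are assumptions for the sketch, not the PR's actual code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the utility-interface pivot: a common interface
// that a specific instruction on a specific uArch implements.
struct MatrixOpInterface {
  virtual ~MatrixOpInterface() = default;
  // Supported M dimensions of the matrix op for a given element type.
  virtual std::vector<unsigned> getSupportedM(const std::string &type) = 0;
};

// PVC's DPAS implementing the interface; the values are placeholders.
struct PVCDPAS : MatrixOpInterface {
  std::vector<unsigned> getSupportedM(const std::string &type) override {
    return {1, 2, 3, 4, 5, 6, 7, 8}; // illustrative repeat counts
  }
};
```

A verifier pass could then query the interface for whatever uArch is attached to the module, instead of hard-coding per-target checks.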

================================

Hi All,

This is the initial PR for uArch definition. The primary purpose for this PR is to have a discussion and finalize the design. So please think of it as an RFC as well.

First target we want to achieve here is to finalize the skeleton:

  • Skeleton of the YAML file, what information we want in there
  • Skeleton of Data structures we want to have (e.g., necessary to create utilities to support the need)

Due to this purpose, the main focus points of this PR are the following files:

  • mlir/lib/Dialect/XeGPU/Utils/intel_gpu_pvc.yaml : holds the YAML definition of uArch (PVC is used for this use case)
  • mlir/include/mlir/Dialect/XeGPU/Utils/uArch.h : holds the base uArch struct that can be re-used by specialized uArchs such as PVC.
  • mlir/include/mlir/Dialect/XeGPU/Utils/IntelGpuPVC.h : holds the PVC-specific structs.

mlir/lib/Dialect/XeGPU/Utils/IntelGpuPVC.cpp is added as a placeholder for YAML mapping. This file is intentionally incomplete. We want to reach a consensus on the YAML and C++ structures before filling this one in.

Please provide your valuable feedback as we try to finalize this together.

@mshahneo mshahneo requested review from nbpatel and chencha3 March 28, 2025 13:27
} // namespace mlir

#endif // MLIR_DIALECT_XEGPU_UTILS_INTEL_GPU_PVC_H
//===--- IntelGpuPVC.h ---------------------------------------*- C++ -*-===//
Collaborator

this one should be on the top I guess

Owner Author

Ah, yes, I think it was auto-added by a Copilot suggestion. Will remove it :)

@mshahneo mshahneo requested a review from silee2 March 31, 2025 16:07
@adam-smnk
Collaborator

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions?
Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

Range systolic_depth;
Range repeat_count;
Range execution_size;
std::map<std::string, uint> ops_per_channel;
Collaborator

Opaque string mapping doesn't feel user friendly. Especially as it seems to be wrapping numerical bit width that could be a numerical key directly - I assume it's some limitation due to using (conversion from) yaml.

matrix_size is an even worse offender in that regard. These wrappers are hard to use without yaml definitions, which again makes me wonder if we really need to split it into two separate files.

Owner Author

Opaque string mapping doesn't feel user friendly. Especially as it seems to be wrapping numerical bit width that could be a numerical key directly - I assume it's some limitation due to using (conversion from) yaml.

You are right, the limitation is due to the yaml mapping.

matrix_size is an even worse offender in that regard. These wrappers are hard to use without yaml definitions, which again makes me wonder if we really need to split it into two separate files.

Sorry, the matrix size is a mistake. The value should be a vector of uint. But yes, in general you are right, the structs and yaml have to be used in conjunction to some extent.
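The reviewer's suggestion of keying by the numeric bit width directly, rather than an opaque string, could look roughly like this. The widths and values below are illustrative placeholders, not PVC's actual ops_per_channel table.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical alternative: ops_per_channel keyed by numeric bit width
// instead of an opaque string key. Values are placeholders only.
std::map<uint32_t, uint32_t> opsPerChannel = {
    {32, 1}, {16, 2}, {8, 4}, {4, 8}, {2, 8}};

// Returns 0 when the bit width is not supported by the instruction.
uint32_t lookupOpsPerChannel(uint32_t bitWidth) {
  auto it = opsPerChannel.find(bitWidth);
  return it == opsPerChannel.end() ? 0 : it->second;
}
```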

@mshahneo
Owner Author

mshahneo commented Apr 2, 2025

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions? Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

The C++ structure actually comes out of necessity. To use the yaml mapping utility in LLVM, we have to have a C++ object mapped to it, hence the structures. And since we need the C++ structures anyway, I wanted to make some of them re-usable across uArchs.

Do we expect sth to reuse yaml separately from C++

Not really, not in this context anyway. But we wanted to keep the base structs in such a way that they can be used in non-LLVM cases, but that's more of a hope than a necessity.

Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

I thought about this, but it takes me back to tablegen. Should we use tablegen?

@adam-smnk
Collaborator

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions? Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

The C++ structure actually comes out of necessity. To use the yaml mapping utility in LLVM, we have to have a C++ object mapped to it, hence the structures. And since we need the C++ structures anyway, I wanted to make some of them re-usable across uArchs.

Do we expect sth to reuse yaml separately from C++

Not really, not in this context anyway. But we wanted to keep the base structs in such a way that they can be used in non-LLVM cases, but that's more of a hope than a necessity.

It feels like the overhead of this approach in its complexity and possible maintenance cost is not really justified. On its own it seems fine but these C++ bindings are not great. 😅
Overall, I'd say yaml is not really a first-class citizen in MLIR so it'd face extra scrutiny during upstreaming.

Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

I thought about this, but it takes me back to tablegen. Should we use tablegen?

It's worth a try if what we need easily translates to tablegen entries and structure. That is, if you find yourself at any point fighting against tablegen or in need to introduce custom "hacks", I would not necessarily double down on it.

I would suggest starting with user interfaces to flesh out what's really needed. Then it should become clearer where to store all the uarch details e.g., embed shared memory size directly in getShmemSize() or create a separate struct or pull it from a constexpr dictionary that contains everything etc.

Next quick test would be to see how well it all holds up for another target like battlemage.
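The storage options Adam mentions (a getter that embeds the value, versus a separate per-uArch record) might look roughly like this. The names `getShmemSize` and `UArchInfo`, and every numeric value, are assumptions for the sketch, not actual hardware data.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Option 1 (hypothetical): embed the uArch detail directly in the getter.
constexpr uint64_t getShmemSize() { return 128 * 1024; } // placeholder size

// Option 2 (hypothetical): a constexpr per-uArch record holding everything,
// so new targets only add one entry instead of new getters.
struct UArchInfo {
  std::string_view name;
  uint64_t shmemSize;
  uint32_t numEUsPerXeCore;
};

constexpr UArchInfo kPVC{"intel_gpu_pvc", 128 * 1024, 8};
constexpr UArchInfo kBMG{"intel_gpu_bmg", 128 * 1024, 8};
```

A quick way to compare the options, as suggested, is to fill in the record for a second target like battlemage and see which form stays maintainable.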

@mshahneo
Owner Author

mshahneo commented Apr 4, 2025

A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions? Do we expect sth to reuse yaml separately from C++? Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

The C++ structure actually comes out of necessity. To use the yaml mapping utility in LLVM, we have to have a C++ object mapped to it, hence the structures. And since we need the C++ structures anyway, I wanted to make some of them re-usable across uArchs.

Do we expect sth to reuse yaml separately from C++

Not really, not in this context anyway. But we wanted to keep the base structs in such a way that they can be used in non-LLVM cases, but that's more of a hope than a necessity.

It feels like the overhead of this approach in its complexity and possible maintenance cost is not really justified. On its own it seems fine but these C++ bindings are not great. 😅 Overall, I'd say yaml is not really a first-class citizen in MLIR so it'd face extra scrutiny during upstreaming.

Can the C++ bindings be autogenerated or will it be a lot of manual boilerplate?

I thought about this, but it takes me back to tablegen. Should we use tablegen?

It's worth a try if what we need easily translates to tablegen entries and structure. That is, if you find yourself at any point fighting against tablegen or in need to introduce custom "hacks", I would not necessarily double down on it.

I would suggest starting with user interfaces to flesh out what's really needed. Then it should become clearer where to store all the uarch details e.g., embed shared memory size directly in getShmemSize() or create a separate struct or pull it from a constexpr dictionary that contains everything etc.

Next quick test would be to see how well it all holds up for another target like battlemage.

Thanks, Adam. I think that's a good idea. Let me try to collect the uArch info for battlemage. I looked into it during the design, but more details may help us find the approach that would be easy to maintain.

mshahneo pushed a commit that referenced this pull request May 6, 2025
mshahneo pushed a commit that referenced this pull request May 6, 2025
mshahneo pushed a commit that referenced this pull request May 8, 2025
mshahneo pushed a commit that referenced this pull request Jun 13, 2025
@mshahneo
Owner Author

Hi Adam (@adam-smnk), Igor (@Garra1980 ),

Modified the uarch definition based on our discussion.
In this version the Pivot is the utilities which necessary instructions have to implement.
I also moved the information to C++ (as part of get functions directly instead of in yaml).

Please let me know what you think. I only added implementation for DPAS. Will add Load/store/prefetch if we agree on the process.

Thanks a lot in advance :).

P.S. I still kept some old code intentionally, if we want to go back or use some of it. I'll clean up once we agree on the process.

@mshahneo mshahneo changed the title uArch definition (PR 1/N) [XeGPU] uArch definition (PR 1/N) Jun 26, 2025
@mshahneo
Owner Author

mshahneo commented Jun 26, 2025

@silee2 , @nbpatel , @chencha3 , @charithaintc , @Jianhui-Li, @akroviakov
please let me know if you have any comments/suggestions.
Thank you so much in advance :)

@Garra1980
Collaborator

also cc @tkarna

@tkarna

tkarna commented Jun 30, 2025

Do you think xegpu could at some point support the Data Layout and Target Information (DLTI) dialect for querying the uarch info?
FYI @rolfmorel

@rolfmorel

rolfmorel commented Jun 30, 2025

Thanks for adding me, @tkarna ! Agreed, it would be great to collaborate on ways to expose this target info through DLTI.

I think @tkarna's work on performant schedules for Xe - which have plentiful magic values derivable from hw info - should be a good exemplar of a user of Xe's DLTI. Showing how to automatically derive those values through DLTI would be a great success indicator.

I had a quick attempt at what exposing this data through DLTI might look like (here using the transform dialect though you can also do the same queries from C++):

```mlir
module attributes { #xevm.target<arch = "intel_gpu_pvc", ... = ?any other flags that vary across hardware?> } {
    ...
}
module attributes {transform.with_named_sequence} {
    transform.named_sequence @__main__(%mod: !transform.any_op) {
        %number_of_eus_per_xe_core = transform.dlti.query ["xe_core","#eus"] at %mod  // #xevm.target answers 8
        %number_of_threads_per_eu = transform.dlti.query ["xe_core","eu","#threads"] at %mod  // #xevm.target answers [8, 4]

        %shared_memory_size = transform.dlti.query ["xe","shared_memory","size"] at %mod  // #xevm.target answers 524288
        %shared_memory_alignment = transform.dlti.query ["xe","shared_memory","alignment"] at %mod  // #xevm.target answers 64

        %grf_bitwidth = transform.dlti.query ["xe","grf","bitwidth"] at %mod  // #xevm.target answers 512
        %grf_modes = transform.dlti.query ["xe","grf","modes"] at %mod  // #xevm.target answers ["small", "large"]
        %grf_count_per_thread = transform.dlti.query ["xe","grf","mode","small","per_thread"] at %mod  // #xevm.target answers 128

        %dpas_opcode = transform.dlti.query ["xe","inst","dpas","opcode"] at %mod  // #xevm.target answers 0x83
        %dpas_systolic_depth = transform.dlti.query ["xe","inst","dpas","systolic_depth"] at %mod  // #xevm.target answers [8]
        %dpas_repeat_count = transform.dlti.query ["xe","inst","dpas","repeat_count"] at %mod  // #xevm.target answers 1,2,3,4,5,6,7,8 as multiple associations for single result handle OR [1, 8] OR #dlti.map<"min" = 1, "max" = 8> OR #dlti.range<1,2,...,8>/#tune.range<1,2,...,8> which would also respond to "min" and "max" queries,
        %dpas_ops_per_channel = transform.dlti.query ["xe","inst","dpas","ops_per_channel"] at %mod  // #xevm.target answers #dlti.map<19: 1, 16: 2, 8: 4, 4: 8, 2:8>

        %dpas_supported_types = transform.dlti.query ["xe","inst","dpas","supported_types"] at %mod  // #xevm.target answers #xevm.dpas_types so that ...
        %dpas_supported_types_B_wildcard = transform.dlti.query ["xe","inst","dpas","supported_types", f16, f16, f16, #dlti._] at %mod  // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16
        %dpas_supported_types_Dst_wildcard = transform.dlti.query ["xe","inst","dpas","supported_types", #dlti._, f16, f16, f16] at %mod  // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16,f32 (i.e. two type params associated to the handle)
        %dpas_supported_types_Dst_and_Acc_wildcard:2 = transform.dlti.query ["xe","inst","dpas","supported_types", #dlti._, #dlti._, f16, f16] at %mod  // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16,f32,f16,f32 for first result and f16,f16,f32,f32 for second result

        ... // transform ops would take the above Values as parameters
    }
}
```

That is, there's an #xevm.target attribute (or #xegpu.target) that handles most queries directly, though it can also delegate to attributes like #xevm.dpas_types (in this case constructed by #xevm.target) if so desired. This means there should be an XeTargetAttr implementing the DLTIQueryInterface's single query(..) method. We can discuss the best mechanism for implementing this method.

Not every query above is (easily) implemented with current upstream DLTI. I am working on a PR (oriented towards DLTI attributes that respond to queries by querying for target info and applying cost models) that will enable all of the above; I will push it as a draft this week.

@mshahneo
Owner Author

mshahneo commented Jun 30, 2025

@tkarna , yes, we would like to.
Our current plan was to utilize DLTI only to query/expose the target triple.

As @rolfmorel pointed out, the current DLTI is too verbose and simple. @adam-smnk & I spoke about using DLTI directly; the issue was that mapping all the hardware info into the DLTI attribute made the IR way too verbose and big (e.g., think of the .yaml file in the attribute list :().

Thanks @rolfmorel for the detailed example. I remember you said that you envision that a query can essentially take a C++ function as a key. That's why in the current iteration I tried to make the utilities first-class citizens; this way, any ops that implement these interfaces can be queried via DLTI:

%dpas_supported_M_sizes = transform.dlti.query ["xe","inst","dpas","get_supported_M", "bf16"] at %mod // #xevm.target answers the possible M sizes for type bf16...

This would essentially call that specific uArch's implementation of getSupportedM() function.

I'd like to discuss more with you all on this.
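The idea of routing a string query key to a uArch utility function could be sketched like this. The key name, `getSupportedM`, and the returned values are illustrative assumptions only, not the actual query mechanism.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical uArch utility the query would ultimately call.
std::vector<unsigned> getSupportedM(const std::string &type) {
  return {1, 2, 4, 8}; // placeholder values
}

// A query key such as "get_supported_M" dispatches to the C++ utility,
// so a DLTI-style query can be answered by the uArch implementation.
using QueryFn = std::function<std::vector<unsigned>(const std::string &)>;
std::map<std::string, QueryFn> queryTable = {
    {"get_supported_M", getSupportedM}};

std::vector<unsigned> runQuery(const std::string &key,
                               const std::string &type) {
  auto it = queryTable.find(key);
  return it == queryTable.end() ? std::vector<unsigned>{} : it->second(type);
}
```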

Collaborator

@adam-smnk adam-smnk left a comment

I like the general direction 👍
The overall concept to have common descriptors (the uArch.h) which can have platform specific implementations (PVC, battlemage variants etc.) is great.

I haven't paid too close attention to exact implementation yet. However, it is a bit hard to really judge these helper APIs and their exact abstraction right now. I think it might be best to just start with a small prototype as soon as possible and just iterate on the go.

@chencha3 I think the on-going XeGPU verifier relaxation could be a great opportunity to run more hands-on experiments with this infrastructure. Perhaps a small uArch-aware validation pass for XeGPU or XeVM ops?

@tkarna If you could share what key hardware properties you already use for tuning or which further info could help you in that, it'd be super valuable feedback to help steer this design.

Comment on lines 37 to 38
uint no_of_dims;
std::vector<uint> dims;
Collaborator

What exactly is the meaning of no_of_dims?
In this case, if dims.size() is different than no_of_dims, then what happens?

Owner Author

You are right, we could just remove no_of_dims and get it from dims.size().
Maybe remove the class itself. Let me see.

@tkarna

tkarna commented Jul 1, 2025

@tkarna If you could share what key hardware properties you already use for tuning or which further info could help you in that, it'd be super valuable feedback to help steer this design.

In my preliminary tests I've used the following parameters for a 4k matmul benchmark.

Fixed parameters:

  • DPAS tile sizes (n, m, k)

Tunable parameters:

  • WG tile (n, m); SG tile (n, m); K tile size
  • Prefetch tile sizes for A and B
  • Block (load) tile sizes for A and B

Implemented constraints: a tile size must not exceed the parent tile size and must divide the parent tile evenly. SG tiling must not exceed the max number of threads (32). The prefetch tile size must be compatible with the SG-level thread configuration (coop prefetching).

There are other hardware constraints I have not explicitly expressed yet, for example maximum/supported load tile size and VNNI packing constraints for B. Some configs do spill registers and run slower.
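The tiling constraints listed above could be expressed as a simple validity check. This is a sketch under assumed shapes and the stated 32-thread limit; it is not the actual tuning code.

```cpp
#include <cassert>

// A child tile must be nonzero, fit in its parent, and divide it evenly.
bool dividesEvenly(unsigned parent, unsigned child) {
  return child != 0 && child <= parent && parent % child == 0;
}

// SG tiling of a WG tile must also respect the thread budget:
// the number of subgroups covering the WG tile cannot exceed maxThreads.
bool validSGTiling(unsigned wgM, unsigned wgN, unsigned sgM, unsigned sgN,
                   unsigned maxThreads = 32) {
  if (!dividesEvenly(wgM, sgM) || !dividesEvenly(wgN, sgN))
    return false;
  return (wgM / sgM) * (wgN / sgN) <= maxThreads;
}
```

Further constraints like maximum load tile size or VNNI packing for B would slot in as additional predicates of the same shape.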

@mshahneo
Owner Author

mshahneo commented Jul 1, 2025

I like the general direction 👍 The overall concept to have common descriptors (the uArch.h) which can have platform specific implementations (PVC, battlemage variants etc.) is great.

I haven't paid too close attention to exact implementation yet. However, it is a bit hard to really judge these helper APIs and their exact abstraction right now. I think it might be best to just start with a small prototype as soon as possible and just iterate on the go.

@chencha3 I think the on-going XeGPU verifier relaxation could be a great opportunity to run more hands-on experiments with this infrastructure. Perhaps a small uArch-aware validation pass for XeGPU or XeVM ops?

@tkarna If you could share what key hardware properties you already use for tuning or which further info could help you in that, it'd be super valuable feedback to help steer this design.

Thank you so much, @adam-smnk, for the feedback :-)
I am working on the prototype; I should be able to push the changes today.

@mshahneo mshahneo force-pushed the uarch_definition branch 2 times, most recently from 042311e to 94ae51e Compare July 2, 2025 05:05
mshahneo pushed a commit that referenced this pull request Jul 2, 2025
@mshahneo mshahneo changed the base branch from main to dummy_main July 2, 2025 05:08
mshahneo added 4 commits July 2, 2025 05:10
This version focuses on the utilities to be the pivot.
It also saves info directly in C++ files as part of get functions.
Don't use the yamls anymore.

Adds support for DPAS instruction.
Remove compiler warnings and errors.
@mshahneo mshahneo force-pushed the uarch_definition branch from 94ae51e to af7098b Compare July 2, 2025 05:11
@mshahneo mshahneo deleted the branch main July 2, 2025 05:12
@mshahneo mshahneo closed this Jul 2, 2025
@mshahneo mshahneo reopened this Jul 2, 2025
@mshahneo mshahneo changed the base branch from dummy_main to main July 2, 2025 05:17

mshahneo commented Jul 2, 2025

Hi @adam-smnk , @rolfmorel , @tkarna , @Jianhui-Li, @silee2, @chencha3 , @charithaintc,

I modified the PR to add a prototype. Please take a look at your convenience.

Owner Author:

Potentially, yes.

@@ -577,6 +579,45 @@ LogicalResult DpasOp::verify() {
if (getAcc() && getAcc().getType() != getResultType())
return emitOpError("Expecting the acc type to be the same as result.");

// @uArch: Check if the types are supported for DPAS.
Operation *op = getOperation();
Collaborator:

I guess this piece can be hidden inside the uArch utilities so the client can just say something like
auto res = verify(op, "dpas");
?

Owner Author:

The issue with this is that it requires the utility to have all the parameters of the op.
Alternatively, the op would have to implement the same interface so that the uArch utility can get the parameters.
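The interface-based alternative can be sketched in standalone C++ (all names here — MatrixOpInterface, verifyMatrixOp, ToyDpasOp — are hypothetical illustrations, not the actual XeGPU classes): the op exposes its parameters through a small interface, and a generic uArch utility verifies them against the supported sizes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical op-side interface: the op exposes the shape parameters
// the uArch utility needs, instead of the utility taking them all as args.
struct MatrixOpInterface {
  virtual ~MatrixOpInterface() = default;
  virtual unsigned getM() const = 0;
  virtual unsigned getN() const = 0;
  virtual unsigned getK() const = 0;
};

// uArch-side utility: checks the op's shape against the supported sizes.
inline bool verifyMatrixOp(const MatrixOpInterface &op,
                           const std::vector<unsigned> &supportedM,
                           const std::vector<unsigned> &supportedN,
                           const std::vector<unsigned> &supportedK) {
  auto contains = [](const std::vector<unsigned> &v, unsigned x) {
    return std::find(v.begin(), v.end(), x) != v.end();
  };
  return contains(supportedM, op.getM()) && contains(supportedN, op.getN()) &&
         contains(supportedK, op.getK());
}

// A toy op implementing the interface.
struct ToyDpasOp : MatrixOpInterface {
  unsigned m, n, k;
  ToyDpasOp(unsigned m, unsigned n, unsigned k) : m(m), n(n), k(k) {}
  unsigned getM() const override { return m; }
  unsigned getN() const override { return n; }
  unsigned getK() const override { return k; }
};
```

With this shape, the verifier call collapses to one line per op, at the cost of each op implementing the interface.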

@@ -0,0 +1,14 @@
module @eltwise_add attributes {gpu.container_module} {
Collaborator:

It seems the test checks are missing here.

Owner Author:

Ah, sorry Chao, this test is supposed to be removed. The other 2 tests cover the mechanism. I pushed it initially and forgot to remove it.

"xegpu::XeGPUDialect", "mlir::DLTIDialect"
];
}

Collaborator:

I want to raise the following discussion points:

  1. NV and AMD are putting nvvm-attach-target and rocdl-attach-target under the GPU dialect. Not sure whether it is better for us to follow their way too.
  2. They are attached to GPUModuleOp, a finer granularity than ours. It may be a requirement of the GPU dialect. A potential benefit is that they may be able to support the case of a ModuleOp containing multiple GPUModuleOps, with each GPUModuleOp mapping to a different GPU from a different vendor.
  3. They used NVVMTargetAttr and ROCDLTargetAttr to store the info. Now, we have XeVMTargetAttr, can we reuse it here too? Considering we only capture the device name here, which is also captured by XeVMTargetAttr?

Collaborator:

Sang Ik created this PR: llvm#147372. Is it similar to the pass here?

Collaborator:

I agree, this appears similar to XeVM's attach-target pass that is yet to be merged. After the merge, the chip string of the XeVM target should be appropriate for querying uArch info:

    auto targetDeviceArch =
        mlir::xegpu::uArch::uArchMap::instance().get(targetDeviceNameStr);

@rolfmorel (Jul 8, 2025):

Do the XeVM targets have a sensible notion of an LLVM triple alongside that of chip? How about (LLVM) features? If so, it would probably make sense to have a #xevm.target attribute, like the one proposed above, which implements the LLVMTargetAttrInterface from this WIP upstream PR:
https://github.com/llvm/llvm-project/pull/145899/files#diff-6c2503d165a7390c955c3c4fa4a76fd1991633b5ec597a1b1fd92731e1f3684dR572

I am in the process of making #nvvm.target and #rocdl.target attributes implement this interface (in addition to the generic #llvm.target attr in that PR). If sensible, the XeVM dialect should probably (try to) mirror this.


Given this in https://github.com/llvm/llvm-project/pull/147372/files#diff-3da7b8747032bb581f119fd49037d06706443c3c5db56d4d28cc74053a9b754dR284 in the Add xevm-attach-target transform pass PR, I take it there's a sensible triple.

I will have a look at making xevm.target also implement my interface.

Owner Author:

Thanks Chao, Rolf, Artem.

You are right once XeVM target attribute gets upstreamed, we should use it. This pass is here just to show the use case.


👍

On that note, I found that the #xevm.target attr itself already got merged: https://github.com/llvm/llvm-project/blob/b9d7513bf134febe72c05a04ff20f87191d7213a/mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td#L521

We can get cracking!

dpas:
instruction_name: dpas
instruction_description: "Dot Product Accumulate Systolic (DPAS) is a matrix multiply-add operation"
instruction_opcode: "0x83"


Why opcode?

Owner Author:

I added some extra fields in the structs in case they become useful in the future; they are not used in our current use-case. I can remove them if we don't think they are necessary.

// no_of_dims: 2
// dim: [2, 2]
// This represents a 2x2 tile
struct Tile {


Tile is an overloaded term. It looks to me like just a shape. Is there any pre-defined data type you can reuse?

Owner Author:

Sorry, Jianhui. I am not using it in the current version of the implementation. I kept it in case we want to use it in some fashion later.

The current implementation uses std::vector.

Collaborator:

@mshahneo Looks like we should remove older version of the interface and keep the one we want to proceed with

// - [2, 16]
// This represents a 2x2 RangeTile where the first dimension can have values
// from 1 to 32 and the second dimension can have values from 2 to 16
struct RangeTile {


Why not just use DiscreteTile, if this is not used that often?

Owner Author:

I kept two different ways to specify the range.

// dims:
// - [1, 2, 4, 8, 16, 32]
// - [2, 4, 8, 16]
// This represents a 2x2 DiscreteTile where the first dimension can have values
@Jianhui-Li (Jul 8, 2025):

It seems that DiscreteTile represents shapes combined from lists of sizes. How about naming it ShapeCombination or something more specific to our problem and less general?

Owner Author:

That makes sense. The current implementation actually uses std::vector.
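The "lists of sizes per dimension" idea can be sketched as a standalone helper (the name isSupportedShape and the exact signature are illustrative, not the PR's actual code): a shape is supported when each dimension's size appears in the corresponding allowed-size list.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative check for a discrete shape combination: one list of allowed
// sizes per dimension; a shape is supported when every dimension's size
// appears in the corresponding list.
inline bool isSupportedShape(
    const std::vector<std::vector<int64_t>> &allowedSizes,
    const std::vector<int64_t> &shape) {
  if (shape.size() != allowedSizes.size())
    return false;
  for (std::size_t d = 0; d < shape.size(); ++d) {
    const auto &dim = allowedSizes[d];
    if (std::find(dim.begin(), dim.end(), shape[d]) == dim.end())
      return false;
  }
  return true;
}
```

For the example from the diff — dims [1, 2, 4, 8, 16, 32] and [2, 4, 8, 16] — a shape like 8x16 would pass and 3x16 would not.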

std::tuple<Args...> data;
std::function<bool(Args...)> func;

Restriction(Args... args, std::function<bool(Args...)> f)


The term "restriction" is also very generic. Please consider making it specific, e.g., validating that an instance belongs to a combination?

Owner Author:

Restriction is supposed to model all restrictions.
A common validate() could then just apply all the restrictions. However, we can always go back to specific validations. I'll think about this.
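A minimal standalone sketch of that idea, assuming the std::tuple/std::function layout shown in the diff above (the validate() method and the example predicate are illustrative additions, not the PR's exact code): the arguments are captured once, and validate() applies the stored predicate to them via std::apply.

```cpp
#include <functional>
#include <tuple>
#include <utility>

// Sketch of a generic Restriction: stores the arguments and a predicate,
// so a common validate() can apply every restriction uniformly.
template <typename... Args>
struct Restriction {
  std::tuple<Args...> data;
  std::function<bool(Args...)> func;

  Restriction(Args... args, std::function<bool(Args...)> f)
      : data(args...), func(std::move(f)) {}

  // Apply the stored predicate to the stored arguments.
  bool validate() const { return std::apply(func, data); }
};
```

A hypothetical DPAS-style restriction could then be written as, say, `Restriction<int, int>(m, k, [](int m, int k) { return m <= 8 && k == 16; })`, and a verifier would simply loop over a list of such restrictions calling validate().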

return {1, 2, 3, 4, 5, 6, 7, 8};
}

std::vector<uint32_t> DPASInstruction::getSupportedK(mlir::Type type) {


I am a bit confused. I thought you would read the information from DLTI, not hard-code it inside the code.

Collaborator:

These hardware properties are static, so I'd say they should be hardcoded somewhere.
Whether you extract them from LLVM, inside a DLTI query function, or via a separate helper like this (or IMEX's target info) is a matter of what's most convenient.

I don't think we should be dumping the whole hardware specification into IR as DLTI attributes.
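As a sketch of the "hardcoded helper" approach, the supported K can be derived from the element type's bit width (the formula below assumes a systolic depth of 8 and 32-bit channels; the numbers are illustrative, not authoritative hardware data, and the function name mirrors but is not the PR's DPASInstruction::getSupportedK):

```cpp
#include <cstdint>
#include <vector>

// Illustrative: K = systolicDepth * opsPerChannel, where opsPerChannel is
// how many elements of the given bit width fit in a 32-bit channel.
// Assumed values: systolic depth 8, 32-bit channels.
inline std::vector<uint32_t> getSupportedK(unsigned elemBitWidth) {
  const uint32_t systolicDepth = 8;
  const uint32_t opsPerChannel = 32 / elemBitWidth; // e.g. 2 for 16-bit types
  return {systolicDepth * opsPerChannel};
}
```

Under these assumptions a 16-bit element type yields K = 16 and an 8-bit type yields K = 32; the real implementation would key off the mlir::Type rather than a raw bit width.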

llvm::errs() << "No parent module op.\n";

// If target device info is not attached, skip the target-specific checks
auto targetDeviceNameAttr = dlti::query(moduleOp, {"GPU", "name"});


We should discuss where to check the uArch info. The uArch info could be checked in the lowering passes, and the op definition should check general constraints (like a.m = c.m).

Owner Author:

Actually, the way the uArch info is set up, once the target is attached it can be queried from anywhere (pass and dialect).
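The query-from-anywhere setup can be sketched as a process-wide registry, along the lines of the uArchMap::instance().get(...) call quoted earlier in the thread (the member names and insert API below are hypothetical; only the singleton-lookup shape is taken from the discussion):

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical uArch description; the real one carries per-architecture
// properties and instruction utilities.
struct uArch {
  std::string name;
};

// Process-wide registry mapping a target name (e.g. attached via DLTI)
// to its uArch description, queryable from any pass or dialect.
class uArchMap {
public:
  static uArchMap &instance() {
    static uArchMap map; // constructed once on first use
    return map;
  }
  void insert(const std::string &name, uArch arch) {
    archs[name] = std::move(arch);
  }
  // Returns nullptr when the target name is unknown.
  const uArch *get(const std::string &name) const {
    auto it = archs.find(name);
    return it == archs.end() ? nullptr : &it->second;
  }

private:
  uArchMap() = default;
  std::map<std::string, uArch> archs;
};
```

A verifier would first query the module's DLTI attribute for the target name and only run the uArch-specific checks when `get(name)` returns a non-null entry, which matches the "skip if no target attached" behavior in the diff above.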


---
# The target architecture name.
arch: intel_gpu_pvc


How is this file being loaded?

Collaborator:

That's an older artifact, we shouldn't have yaml at all.

instruction_description: "2D Block Load instruction reads a rectangular block of memory in to GRF."
instruction_opcode: nan
instruction_functional_unit: load
instruction_type: simd


General: I don't think the yaml should be written following the terminology in the internal uArch spec for PVC (like simd, systolic_depth, repeat_count, vnni ...). Instead, it should be written following the SPIR-V extension (or the OpenCL C function): no mention of simd, using m/n/k and packing.

Owner Author:

Hi Jianhui,

We are not using the YAML in the current version.

8 participants