[XeGPU] uArch definition (PR 1/N) #2
base: main
Conversation
force-pushed from 4666e33 to 5dd0ebe
} // namespace mlir

#endif // MLIR_DIALECT_XEGPU_UTILS_INTEL_GPU_PVC_H
//===--- IntelGpuPVC.h ---------------------------------------*- C++ -*-===//
this one should be at the top, I guess
Ah, yes, I think it was auto-added by a Copilot suggestion. Will remove it :)
A general question on the structure - what's the benefit of maintaining both yaml and C++ definitions?
Range systolic_depth;
Range repeat_count;
Range execution_size;
std::map<std::string, uint> ops_per_channel;
Opaque string mapping doesn't feel user-friendly, especially as it seems to be wrapping a numerical bit width that could be a numerical key directly - I assume it's some limitation due to using (conversion from) yaml.
matrix_size is an even worse offender in that regard. These wrappers are hard to use without the yaml definitions, which again makes me wonder if we really need to split it into two separate files.
> Opaque string mapping doesn't feel user friendly. Especially as it seems to be wrapping numerical bit width that could be a numerical key directly - I assume it's some limitation due to using (conversion from) yaml.
You are right, the limitation is due to the yaml mapping.
> matrix_size is even worse offender in that regard. These wrappers are hard to use without yaml definitions which again makes me wonder if we really need to split it into two separate files.
Sorry, the matrix size is a mistake. The value should be a vector of uint. But yes, in general you are right, the structs and yaml to some extent have to be used in conjunction.
The C++ structure actually comes out of necessity. To use the yaml mapping utility in LLVM, we have to have a C++ object mapped to it, hence the structures. And since we need the C++ structures anyway, I wanted to make some of them re-usable across uArchs.
Not really, not in this context anyway. But we wanted to keep the base structs in such a way that they can be used in non-LLVM cases; that's more of a hope than a necessity.
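For readers unfamiliar with LLVM's YAML I/O, here is a minimal sketch (not the PR's actual code; struct and field names are illustrative) of why a C++ struct is required for every YAML mapping: `llvm::yaml::MappingTraits` has to be specialized for the struct before `yaml::Input`/`yaml::Output` can (de)serialize it.

```cpp
// Minimal sketch of LLVM YAML I/O, assuming a hypothetical DPASInfoYAML struct.
#include <cstdint>
#include <string>
#include "llvm/Support/YAMLTraits.h"

struct DPASInfoYAML {
  std::string instruction_name;
  uint32_t systolic_depth;
  uint32_t repeat_count;
};

namespace llvm {
namespace yaml {
// The YAML reader/writer only handles types with a MappingTraits
// specialization, which is why each YAML mapping needs a C++ counterpart.
template <> struct MappingTraits<DPASInfoYAML> {
  static void mapping(IO &io, DPASInfoYAML &info) {
    io.mapRequired("instruction_name", info.instruction_name);
    io.mapRequired("systolic_depth", info.systolic_depth);
    io.mapRequired("repeat_count", info.repeat_count);
  }
};
} // namespace yaml
} // namespace llvm
```

Deserializing is then just constructing `llvm::yaml::Input` over the buffer and streaming into a `DPASInfoYAML` instance.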
I thought about this, but it takes me back to tablegen. Should we use tablegen?
It feels like the overhead of this approach - its complexity and possible maintenance cost - is not really justified. On its own it seems fine, but these C++ bindings are not great. 😅
It's worth a try if what we need easily translates to tablegen entries and structure. That said, if you find yourself at any point fighting against tablegen or needing to introduce custom "hacks", I would not necessarily double down on it. I would suggest starting with the user interfaces to flesh out what's really needed. Then it should become clearer where to store all the uarch details, e.g., embed shared memory size directly in [...]. A next quick test would be to see how well it all holds up for another target like battlemage.
Thanks, Adam. I think that's a good idea. Let me try to collect the uArch info for battlemage. I looked into it during the design, but more details may help us find an approach that would be easy to maintain.
Hi Adam (@adam-smnk), Igor (@Garra1980), I modified the uArch definition based on our discussion. Please let me know what you think. I only added an implementation for DPAS; I will add load/store/prefetch if we agree on the process. Thanks a lot in advance :). P.S. I still kept some old code intentionally, in case we want to go back or use some of it. I'll clean it up once we agree on the process.
@silee2, @nbpatel, @chencha3, @charithaintc, @Jianhui-Li, @akroviakov
also cc @tkarna
Do you think xegpu could at some point support the Data Layout and Target Information (DLTI) dialect for querying the uarch info?
Thanks for adding me, @tkarna! Agreed, it would be great to collaborate on ways to expose this target info through DLTI. I think @tkarna's work on performant schedules for Xe - which have plentiful magic values derivable from hw info - should be a good exemplar of a user of Xe's DLTI. Showing how to automatically derive those values through DLTI would be a great success indicator. I had a quick attempt at what exposing this data through DLTI might look like (here using the transform dialect, though you can also do the same queries from C++):
module attributes { #xevm.target<arch = "intel_gpu_pvc", ... = ?any other flags that vary across hardware?> } {
...
}
module {transform.with_named_sequence} {
transform.named_sequence @__main__(%mod: !transform.any_op) {
%number_of_eus_per_xe_core = transform.dlti.query ["xe_core","#eus"] at %mod // #xevm.target answers 8
%number_of_threads_per_eu = transform.dlti.query ["xe_core","eu","#threads"] at %mod // #xevm.target answers [8, 4]
%shared_memory_size = transform.dlti.query ["xe","shared_memory","size"] at %mod // #xevm.target answers 524288
%shared_memory_alignment = transform.dlti.query ["xe","shared_memory","alignment"] at %mod // #xevm.target answers 64
%grf_bitwidth = transform.dlti.query ["xe","grf","bitwidth"] at %mod // #xevm.target answers 512
%grf_modes = transform.dlti.query ["xe","grf","modes"] at %mod // #xevm.target answers ["small", "large"]
%grf_count_per_thread = transform.dlti.query ["xe","grf","mode","small","per_thread"] at %mod // #xevm.target answers 128
%dpas_opcode = transform.dlti.query ["xe","inst","dpas","opcode"] at %mod // #xevm.target answers 0x83
%dpas_systolic_depth = transform.dlti.query ["xe","inst","dpas","systolic_depth"] at %mod // #xevm.target answers [8]
%dpas_repeat_count = transform.dlti.query ["xe","inst","dpas","repeat_count"] at %mod // #xevm.target answers 1,2,3,4,5,6,7,8 as multiple associations for single result handle OR [1, 8] OR #dlti.map<"min" = 1, "max" = 8> OR #dlti.range<1,2,...,8>/#tune.range<1,2,...,8> which would also respond to "min" and "max" queries,
%dpas_ops_per_channel = transform.dlti.query ["xe","inst","dpas","ops_per_channel"] at %mod // #xevm.target answers #dlti.map<19: 1, 16: 2, 8: 4, 4: 8, 2:8>
%dpas_supported_types = transform.dlti.query ["xe","inst","dpas","supported_types"] at %mod // #xevm.target answers #xevm.dpas_types so that ...
%dpas_supported_types_B_wildcard = transform.dlti.query ["xe","inst","dpas","supported_types", f16, f16, f16, #dlti._] at %mod // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16
%dpas_supported_types_Dst_wildcard = transform.dlti.query ["xe","inst","dpas","supported_types", #dlti._, f16, f16, f16] at %mod // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16,f32 (i.e. two type params associated to the handle)
%dpas_supported_types_Dst_and_Acc_wildcard:2 = transform.dlti.query ["xe","inst","dpas","supported_types", #dlti._, #dlti._, f16, f16] at %mod // #xevm.dpas_types (yielded by #xevm.target which answered prefix) answers f16,f32,f16,f32 for first result and f16,f16,f32,f32 for second result
... // transform ops would take the above Values as parameters
}
}

That is, there's an [...]. Not every query above is (easily) implemented with current upstream DLTI. I am working on a PR (oriented towards DLTI attributes that respond to queries by querying for target info and applying cost models) that will enable all of the above; I will push it as a draft this week.
@tkarna, yes, we would like to. As @rolfmorel pointed out, the current DLTI is too verbose and simple. @adam-smnk and I spoke about using DLTI directly; the issue was that mapping all the hardware info into the DLTI attribute made the IR way too verbose and big (e.g., think of the .yaml file in the attribute list :( ). Thanks @rolfmorel for the detailed example. I remember you said that you envision that a query can essentially take a C++ function as a key. That's why, in the current iteration, I tried to make the utilities first-class citizens; this way, any ops that implement these interfaces can be queried via DLTI:
This would essentially call that specific uArch's implementation of the getSupportedM() function. I'd like to discuss this more with you all.
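As a rough illustration of the "utilities as first-class citizens" idea, a sketch along these lines (all names are assumptions, not the PR's final API): a target-specific instruction description overrides a query hook, and a verifier or DLTI query implementation consumes only the base interface. The {1..8} list mirrors the DPAS repeat-count range quoted elsewhere in this thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical uArch-side description with a query hook.
struct MatrixOpDescription {
  virtual ~MatrixOpDescription() = default;
  virtual std::vector<uint32_t> getSupportedM() const = 0;
};

// Hypothetical PVC override.
struct PVCDPASDescription : MatrixOpDescription {
  std::vector<uint32_t> getSupportedM() const override {
    return {1, 2, 3, 4, 5, 6, 7, 8};
  }
};

// A verifier-style client (or a DLTI query hook) sees only the base interface.
bool isSupportedM(const MatrixOpDescription &desc, uint32_t m) {
  auto ms = desc.getSupportedM();
  return std::find(ms.begin(), ms.end(), m) != ms.end();
}
```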
I like the general direction 👍
The overall concept of having common descriptors (the uArch.h) which can have platform-specific implementations (PVC, battlemage variants, etc.) is great.
I haven't paid too close attention to the exact implementation yet. However, it is a bit hard to really judge these helper APIs and their exact abstraction right now. I think it might be best to just start with a small prototype as soon as possible and iterate on the go.
@chencha3 I think the on-going XeGPU verifier relaxation could be a great opportunity to run more hands-on experiments with this infrastructure. Perhaps a small uArch-aware validation pass for XeGPU or XeVM ops?
@tkarna If you could share what key hardware properties you already use for tuning or which further info could help you in that, it'd be super valuable feedback to help steer this design.
uint no_of_dims;
std::vector<uint> dims;
What exactly is the meaning of no_of_dims? In this case, if dims.size() is different than no_of_dims, then what happens?
You are right, we could just remove no_of_dims and get it from dims.size().
Maybe remove the class itself. Let me see.
In my preliminary tests I've used the following parameters for a 4k matmul benchmark. Fixed parameters:
Tunable parameters:
Implemented constraints: tile size must not exceed the parent tile size and must divide the parent tile evenly. SG tiling must not exceed the max number of threads (32). Prefetch tile size must be compatible with the SG-level thread configuration (coop prefetching). There are other hardware constraints I have not explicitly expressed yet, for example maximum/supported load tile size and VNNI packing constraints for B. Some configs do spill registers and run slower.
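A small sketch of those constraints as code, only to make the checks concrete; the 32-thread limit and the divisibility rules are taken from the comment above, and the function names are made up.

```cpp
#include <cstdint>

constexpr uint32_t kMaxThreadsPerWorkgroup = 32; // assumption from the comment above

// Tile must not exceed the parent tile and must divide it evenly.
bool tileFitsParent(uint32_t tileM, uint32_t tileN,
                    uint32_t parentM, uint32_t parentN) {
  return tileM <= parentM && tileN <= parentN &&
         parentM % tileM == 0 && parentN % tileN == 0;
}

// Subgroup tiling must not exceed the maximum number of threads.
bool sgTilingIsValid(uint32_t sgRows, uint32_t sgCols) {
  return sgRows * sgCols <= kMaxThreadsPerWorkgroup;
}
```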
Thank you so much, @adam-smnk, for the feedback :-).
force-pushed from 042311e to 94ae51e
This version makes the utilities the pivot. It also saves the info directly in C++ files as part of the get functions and no longer uses the yamls. Adds support for the DPAS instruction.
Remove compiler warnings and errors.
force-pushed from 94ae51e to af7098b
Hi @adam-smnk, @rolfmorel, @tkarna, @Jianhui-Li, @silee2, @chencha3, @charithaintc, I modified the PR to add a prototype. Please take a look at your convenience.
Am I correct that this file is supposed to replace current https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUTargetInfo.h?
Potentially, yes.
@@ -577,6 +579,45 @@ LogicalResult DpasOp::verify() {
  if (getAcc() && getAcc().getType() != getResultType())
    return emitOpError("Expecting the acc type to be the same as result.");

  // @uArch: Check if the types are supported for DPAS.
  Operation *op = getOperation();
I guess this piece can be hidden inside the uArch utilities, so the client just says something like
auto res = verify(op, "dpas");
?
The issue with this is that it requires the utility to have all the parameters of the op. Alternatively, the op would have to implement the same interface so that the uArch utility can get the parameters.
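For illustration, the interface-based alternative could look roughly like this - a sketch under the assumption that the op exposes its uArch-relevant parameters; none of these names exist in the PR.

```cpp
#include <cstdint>

// Hypothetical "op-side" interface: the op reports the parameters the uArch
// check needs, so a generic helper never has to know the concrete op class.
struct DPASShapeQuery {
  virtual ~DPASShapeQuery() = default;
  virtual uint32_t getM() const = 0;
  virtual uint32_t getN() const = 0;
  virtual unsigned getElemBitWidth() const = 0;
};

// Generic uArch-side check, parameterized by target limits.
bool verifyDpasShape(const DPASShapeQuery &op, uint32_t maxRepeatCount,
                     uint32_t executionSize) {
  return op.getM() <= maxRepeatCount && op.getN() == executionSize;
}
```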
@@ -0,0 +1,14 @@
module @eltwise_add attributes {gpu.container_module} {
It seems the test checks are missing here.
Ah, sorry Chao, this test is supposed to be removed. The other two tests cover the mechanism. I pushed it initially and forgot to remove it.
"xegpu::XeGPUDialect", "mlir::DLTIDialect" | ||
];
}
I want to raise the following discussion points:
- NV and AMD are putting nvvm-attach-target and rocdl-attach-target under the GPU dialect. Not sure whether it is better for us to follow their way too.
- They are attached to GPUModuleOp, a finer granularity than ours. It may be a requirement of the GPU dialect. A potential benefit is that they may be able to support a ModuleOp containing multiple GPUModuleOps, with each GPUModuleOp mapped to a different GPU from a different vendor.
- They used NVVMTargetAttr and ROCDLTargetAttr to store the info. Now that we have XeVMTargetAttr, can we reuse it here too, considering we only capture the device name here, which is also captured by XeVMTargetAttr?
Sang Ik created this PR: llvm#147372. Is it similar to the pass here?
I agree, this appears similar to xevm's attach-target pass that is yet to be merged. After the merge, the chip string of the xevm target should be appropriate for querying uArch info:
auto targetDeviceArch =
mlir::xegpu::uArch::uArchMap::instance().get(targetDeviceNameStr);
Do the XeVM targets have a sensible notion of an LLVM triple alongside that of chip? How about (LLVM) features? If so, it would probably make sense to have a #xevm.target attribute, a la the one proposed above, which implements the LLVMTargetAttrInterface from this WIP upstream PR:
https://github.com/llvm/llvm-project/pull/145899/files#diff-6c2503d165a7390c955c3c4fa4a76fd1991633b5ec597a1b1fd92731e1f3684dR572
I am in the process of making the #nvvm.target and #rocdl.target attributes implement this interface (in addition to the generic #llvm.target attr in that PR). If sensible, the XeVM dialect should probably (try to) mirror this.
Given this in https://github.com/llvm/llvm-project/pull/147372/files#diff-3da7b8747032bb581f119fd49037d06706443c3c5db56d4d28cc74053a9b754dR284 in the Add xevm-attach-target transform pass PR, I take it there's a sensible triple.
I will have a look at making xevm.target also implement my interface.
Thanks Chao, Rolf, Artem.
You are right; once the XeVM target attribute gets upstreamed, we should use it. This pass is here just to show the use case.
👍
On that note, I found that the #xevm.target attr itself already got merged: https://github.com/llvm/llvm-project/blob/b9d7513bf134febe72c05a04ff20f87191d7213a/mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td#L521
We can get cracking!
dpas:
  instruction_name: dpas
  instruction_description: "Dot Product Accumulate Systolic (DPAS) is a matrix multiply-add operation"
  instruction_opcode: "0x83"
Why opcode?
I added some extra fields in the structs in case they become useful in the future; they are not used in our current use case. I can remove them if we don't think they are necessary.
// no_of_dims: 2
// dim: [2, 2]
// This represents a 2x2 tile
struct Tile {
Tile is an overloaded term. It looks to me like just a shape. Is there any pre-defined data type you can reuse?
Sorry, Jianhui. I am not using it in the current version of the implementation. I kept it in case we want to use it in some fashion later.
The current implementation uses std::vector.
@mshahneo Looks like we should remove the older version of the interface and keep the one we want to proceed with.
//   - [2, 16]
// This represents a 2x2 RangeTile where the first dimension can have values
// from 1 to 32 and the second dimension can have values from 2 to 16
struct RangeTile {
Why not just use DiscreteTile, if this is not used that often?
Kept two different ways to specify the range.
// dims:
//   - [1, 2, 4, 8, 16, 32]
//   - [2, 4, 8, 16]
// This represents a 2x2 DiscreteTile where the first dimension can have values
It seems that DiscreteTile represents shapes combined from lists of sizes. How about naming them ShapeCombination or something more specific to our problem and less general?
That makes sense. The current implementation actually uses std::vector.
std::tuple<Args...> data;
std::function<bool(Args...)> func;

Restriction(Args... args, std::function<bool(Args...)> f)
The term "restriction" is also very generic. Please consider making it more specific. Validate that the instance belongs to a combination?
Restriction is supposed to model all restrictions. A common validate() could just apply all of them. However, we can always go back to specific validations. I'll think about this.
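A sketch of how such a generic validate() could work on top of the quoted Restriction fields, using std::apply to feed the stored arguments to the predicate. This completes the idea for illustration only; the PR's final shape may differ.

```cpp
#include <functional>
#include <tuple>

template <typename... Args>
struct Restriction {
  std::tuple<Args...> data;
  std::function<bool(Args...)> func;

  Restriction(Args... args, std::function<bool(Args...)> f)
      : data(args...), func(std::move(f)) {}

  // A common validate() applies the stored predicate to the stored data.
  bool validate() const { return std::apply(func, data); }
};

// Usage: a divisibility restriction over two sizes (illustrative values).
Restriction<unsigned, unsigned> evenDivide(
    16u, 4u, [](unsigned parent, unsigned tile) { return parent % tile == 0; });
bool ok = evenDivide.validate(); // true
```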
  return {1, 2, 3, 4, 5, 6, 7, 8};
}

std::vector<uint32_t> DPASInstruction::getSupportedK(mlir::Type type) {
I am a bit confused. I thought you would read the information from DLTI, not hard-code it inside the code.
These hardware properties are static, so I'd say they should be hardcoded somewhere.
Whether you extract them from LLVM, inside a DLTI query function, or in a separate helper like this (or IMEX's target info) is a matter of what's most convenient.
I don't think we should be dumping the whole hardware specification into IR as DLTI attributes.
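As an illustration of keeping such static facts in code, here is a hedged sketch of a hard-coded K query that derives K from the ops-per-channel table quoted earlier in this thread (systolic depth 8). The exact numbers illustrate the mechanism, not a spec, and the function is standalone rather than the PR's member function.

```cpp
#include <cstdint>
#include <vector>

// K = systolic_depth * ops_per_channel, where ops_per_channel depends on the
// element bit width (values taken from the ops_per_channel map cited above).
std::vector<uint32_t> getSupportedK(unsigned elemBitWidth) {
  const uint32_t systolicDepth = 8;
  uint32_t opsPerChannel = 0;
  switch (elemBitWidth) {
  case 19: opsPerChannel = 1; break; // 19-bit element (e.g. tf32), per the map
  case 16: opsPerChannel = 2; break; // fp16 / bf16
  case 8:  opsPerChannel = 4; break; // int8
  case 4:
  case 2:  opsPerChannel = 8; break; // int4 / int2
  default: return {};                // unsupported element type
  }
  return {systolicDepth * opsPerChannel};
}
```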
llvm::errs() << "No parent module op.\n";

// If target device info is not attached, skip the target-specific checks.
auto targetDeviceNameAttr = dlti::query(moduleOp, {"GPU", "name"});
We should discuss where to check the uArch info. The uArch info could be checked in the lowering passes, and the op definition should check general constraints (like a.m = c.m).
Actually, the way the uArch info is set up, once the target is attached it can be queried from anywhere (pass and dialect).
---
# The target architecture name.
arch: intel_gpu_pvc
How is this file being loaded?
That's an older artifact, we shouldn't have yaml at all.
instruction_description: "2D Block Load instruction reads a rectangular block of memory in to GRF."
instruction_opcode: nan
instruction_functional_unit: load
instruction_type: simd
General: I don't think the yaml should be written following the terminology in the internal uArch spec for PVC (like simd, systolic_depth, repeat_count, vnni, ...). Instead, it should be written following the SPIR-V extension (or the OCL C function): no mention of simd, use m/n/k and packing.
Hi Jianhui,
We are not using the YAML in the current version.
Update:
After the first iteration of review comments, the current PR is updated in the following way:
-- Full uArch definition, utilities and all (at least for one instruction for now, DPAS).
-- Definition for 2 uArchs (PVC, BMG).
-- A pass to attach the uArch name to the module attribute using DLTI.
-- Shows the usage of the uArch interface to verify XeGPU ops.
-- A few test cases.
P.S. I still kept some old code intentionally, in case we want to go back or use some of it. I'll clean up once we agree on the process.
================================
Hi All,
This is the initial PR for the uArch definition. The primary purpose of this PR is to have a discussion and finalize the design, so please think of it as an RFC as well.
The first target we want to achieve here is to finalize the skeleton:
Due to this purpose, the main focus points of this PR are the following files:
- mlir/lib/Dialect/XeGPU/Utils/intel_gpu_pvc.yaml: holds the YAML definition of the uArch (PVC is used for this use case)
- mlir/include/mlir/Dialect/XeGPU/Utils/uArch.h: holds the base uArch struct that can be re-used by specialized uArchs such as PVC.
- mlir/include/mlir/Dialect/XeGPU/Utils/IntelGpuPVC.h: holds the PVC-specific structs.
- mlir/lib/Dialect/XeGPU/Utils/IntelGpuPVC.cpp: added as a placeholder for YAML mapping. This file is intentionally not complete. We want to reach a consensus on the YAML and C++ structures before filling this one in.
Please provide your valuable feedback as we try to finalize this together.