[CPU] Introduce GatherMatmul operation to optimize MoE pattern #32450
Conversation
2. initMoE2GeMMSubgraph builder is moved to a separate file
3. initMoE3GeMMSubgraph
Pull Request Overview
This PR introduces the GatherMatmul operation to optimize Mixture of Experts (MoE) patterns in the CPU plugin. The implementation performs GEMV operations over active experts using oneDNN's inner_product primitive, with support for both standard and compressed weights configurations.
Key changes:
- Adds GatherMatmul node implementation with oneDNN-based execution for both GEMV and GEMM modes
- Implements pattern matchers (MoE2GeMM and MoE3GeMM) to detect and transform MoE subgraphs
- Extends weight decompression infrastructure to support batched (3D) weight tensors
- Introduces CompressedWeightsBlock pattern block to share weight decompression logic across operations (a sketch of the decompression subgraph it matches follows this list)
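For context, the compressed-weights subgraph such a block matches is the standard Multiply(Subtract(Convert(W), zero_point), scale) chain. Below is a minimal sketch of building that chain with the public OpenVINO op API; the helper name, shapes, and quantization values are illustrative assumptions, with a leading experts dimension E to reflect the batched (3D) case this PR enables:

```cpp
#include <memory>
#include <vector>

#include "openvino/core/node.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/convert.hpp"
#include "openvino/op/multiply.hpp"
#include "openvino/op/subtract.hpp"

// Illustrative only: f32_W = (Convert(u8_W) - zero_point) * scale,
// with a leading experts dimension E for the batched (3D) case.
std::shared_ptr<ov::Node> make_decompressed_weights(size_t E, size_t N, size_t K) {
    auto w = ov::op::v0::Constant::create(
        ov::element::u8, {E, N, K}, std::vector<uint8_t>(E * N * K, 128));
    auto cvt = std::make_shared<ov::op::v0::Convert>(w, ov::element::f32);
    // Per-output-channel zero point and scale, broadcast over K.
    auto zp = ov::op::v0::Constant::create(ov::element::f32, {E, N, 1}, {128.0f});
    auto sub = std::make_shared<ov::op::v1::Subtract>(cvt, zp);
    auto scale = ov::op::v0::Constant::create(ov::element::f32, {E, N, 1}, {0.01f});
    return std::make_shared<ov::op::v1::Multiply>(sub, scale);
}
```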
Reviewed Changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/tests/functional/plugin/shared/src/subgraph/weights_decompression_builders.cpp | Updated to support batched weight decompression with seed parameter and optional transpose control |
| src/tests/functional/plugin/shared/src/subgraph/moe_builders.cpp | New MoE test graph builders for 2GEMM and 3GEMM patterns with weight decompression support |
| src/plugins/intel_cpu/src/nodes/gathermatmul.cpp | Core implementation of GatherMatmul node with oneDNN inner_product backend |
| src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/convert_moe_matmuls.cpp | Pattern matchers to detect and replace MoE patterns with BatchGatherMatmul operations |
| src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/batch_gather_matmul*.cpp | New internal operations for batch gather matmul with and without compression |
| src/common/transformations/src/transformations/op_conversions/convert_fc_to_compressed.cpp | Refactored to extract reusable weight processing logic into a static method |
| src/core/include/openvino/pass/pattern/op/block_util.hpp | Updated FOR_EACH macros to support passing the block pointer as a parameter |
Details:
In this PR we introduce yet another operation, GatherMatmul, which essentially performs GEMV operations over the current tokens and the active experts.
As a first step, we perform the GEMV using dnnl::inner_product. This solution is obviously suboptimal: it does not give fine-grained control over parallelization, and when many tokens are processed by a specific expert (prefill), a GEMM may be more efficient, since the tokens can be batched and SIMD-level parallelization can also be applied across tokens.
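For illustration, here is a minimal, self-contained sketch of a single GEMV step through oneDNN's inner_product (v3.x C++ API), assuming plain f32 weights and made-up sizes; this is not the node's actual integration, just the primitive usage the description refers to:

```cpp
#include <dnnl.hpp>
#include <vector>

// Minimal sketch: y[1 x N] = x[1 x K] * W^T, with W stored as [N x K],
// which is the layout dnnl::inner_product expects for weights (oi tag).
int main() {
    using namespace dnnl;
    const memory::dim K = 128, N = 256;  // hypothetical hidden sizes

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc src_md({1, K}, memory::data_type::f32, memory::format_tag::nc);
    memory::desc wei_md({N, K}, memory::data_type::f32, memory::format_tag::oi);
    memory::desc dst_md({1, N}, memory::data_type::f32, memory::format_tag::nc);

    std::vector<float> x(K, 1.0f), w(N * K, 0.5f), y(N);
    memory src_mem(src_md, eng, x.data());
    memory wei_mem(wei_md, eng, w.data());
    memory dst_mem(dst_md, eng, y.data());

    // One inner_product call per (token, active expert) pair --
    // this is the GEMV step the description refers to.
    auto pd = inner_product_forward::primitive_desc(
        eng, prop_kind::forward_inference, src_md, wei_md, dst_md);
    inner_product_forward(pd).execute(
        strm,
        {{DNNL_ARG_SRC, src_mem}, {DNNL_ARG_WEIGHTS, wei_mem}, {DNNL_ARG_DST, dst_mem}});
    strm.wait();
    return 0;
}
```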
This PR also contains all the essential transformations needed to enable a few common MoE patterns.
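For orientation, a bare-bones MatcherPass skeleton of the kind these transformations build on is sketched below; the class name and the trivial MatMul pattern are hypothetical stand-ins, since the real MoE2GeMM/MoE3GeMM matchers anchor on the full expert subgraph:

```cpp
#include <memory>

#include "openvino/op/matmul.hpp"
#include "openvino/pass/matcher_pass.hpp"
#include "openvino/pass/pattern/matcher.hpp"
#include "openvino/pass/pattern/op/label.hpp"
#include "openvino/pass/pattern/op/wrap_type.hpp"

// Hypothetical skeleton -- not the PR's actual matcher.
class ConvertMoEMatMulsSketch : public ov::pass::MatcherPass {
public:
    OPENVINO_RTTI("ConvertMoEMatMulsSketch");
    ConvertMoEMatMulsSketch() {
        using namespace ov::pass::pattern;
        auto activations = any_input();
        auto weights = any_input();
        // A real MoE matcher anchors on the whole routed-expert subgraph,
        // not a single MatMul.
        auto matmul = wrap_type<ov::op::v0::MatMul>({activations, weights});

        ov::matcher_pass_callback callback = [=](Matcher& m) {
            auto root = m.get_match_root();
            // ... build the internal BatchGatherMatmul op here and call
            // ov::replace_node(root, new_op);
            return true;
        };
        register_matcher(std::make_shared<Matcher>(matmul, "ConvertMoEMatMulsSketch"),
                         callback);
    }
};
```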
MoE pattern matcher is based on #32183
Related oneDNN fork PR: openvinotoolkit/oneDNN#292
Tickets: