[CPU] Introduce GatherMatmul operation to optimize MoE pattern #32450
Conversation
2. initMoE2GeMMSubgraph builder is moved to a separate file
3. initMoE3GeMMSubgraph
Pull Request Overview
This PR introduces the GatherMatmul operation to optimize Mixture of Experts (MoE) patterns in the CPU plugin. The implementation performs GEMV operations over active experts using oneDNN's inner_product primitive, with support for both standard and compressed weights configurations.
Key changes:
- Adds GatherMatmul node implementation with oneDNN-based execution for both GEMV and GEMM modes
- Implements pattern matchers (MoE2GeMM and MoE3GeMM) to detect and transform MoE subgraphs
- Extends weight decompression infrastructure to support batched (3D) weight tensors
- Introduces CompressedWeightsBlock pattern block to share weight decompression logic across operations (a sketch of the decompression subgraph it matches follows this list)
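For context, the compressed-weights subgraph such a block matches is the standard Multiply(Subtract(Convert(W), zero_point), scale) chain. Below is a minimal sketch of building that chain with the public OpenVINO op API; the helper name, shapes, and quantization values are illustrative assumptions, with a leading experts dimension E to reflect the batched (3D) case this PR enables:

```cpp
#include <memory>
#include <vector>

#include "openvino/core/node.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/convert.hpp"
#include "openvino/op/multiply.hpp"
#include "openvino/op/subtract.hpp"

// Illustrative only: f32_W = (Convert(u8_W) - zero_point) * scale,
// with a leading experts dimension E for the batched (3D) case.
std::shared_ptr<ov::Node> make_decompressed_weights(size_t E, size_t N, size_t K) {
    auto w = ov::op::v0::Constant::create(
        ov::element::u8, {E, N, K}, std::vector<uint8_t>(E * N * K, 128));
    auto cvt = std::make_shared<ov::op::v0::Convert>(w, ov::element::f32);
    // Per-output-channel zero point and scale, broadcast over K.
    auto zp = ov::op::v0::Constant::create(ov::element::f32, {E, N, 1}, {128.0f});
    auto sub = std::make_shared<ov::op::v1::Subtract>(cvt, zp);
    auto scale = ov::op::v0::Constant::create(ov::element::f32, {E, N, 1}, {0.01f});
    return std::make_shared<ov::op::v1::Multiply>(sub, scale);
}
```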
Reviewed Changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/tests/functional/plugin/shared/src/subgraph/weights_decompression_builders.cpp | Updated to support batched weight decompression with seed parameter and optional transpose control |
| src/tests/functional/plugin/shared/src/subgraph/moe_builders.cpp | New MoE test graph builders for 2GEMM and 3GEMM patterns with weight decompression support |
| src/plugins/intel_cpu/src/nodes/gathermatmul.cpp | Core implementation of GatherMatmul node with oneDNN inner_product backend |
| src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/convert_moe_matmuls.cpp | Pattern matchers to detect and replace MoE patterns with BatchGatherMatmul operations |
| src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/batch_gather_matmul*.cpp | New internal operations for batch gather matmul with and without compression |
| src/common/transformations/src/transformations/op_conversions/convert_fc_to_compressed.cpp | Refactored to extract reusable weight processing logic into a static method |
| src/core/include/openvino/pass/pattern/op/block_util.hpp | Updated FOR_EACH macros to support passing the block pointer as a parameter |
Details:
In this PR we introduce yet another operation, GatherMatmul, which essentially performs GEMV operations over the current tokens and the active experts.
As a first step, we perform the GEMV using dnnl::inner_product. This solution is obviously suboptimal: it does not give fine-grained control over parallelization, and when many tokens are processed by a specific expert (prefill), a GEMM may be more efficient, since the tokens can be batched and SIMD-level parallelization can also be applied across tokens.
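For illustration, here is a minimal, self-contained sketch of a single GEMV step through oneDNN's inner_product (v3.x C++ API), assuming plain f32 weights and made-up sizes; this is not the node's actual integration, just the primitive usage the description refers to:

```cpp
#include <dnnl.hpp>
#include <vector>

// Minimal sketch: y[1 x N] = x[1 x K] * W^T, with W stored as [N x K],
// which is the layout dnnl::inner_product expects for weights (oi tag).
int main() {
    using namespace dnnl;
    const memory::dim K = 128, N = 256;  // hypothetical hidden sizes

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc src_md({1, K}, memory::data_type::f32, memory::format_tag::nc);
    memory::desc wei_md({N, K}, memory::data_type::f32, memory::format_tag::oi);
    memory::desc dst_md({1, N}, memory::data_type::f32, memory::format_tag::nc);

    std::vector<float> x(K, 1.0f), w(N * K, 0.5f), y(N);
    memory src_mem(src_md, eng, x.data());
    memory wei_mem(wei_md, eng, w.data());
    memory dst_mem(dst_md, eng, y.data());

    // One inner_product call per (token, active expert) pair --
    // this is the GEMV step the description refers to.
    auto pd = inner_product_forward::primitive_desc(
        eng, prop_kind::forward_inference, src_md, wei_md, dst_md);
    inner_product_forward(pd).execute(
        strm,
        {{DNNL_ARG_SRC, src_mem}, {DNNL_ARG_WEIGHTS, wei_mem}, {DNNL_ARG_DST, dst_mem}});
    strm.wait();
    return 0;
}
```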
This PR also contains all the essential transformations needed to enable a few common MoE patterns.
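For orientation, a bare-bones MatcherPass skeleton of the kind these transformations build on is sketched below; the class name and the trivial MatMul pattern are hypothetical stand-ins, since the real MoE2GeMM/MoE3GeMM matchers anchor on the full expert subgraph:

```cpp
#include <memory>

#include "openvino/op/matmul.hpp"
#include "openvino/pass/matcher_pass.hpp"
#include "openvino/pass/pattern/matcher.hpp"
#include "openvino/pass/pattern/op/label.hpp"
#include "openvino/pass/pattern/op/wrap_type.hpp"

// Hypothetical skeleton -- not the PR's actual matcher.
class ConvertMoEMatMulsSketch : public ov::pass::MatcherPass {
public:
    OPENVINO_RTTI("ConvertMoEMatMulsSketch");
    ConvertMoEMatMulsSketch() {
        using namespace ov::pass::pattern;
        auto activations = any_input();
        auto weights = any_input();
        // A real MoE matcher anchors on the whole routed-expert subgraph,
        // not a single MatMul.
        auto matmul = wrap_type<ov::op::v0::MatMul>({activations, weights});

        ov::matcher_pass_callback callback = [=](Matcher& m) {
            auto root = m.get_match_root();
            // ... build the internal BatchGatherMatmul op here and call
            // ov::replace_node(root, new_op);
            return true;
        };
        register_matcher(std::make_shared<Matcher>(matmul, "ConvertMoEMatMulsSketch"),
                         callback);
    }
};
```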
MoE pattern matcher is based on #32183
Related oneDNN fork PR: openvinotoolkit/oneDNN#292
Tickets: