
Commit 3640c60

Avoid unfused Transpose in DeepSeekV3 EP256 MoE layer (#1091)
### What this PR does / why we need it?
View optimization in torchair (enabled by default for any Transpose with an axis of size 1) prevents the weight Transpose from being fused with the later GroupedMatmul, which degrades MoE layer performance when expert parallelism equals the total number of experts (e.g. EP256 for DSKv3). Add an option to solve this problem by disabling the optimization.

### Does this PR introduce _any_ user-facing change?
Controlled by `additional_config.torchair_graph_config.enable_view_optimize`, which defaults to `True`.

### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
1 parent 8d00775 commit 3640c60

File tree

4 files changed: +7, -0 lines changed


docs/source/user_guide/additional_config.md

Lines changed: 1 addition & 0 deletions

@@ -38,6 +38,7 @@ The details of each config option are as follows:
 | Name | Type | Default | Description |
 | ---- | ---- | ------- | ----------- |
 | `enabled` | bool | `False` | Whether to enable torchair graph mode |
+| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
 | `use_cached_graph` | bool | `False` | Whether to use cached graph |
 | `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
 | `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
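For reference, a minimal usage sketch of the option documented above, assuming the `additional_config` plumbing that vllm-ascend exposes through vLLM's engine arguments; the model path and parallel settings are placeholders, not part of this commit. Leaving the key out keeps the previous behaviour, since the option defaults to `True`.

```python
from vllm import LLM

# Hypothetical offline-inference setup; only the additional_config dict
# illustrates the new knob. Torchair graph mode stays on while the view
# optimization is turned off, e.g. for an EP256 DeepSeekV3 deployment.
llm = LLM(
    model="path/to/deepseek-v3",   # placeholder model path
    tensor_parallel_size=16,       # placeholder parallel setup
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "enable_view_optimize": False,
        },
    },
)
```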

vllm_ascend/ascend_config.py

Lines changed: 2 additions & 0 deletions

@@ -55,6 +55,8 @@ def __init__(self, torchair_graph_config):
             "graph_batch_sizes_init", False)
         self.enable_multistream_shared_expert = torchair_graph_config.get(
             "enable_multistream_shared_expert", False)
+        self.enable_view_optimize = torchair_graph_config.get(
+            "enable_view_optimize", True)
 
         if not isinstance(self.graph_batch_sizes, list):
             raise TypeError("graph_batch_sizes must be list[int]")
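A minimal sketch (not the actual vllm_ascend config class) of the `dict.get`-with-default pattern used above, showing that an absent key leaves view optimization enabled; the class name below is hypothetical.

```python
# Standalone illustration of the defaulting behaviour added in
# vllm_ascend/ascend_config.py; class name is hypothetical.
class TorchairGraphConfigSketch:

    def __init__(self, torchair_graph_config: dict):
        # Missing key -> True: view optimization stays on unless the user
        # explicitly disables it.
        self.enable_view_optimize = torchair_graph_config.get(
            "enable_view_optimize", True)


print(TorchairGraphConfigSketch({}).enable_view_optimize)  # True
print(TorchairGraphConfigSketch(
    {"enable_view_optimize": False}).enable_view_optimize)  # False
```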

vllm_ascend/worker/model_runner.py

Lines changed: 2 additions & 0 deletions

@@ -1037,6 +1037,8 @@ def load_model(self) -> None:
             config = torchair.CompilerConfig()
             config.experimental_config.frozen_parameter = True
             config.experimental_config.tiling_schedule_optimize = True
+            config.experimental_config.enable_view_optimize = \
+                get_ascend_config().torchair_graph_config.enable_view_optimize
             torch.npu.set_compile_mode(jit_compile=False)
             if not self.use_cached_npu_graph:
                 npu_backend = torchair.get_npu_backend(compiler_config=config)
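To show where this setting lands, a hedged end-to-end sketch of the torchair compile path touched above: it assumes a torch_npu/torchair installation on an Ascend host, the module is a placeholder, and `enable_view_optimize` is hard-coded here instead of being read from `get_ascend_config()`.

```python
import torch
import torch.nn as nn
import torchair


class TinyMoEWeightUser(nn.Module):
    """Placeholder module; stands in for the real model in load_model()."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.transpose(0, 1) @ x


config = torchair.CompilerConfig()
config.experimental_config.frozen_parameter = True
config.experimental_config.tiling_schedule_optimize = True
# The new knob: disable view optimization so the weight Transpose can be
# fused with the subsequent GroupedMatmul (relevant for EP256 MoE layers).
config.experimental_config.enable_view_optimize = False

torch.npu.set_compile_mode(jit_compile=False)
npu_backend = torchair.get_npu_backend(compiler_config=config)
compiled_model = torch.compile(TinyMoEWeightUser(), backend=npu_backend)
```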

vllm_ascend/worker/model_runner_v1.py

Lines changed: 2 additions & 0 deletions

@@ -1286,6 +1286,8 @@ def _get_torchair_lazy_compiled_model(self, batch_size: int):
             config = torchair.CompilerConfig()
             config.experimental_config.frozen_parameter = True
             config.experimental_config.tiling_schedule_optimize = True
+            config.experimental_config.enable_view_optimize = \
+                get_ascend_config().torchair_graph_config.enable_view_optimize
             torch.npu.set_compile_mode(jit_compile=False)
             if not self.use_cached_npu_graph:
                 npu_backend = torchair.get_npu_backend(compiler_config=config)
