
Commit 4270682

Waiting for BMM NZ support (improve TPOT by 2 ms) (#1131)
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to NZ format, because at this position they are fused into transposebatchmatmul, which does not support NZ. As a result, the weights were actually being converted back to ND on every run.

### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79 ms to 88.58 ms, a gain of about 2 ms.

### How was this patch tested?
Tested using #1101.

---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
1 parent 0d2074a commit 4270682

File tree

1 file changed

+4
-2
lines changed


vllm_ascend/attention/mla_v1.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -648,8 +648,10 @@ def get_and_maybe_dequant_weights(layer: LinearBase):
         self.W_UV = W_UV.transpose(0, 1).contiguous()
         # Convert from (L, N, P) to (N, P, L)
         self.W_UK_T = W_UK.permute(1, 2, 0).contiguous()
-        self.W_UV.data = torch_npu.npu_format_cast(self.W_UV.data, 29)
-        self.W_UK_T.data = torch_npu.npu_format_cast(self.W_UK_T.data, 29)
+
+        # Waiting for BMM NZ support
+        # self.W_UV.data = torch_npu.npu_format_cast(self.W_UV.data, 29)
+        # self.W_UK_T.data = torch_npu.npu_format_cast(self.W_UK_T.data, 29)
 
     def _compute_prefill_context(
         self,
```
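The diff above disables an unconditional `npu_format_cast(..., 29)` (format id 29 is the NZ layout) because the downstream batched matmul cannot consume NZ, forcing an NZ-to-ND round trip on every step. A minimal sketch of that decision, with hypothetical names (`maybe_cast_to_nz`, `bmm_supports_nz` are illustrative, not vllm-ascend APIs):

```python
# Format id used by the diff for torch_npu.npu_format_cast; NZ is the
# Ascend-friendly tiled weight layout.
ACL_FORMAT_FRACTAL_NZ = 29

def maybe_cast_to_nz(weight, bmm_supports_nz, format_cast=None):
    """Cast a weight to NZ only when the fused batched matmul can consume
    that layout; otherwise keep ND so no per-step round trip occurs.

    `format_cast` stands in for torch_npu.npu_format_cast so the sketch
    runs without NPU hardware.
    """
    if not bmm_supports_nz or format_cast is None:
        return weight, "ND"
    return format_cast(weight, ACL_FORMAT_FRACTAL_NZ), "NZ"

# Current state of the kernel (no BMM NZ support): the cast is skipped.
_, layout = maybe_cast_to_nz(weight=object(), bmm_supports_nz=False)
print(layout)  # ND

# If/when the fused BMM gains NZ support, the cast would be re-enabled.
_, layout = maybe_cast_to_nz(
    weight=object(),
    bmm_supports_nz=True,
    format_cast=lambda w, fmt: w,  # stand-in for npu_format_cast
)
print(layout)  # NZ
```

This mirrors the commit's intent: keep the weights in ND until the fused transposebatchmatmul kernel supports NZ, at which point the two commented-out casts can be restored.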
