
Commit dd7e0bc

[perf]: using NZ optimization for quantized GMM
1 parent 5cf9ff1 commit dd7e0bc


vllm_ascend/quantization/w8a8_dynamic.py

Lines changed: 7 additions & 0 deletions
@@ -663,6 +663,13 @@ def process_weights_after_loading(self, layer):
             1, 2).contiguous()
         layer.w2_weight.data = layer.w2_weight.data.transpose(
             1, 2).contiguous()
+        # This optimization relies on the modifications in torch_npu; otherwise
+        # accuracy problems will happen. But we can still evaluate the inference
+        # speed by transforming the weights to NZ format (29).
+        layer.w13_weight.data = torch_npu.npu_format_cast(
+            layer.w13_weight.data, 29)
+        layer.w2_weight.data = torch_npu.npu_format_cast(
+            layer.w2_weight.data, 29)
         layer.w13_weight_scale.data = layer.w13_weight_scale.data.view(
             layer.w13_weight_scale.data.shape[0], -1)
         layer.w13_weight_offset.data = layer.w13_weight_offset.data.view(
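
For context on the change: torch_npu.npu_format_cast(tensor, 29) converts a tensor's on-device storage from the default ND layout to Ascend's FRACTAL_NZ layout (ACL format enum 29), so grouped-matmul (GMM) kernels can read the weights without re-tiling them on every forward pass. Below is a minimal, hypothetical sketch of performing this cast once at load time; the tensor shape and the ACL_FORMAT_FRACTAL_NZ alias are illustrative only, and, as the diff's comment warns, correct numerics depend on a matching torch_npu build.

import torch
import torch_npu  # Ascend NPU backend for PyTorch

# Alias for readability; the diff passes the raw literal 29.
ACL_FORMAT_FRACTAL_NZ = 29

# Dummy int8 expert weights shaped like a grouped-matmul operand
# (num_experts, n, k); real values come from the quantized checkpoint.
w = torch.randint(-128, 128, (8, 1024, 512), dtype=torch.int8).npu()

# Cast once after loading; subsequent GMM calls consume the NZ layout
# directly instead of converting the ND weights on the fly.
w_nz = torch_npu.npu_format_cast(w, ACL_FORMAT_FRACTAL_NZ)

# torch_npu can report the internal format of an NPU tensor.
print(torch_npu.get_npu_format(w_nz))  # expected: 29

Doing the cast inside process_weights_after_loading keeps it a one-time cost at startup rather than a per-step cost during inference.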
