[perf][WIP]: using NZ optimization for quantized GMM #906

linfeng-yuan · 2025-05-20T11:41:48Z

This PR increases decode inference speed by applying the NZ optimization to GroupedMatmul. Note that this optimization relies on a torch_npu adaptation that has not been released yet, so it is better to hold this change until a new torch_npu version is released.

Signed-off-by: linfeng-yuan <1102311262@qq.com>

wangxiyuan · 2025-05-22T10:52:31Z

If it's ready for review, please remove the WIP prefix.

github-actions · 2025-06-23T14:05:50Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot added the module:quantization label May 20, 2025

linfeng-yuan force-pushed the gmm_nz branch 3 times, most recently from dd7e0bc to 7f566a2 on May 21, 2025 06:40

[perf]: using NZ optimization for quantized GMM

99fbe2c

Signed-off-by: linfeng-yuan <1102311262@qq.com>

linfeng-yuan force-pushed the gmm_nz branch from 7f566a2 to 99fbe2c on May 21, 2025 06:42

wangxiyuan mentioned this pull request Jun 4, 2025

[release] 0.9.0rc1 release checklist #904

Open

76 tasks

github-actions bot added the merge-conflicts label Jun 23, 2025
