Commit db4f7a3

Tweak KleidiAI's FP16 matmul algorithm (pytorch#4416)

Summary:
Pull Request resolved: pytorch#4416

X-link: facebookresearch/FBGEMM#1488

Hoist memory loads out of the outer loop.

The intention is to prevent these loads from displacing cache lines that may contain matrix data. The loads are also likely to incur cache misses after the first iteration, because executing the inner loop will probably fill the cache with matrix data.

Benchmarks repeatedly show a throughput improvement of around 1%.

before: P1854747253
after: P1854747141

Reviewed By: YifanYuan3

Differential Revision: D77459967

fbshipit-source-id: 01eb4fc004ba055823551d843f7bf7728caa74a8
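The hoisting described in the summary can be illustrated with a minimal C++ sketch. This is not the code from KleidiAIFP16UKernelsNeon.cc. `MatmulParams`, `matmul_reload`, `matmul_hoisted`, and `run_example` are hypothetical names, and the sketch uses a plain scalar float matmul rather than an FP16 NEON microkernel. It only shows the general idea: values that do not change across iterations are read once before the outer loop instead of being re-read from memory on each pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter block, assumed for illustration only.
struct MatmulParams {
  const float* a;       // m x k, row-major
  const float* b;       // k x n, row-major
  float* c;             // m x n, row-major
  std::size_t m, n, k;
  float beta;           // scale applied to the existing contents of c
};

// Baseline: fields are read through the pointer inside the loops. Because the
// store to c could alias p->beta, the compiler may have to reload it.
void matmul_reload(const MatmulParams* p) {
  assert(p != nullptr);
  for (std::size_t i = 0; i < p->m; ++i) {
    for (std::size_t j = 0; j < p->n; ++j) {
      float acc = 0.0f;
      for (std::size_t l = 0; l < p->k; ++l) {
        acc += p->a[i * p->k + l] * p->b[l * p->n + j];
      }
      p->c[i * p->n + j] = p->beta * p->c[i * p->n + j] + acc;
    }
  }
}

// Hoisted: every loop-invariant field is loaded once, before the outer loop,
// so the loops themselves only touch matrix data.
void matmul_hoisted(const MatmulParams* p) {
  assert(p != nullptr);
  const float* a = p->a;
  const float* b = p->b;
  float* c = p->c;
  const std::size_t m = p->m, n = p->n, k = p->k;
  const float beta = p->beta;
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t l = 0; l < k; ++l) {
        acc += a[i * k + l] * b[l * n + j];
      }
      c[i * n + j] = beta * c[i * n + j] + acc;
    }
  }
}

// Runs one kernel on a small fixed 2x3 by 3x2 problem and returns c.
std::vector<float> run_example(void (*kernel)(const MatmulParams*)) {
  std::vector<float> a = {1, 2, 3, 4, 5, 6};
  std::vector<float> b = {1, 2, 3, 4, 5, 6};
  std::vector<float> c = {2, 2, 2, 2};
  MatmulParams p{a.data(), b.data(), c.data(), 2, 2, 3, 0.5f};
  kernel(&p);
  return c;
}
```

Both functions compute the same result, so `run_example(matmul_reload)` and `run_example(matmul_hoisted)` should return identical vectors. Whether the hoisted form is measurably faster depends on the compiler, the target, and the cache behaviour of the real kernel, so the roughly 1% figure above applies only to the actual change, not to this sketch.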

1 parent 3f2cd6e commit db4f7a3

1 file changed: +229 -198 lines changed

src
- KleidiAIFP16UKernelsNeon.cc

