Description
The i8mm lowering for some vector.contract ops is currently functionally correct, but there is room for improvement performance-wise. Looking at the generated asm for an mmt4d with 2x2x8 innermost tile sizes, we get:
1470: 6e180483 mov v3.d[1], v4.d[0]
1474: 4e006204 tbl v4.16b, { v16.16b, v17.16b, v18.16b, v19.16b }, v0.16b
1478: 4e84a462 smmla v2.4s, v3.16b, v4.16b
147c: 6e024041 ext v1.16b, v2.16b, v2.16b, #0x8
What catches my attention is the mov instruction, esp. the indexing from 1 to 0, as well as the tbl and ext instructions. This may not seem like a big deal, but the problem gets much worse with larger tile sizes: we observed long sequences of mov and ext instructions all over the place.
We should investigate what is going on and try to fix the problem. My suspicion is that this zero initialization and insertion for vecmat cases might be behind some of these instructions. We should check whether using llvm.undef fixes part of the problem.
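To make the vecmat suspicion a bit more concrete, here is a small Python reference model of what smmla computes. It is only a sketch: the helper vecmat_row and the test values are hypothetical, Python integers do not wrap at 32 bits, and it does not model the actual lowering. It illustrates two things: the low and high 64-bit halves of each 16-byte operand are the two rows of a 2x8 tile (which is why d[0] and d[1] show up in the mov), and for a vecmat case the contents of the padding row cannot affect the result row that is kept. Whether replacing the zero initialization with llvm.undef actually removes the extra instructions still has to be checked on the generated asm.

```python
def smmla(acc, vn, vm):
    """Reference model of SMMLA Vd.4S, Vn.16B, Vm.16B.

    vn and vm each hold a 2x8 row-major tile of signed 8-bit values:
    elements 0-7 are row 0 (the d[0] half), elements 8-15 are row 1
    (the d[1] half). acc is a 2x2 row-major tile of accumulators.
    Computes acc += vn * transpose(vm). No 32-bit wraparound is modeled.
    """
    assert len(acc) == 4 and len(vn) == 16 and len(vm) == 16
    out = list(acc)
    for i in range(2):
        for j in range(2):
            out[2 * i + j] += sum(vn[8 * i + k] * vm[8 * j + k] for k in range(8))
    return out


def vecmat_row(lhs_row, rhs, padding):
    """Hypothetical vecmat helper: pad a 1x8 LHS to a 2x8 tile and keep
    only result row 0, which is the part a vecmat actually needs."""
    vn = list(lhs_row) + list(padding)
    return smmla([0, 0, 0, 0], vn, rhs)[:2]


lhs_row = [1, -2, 3, -4, 5, -6, 7, -8]
rhs = list(range(16))  # two 8-element rows: 0..7 and 8..15

# Zero padding versus arbitrary ("undefined") padding in LHS row 1.
zero_padded = vecmat_row(lhs_row, rhs, [0] * 8)
junk_padded = vecmat_row(lhs_row, rhs, [127, -128, 5, 9, -1, 33, -77, 2])

# Row 0 of the result only reads LHS row 0, so the padding is irrelevant.
assert zero_padded == junk_padded
```

Since result row 0 never reads the padding row, leaving that row uninitialized is semantically safe in this model, which is what would make the llvm.undef experiment worth trying.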