Add gatherf_byte_inds for byte indices from memory #511

rygorous · 2024-11-03T19:22:23Z

All the gathers in the codebase pass a vint for indices that has _just_ been initialized from an array of uint8_ts in memory.

This is significant because for the NEON/SSE emulation paths, there is no native gather instruction to begin with and the first step is to get the indices back to the integer pipe and split them into individual pieces. In this case it is definitely better to just load the indices on the int pipes to begin with; this formulation facilitates that. (Needs to be a template because unlike the original gatherf, there is no vint argument that implies the vector width for overload resolution.)
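A minimal portable sketch of the idea, not the code from this PR: `vfloat<N>` is a hypothetical `std::array` stand-in for a SIMD float vector, and the lane count `N` is an illustrative template parameter. It shows why the template argument has to be spelled out (neither pointer argument encodes the vector width) and that each index is read with an ordinary scalar load.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an N-lane SIMD float vector (assumption, not a library type).
template <std::size_t N>
using vfloat = std::array<float, N>;

// Sketch of a gather whose indices are bytes in memory.
// N cannot be deduced from the arguments, so callers must name it explicitly.
template <std::size_t N>
vfloat<N> gatherf_byte_inds(const float* base, const std::uint8_t* indices)
{
    assert(base != nullptr && indices != nullptr);

    vfloat<N> result{};
    for (std::size_t lane = 0; lane < N; ++lane)
    {
        // The index comes straight from memory as a scalar, so nothing has to
        // be moved out of a vector register before the load can be issued.
        result[lane] = base[indices[lane]];
    }
    return result;
}
```

A call site would look like `gatherf_byte_inds<4>(table, index_bytes)`, with the width chosen by the caller rather than inferred from a vint argument.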

Additionally, the gathers in this codebase don't actually make use of predication (the predicates are always all on). That means we have a subset of gather functionality that is fairly easy to emulate manually: indices are readily available on the integer pipes, and no predication, so all we need to do is perform a known number of vector loads and assemble the result.

Therefore, provide an option to avoid gather instructions even on AVX2 where they do exist. Gather performance is middling on newer Intel uArchs and outright bad on older (pre-Skylake) P-core Intel uArchs, Intel's E-cores, and AMD's offerings. At least on my home Zen 4, doing the 8 broadcasts + shuffles is _much_ faster than using the native gather instructions, to the tune of a ~13.5% reduction in total coding time.
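One way an unpredicated 8-lane gather could be assembled without a gather instruction is sketched below. This is an illustration under stated assumptions, not necessarily the exact sequence used in this PR: the function name `gather8_manual` is hypothetical, the lane merging uses `_mm256_blend_ps` as one possible choice, and a scalar fallback is included so the snippet also builds where AVX2 is not enabled.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: writes base[indices[i]] to out[i] for i in 0..7.
// All lanes are always active, so no predicate mask is needed.
void gather8_manual(const float* base, const std::uint8_t* indices, float* out)
{
    assert(base != nullptr && indices != nullptr && out != nullptr);

#if defined(__AVX2__)
    // Eight scalar broadcasts, merged in two independent dependency chains:
    // m0 collects the even lanes, m1 collects the odd lanes.
    __m256 m0 = _mm256_broadcast_ss(base + indices[0]);
    __m256 m1 = _mm256_broadcast_ss(base + indices[1]);
    m0 = _mm256_blend_ps(m0, _mm256_broadcast_ss(base + indices[2]), 1 << 2);
    m1 = _mm256_blend_ps(m1, _mm256_broadcast_ss(base + indices[3]), 1 << 3);
    m0 = _mm256_blend_ps(m0, _mm256_broadcast_ss(base + indices[4]), 1 << 4);
    m1 = _mm256_blend_ps(m1, _mm256_broadcast_ss(base + indices[5]), 1 << 5);
    m0 = _mm256_blend_ps(m0, _mm256_broadcast_ss(base + indices[6]), 1 << 6);
    m1 = _mm256_blend_ps(m1, _mm256_broadcast_ss(base + indices[7]), 1 << 7);

    // 0xaa selects lanes 1, 3, 5 and 7 from m1 and the remaining lanes from m0.
    _mm256_storeu_ps(out, _mm256_blend_ps(m0, m1, 0xaa));
#else
    // Portable fallback with the same result, used when AVX2 is not enabled.
    for (int lane = 0; lane < 8; ++lane)
    {
        out[lane] = base[indices[lane]];
    }
#endif
}
```

Splitting the work into two chains keeps each blend dependent on only half of the loads, at the cost of one extra blend at the end.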

Test results (using MSVC 2022 as compiler):

- On Intel Skylake-X, using the manual gathers is appreciably slower than the native gather instructions. (+6% coding time in my tests)
- On AMD Zen 2 and Zen 4, avoiding gathers is much faster (as noted above, 13.5% reduction on Zen 4).
- On Intel Redwood Cove and Intel Crestmont, avoiding gathers comes out around 3-4% faster in my tests depending on the test.


solidpixel · 2024-11-04T21:42:53Z

On my home machine (Intel i5-6500K, CoffeeLake):

- SSE4.1: 3-4% faster by avoiding the byte-to-int conversion.
- NoGather AVX2: around 6% slower (tested with both Clang 14 and GCC 11).

solidpixel · 2024-11-04T22:23:23Z

On my laptop (Apple M1):

- NEON: 2% faster by avoiding the byte-to-int conversion.

Fabian Giesen and others added 2 commits November 3, 2024 11:20

Merge branch 'main' into avoid-gathers

a8bcb2f

solidpixel self-requested a review November 4, 2024 21:43

Invert the CMake option to avoid negated options

635c87b

solidpixel merged commit 546f9dd into ARM-software:main Nov 4, 2024
7 checks passed

rygorous deleted the avoid-gathers branch November 9, 2024 00:22
