Add gatherf_byte_inds for byte indices from memory #511
Every gather in the codebase passes a vint of indices that has just been initialized from an array of uint8_t values in memory.
This matters because the NEON/SSE emulation paths have no native gather instruction: their first step is to move the indices back to the integer pipe and split them into individual values. In that case it is clearly better to load the indices on the integer pipe in the first place, and this formulation makes that possible. (It needs to be a template because, unlike the original gatherf, there is no vint argument to imply the vector width for overload resolution.)
Additionally, the gathers in this codebase don't actually make use of predication (the predicates are always all on). That means we have a subset of gather functionality that is fairly easy to emulate manually: indices are readily available on the integer pipes, and no predication, so all we need to do is perform a known number of vector loads and assemble the result.
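To illustrate the shape of the idea, here is a minimal portable sketch. It is not the code in this PR: the `vfloat4_sketch` and `vfloat8_sketch` types, the `_sketch` function name, and the self-check helper are hypothetical stand-ins, and a plain scalar loop takes the place of the per-ISA vector loads and shuffles. It only shows why the vector type has to be a template argument and why the lack of predication gives a fixed number of loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the library's vector types; only the width matters here.
struct vfloat4_sketch { static constexpr std::size_t width = 4; float lane[4]; };
struct vfloat8_sketch { static constexpr std::size_t width = 8; float lane[8]; };

// Gather floats from `base` using byte indices read straight from memory.
// The vector type must be named explicitly: no argument implies the width.
template <typename vfloatN>
vfloatN gatherf_byte_inds_sketch(const float* base, const std::uint8_t* indices)
{
    vfloatN result{};
    // No predication: every lane is always loaded, so the trip count is fixed.
    for (std::size_t i = 0; i < vfloatN::width; ++i)
    {
        result.lane[i] = base[indices[i]];
    }
    return result;
}

// Small self-check: gathers four values and verifies each lane.
inline bool gather_sketch_selfcheck()
{
    const float table[6] = {10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
    const std::uint8_t idx[4] = {5, 0, 3, 3};
    vfloat4_sketch v = gatherf_byte_inds_sketch<vfloat4_sketch>(table, idx);
    assert(v.lane[0] == 15.0f && v.lane[1] == 10.0f);
    assert(v.lane[2] == 13.0f && v.lane[3] == 13.0f);
    return true;
}
```

A call site names the width explicitly, for example `gatherf_byte_inds_sketch<vfloat8_sketch>(table, idx)`, where the original gatherf would have deduced it from its vint argument.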
Therefore, provide an option to avoid gather instructions even on AVX2 where they do exist. Gather performance is middling on newer Intel uArchs and outright bad on older (pre-Skylake) P-core Intel uArchs, Intel's E-cores, and AMD's offerings. At least on my home Zen 4, doing the 8 broadcasts + shuffles is much faster than using the native gather instructions, to the tune of a ~13.5% reduction in total coding time.
Test results: (using MSVC 2022 as compiler)