-
Notifications
You must be signed in to change notification settings - Fork 113
Open
Milestone
Description
A meta-issue tracking various ideas for SIMD-optimized string algorithms. A reason why we're making our own string APIs is because the C string library is made for null-terminated strings, which is quite useless when the major use case is working on slices of larger strings, such as in parsers. And (of course) the C++ counterparts are too bloated and either impose an implicit allocation or require a too new C++ standard.
SIMD in general
- Needs Compile-time and runtime CPU feature (SIMD) detection and dispatch #115 to be merged
-
Figure out how to dispatchhandled in Compile-time and runtime CPU feature (SIMD) detection and dispatch #115 as well
-
Construction
- Add a way to automatically mark a view as
Global
if it's a C string literal (using__constant_string_p()
compiler builtin, details in constexpr StringView without literals #123) - Add a way to automatically mark an arbitrary
const char*
asGlobal
by checking if the pointer is in a read-only memory page -- https://github.com/RobloxResearch/SIMDString/blob/main/SIMDString.cpp#L54-L75
Searching
- SIMD-optimized
memchr()
,memchr2()
,memchr3()
etc., used infind(char)
,findAnyOf()
,split()
etc. for quickly skipping until end of line or end of a JSON string -- https://docs.rs/memchr/2.3.4/src/memchr/x86/avx.rs.html- There's also a SSE 4.2 implementation here: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (the null-terminated aspect is optional in the instructions, I hope?)
- "But yes, any strcmp/strlen style function that uses SSE/AVX/etc. without a buffer size parameter, must first read in smaller units until it gets to a 16/32 byte aligned address, after which it can start using SSE/AVX safely without having to worry about faulting even if it reads past the end of the string." means I wouldn't need to have the tail part, unlike the Rust implementation? needs significant amount of testing tho, and probably could still trigger ASan with growable arrays, ARM MTE etc.
- Possibly an inverse (
findAnyNotOf()
) when there's a lot of candidates andmemchr13()
would be impossible to vectorize ("find the first that's not a space or newline and then decide what to do next") - Movemask isn't on ARM, but https://twitter.com/Danlark1/status/1539344315356520450 should be a better option than emulating it
- The scalar fallback (and less-than-a-vector fallback?) could be optimized to use Duff's Device or SWAR: https://lore.kernel.org/lkml/20220710142822.52539-1-arthurchang09@gmail.com/T/#u
- Anything useful in https://github.com/ashvardanian/Stringzilla?
- There's also an AVX512 variant in https://github.com/Kraionix/ImGuiMemchrBench/blob/master/ImMemchrBench/immemchr.h
- There's also a SSE 4.2 implementation here: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (the null-terminated aspect is optional in the instructions, I hope?)
-
memrchr()
and variants of above forfindLastAnyOf()
. Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box. -
memmem()
for faster generalfind()
because of coursestrstr()
is useless and I think we could do better thanmemcmp()
in a loop. Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.- and
memrmem()
forfindLast()
- if ever needed, a STL-less alternative to
std::boyer_moore_searcher
(but this one needs a lot of preprocessing state)- https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm
- Exact Packed String Matching (2013) is the best?
- https://smart-tool.github.io/smart/
- check https://github.com/davidesantangelo/krep for anything useful (it picks SSE4.2 for certain search sizes, AVX2 is also advertised but not actually present)
- and
- Levenshtein distance ("did you mean ...") implementation finding the closest string from a list when
Arguments
, plugin name, ... is mistyped --https://github.com/ninja-build/ninja/blob/ca041d88f4d610332aa48c801342edfafb622ccb/src/edit_distance.cc#L20-L69
Comparison
- Check pointer equality before calling into
memcmp()
inStringView::operator==()
, could save a lot especially when comparing literals that the compiler might have deduplicated- Don't do that in
String
tho
- Don't do that in
- Would we gain anything by implementing
memcmp()
ourselves?- especially for SSO strings that have a fixed size, which could be a single (masked) instruction?
- by not having to explicitly test for
nullptr
whensize == 0
just to not hit an UB because the standard is stupid and generally disallows passingnullptr
to any string/memory function even if the size is zero?
- Case-insensitive comparison -- http://www.phoronix.com/scan.php?page=news_item&px=Glibc-strcasecmp-AVX2-EVEX
- Possibly useful for extension comparison in
Any*
plugins, OTOH there it's probably faster to normalize the extension first and then do 100memcmp()
s
- Possibly useful for extension comparison in
Unicode
- UTF-16 surrogates to UTF-8 and back, for JSON
\u
parsing - SIMD-ified UTF-8 codepoint counting and validation -- https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
- UTF-8 character counting including counting decomposed characters (such as
a
and¨
as one), needs some Unicode categorization table- does https://github.com/ridiculousfish/widecharwidth work like that or not?
Number-to-string
For Utility::Debug
, Utility::format()
etc. The core should be a direct overhead-less API working on builtin types (writing into a statically-sized char[]
, e.g.), with convenience wrappers above.
- integer-to-string conversion instead of stupidly using
printf()
-- even a dumbwhile
loop would be better- https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/
- https://github.com/jeaiii/itoa (and explanation from https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/), doesn't seem to use any explicit SIMD?
- separate optimized hex, bin and oct variant
- float-to-string conversion instead of abusing
printf()
- Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone?
- MSVC
to_chars
used Ryu, not sure if it's still: https://twitter.com/StephanTLavavej/status/992881364142702593 - Older benchmarks here: https://github.com/miloyip/dtoa-benchmark
- MSVC
- Explicit option to write or fail on NaN and Inf (for JSON compat)
- Uppercase or lowercase exponent
- Option to write hex floats (or a separate function that doesn't need the complex algo?)
- Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone?
- Any easy gains with a vectorized variant? "Convert 8 numbers to 8 strings at once"
- Useful for JSON, OBJ, ... export
String-to-number
Because strto*()
has insane usability issues.
- string-to-integer conversion instead of stupidly using
strtol()
-- even a dumbwhile
loop could be better- have 8-, 16-, 32- bit variants that check the range on their own
- remove the stupid handling of
-
for unsigned types (who thought that was a good idea??) - https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
- https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html and the lengthy post linked from there
- separate optimized hex, bin and oct variant
- string-to-float conversion instead of using the shitty & slow
strtof()
/strtod()
- What's the algorithm here? Some pointers could be in comments at https://lemire.me/blog/2020/09/10/parsing-floats-in-c-benchmarking-strtod-vs-from_chars/
- Explicit option to parse or fail on NaN and Inf (for JSON compat)
- Explicit option to write or ignore hex floats (or a separate function that doesn't need the complex algo?)
- Also watch out for accumulated rounding errors when going to float through a double: https://www.exploringbinary.com/double-rounding-errors-in-decimal-to-double-to-float-conversions/
- Any easy gains with a vectorized variant? "Convert 8 strings to 8 numbers at once"
- Useful for JSON, OBJ, ... parsing
General printing
- Is some printf compatibility desired? like https://github.com/tfc/pprintpp
- Optimization ideas from Rust: https://internals.rust-lang.org/t/changing-core-fmt-for-speed/5154
Metadata
Metadata
Assignees
Labels
No labels
Projects
Status
TODO
Status
TODO