String API additions and SIMD optimizations

A meta-issue tracking various ideas for SIMD-optimized string algorithms. A reason why we're making our own string APIs is because the C string library is made for null-terminated strings, which is quite useless when the major use case is working on slices of larger strings, such as in parsers. And (of course) the C++ counterparts are too bloated and either impose an implicit allocation or require a too new C++ standard.

# SIMD in general

- [x] Needs #115 to be merged
  - [x] ~~Figure out how to dispatch~~ handled in #115 as well
  
# Construction

- [ ] Add a way to automatically mark a view as `Global` if it's a C string literal (using `__constant_string_p()` compiler builtin, details in #123)
- [ ] Add a way to automatically mark an arbitrary `const char*` as `Global` by checking if the pointer is in a read-only memory page -- https://github.com/RobloxResearch/SIMDString/blob/main/SIMDString.cpp#L54-L75

# Searching

- [ ] SIMD-optimized `memchr()`, `memchr2()`, `memchr3()` etc., used in `find(char)`, `findAnyOf()`, `split()` etc. for quickly skipping until end of line or end of a JSON string -- https://docs.rs/memchr/2.3.4/src/memchr/x86/avx.rs.html
  - [ ] There's also a SSE 4.2 implementation here: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (the null-terminated aspect is optional in the instructions, *I hope*?)
    - [ ] "But yes, any strcmp/strlen style function that uses SSE/AVX/etc. without a buffer size parameter, must first read in smaller units until it gets to a 16/32 byte aligned address, after which it can start using SSE/AVX safely without having to worry about faulting even if it reads past the end of the string." means I wouldn't need to have the tail part, unlike the Rust implementation? needs significant amount of testing tho, and probably could still trigger ASan with growable arrays, [ARM MTE](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/enhancing-memory-safety) etc. 
  - [ ] Possibly an inverse (`findAnyNotOf()`) when there's a lot of candidates and `memchr13()` would be impossible to vectorize ("find the first that's not a space or newline and then decide what to do next")
  - [ ] Movemask isn't on ARM, but https://twitter.com/Danlark1/status/1539344315356520450 should be a better option than emulating it
  - [ ] The scalar fallback (and less-than-a-vector fallback?) could be optimized to use Duff's Device or SWAR: https://lore.kernel.org/lkml/20220710142822.52539-1-arthurchang09@gmail.com/T/#u
  - [ ] Anything useful in https://github.com/ashvardanian/Stringzilla?
  - [ ] There's also an AVX512 variant in https://github.com/Kraionix/ImGuiMemchrBench/blob/master/ImMemchrBench/immemchr.h
- [ ] `memrchr()` and variants of above for `findLastAnyOf()`. [Linux has it](https://man.archlinux.org/man/memrchr.3), not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.
- [ ] `memmem()` for faster general `find()` because of course `strstr()` is useless and I think we could do better than `memcmp()` in a loop. [Linux has it](https://man.archlinux.org/man/memmem.3), not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.
  - [ ] and `memrmem()` for `findLast()`
  - [ ] if ever needed, a STL-less alternative to [`std::boyer_moore_searcher`](https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_searcher) (but this one needs a lot of preprocessing state)
    - https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm
    - Exact Packed String Matching (2013) is the best?
    - https://smart-tool.github.io/smart/
    - check https://github.com/davidesantangelo/krep for anything useful (it picks SSE4.2 for certain search sizes, AVX2 is also advertised but not actually present)
- [ ] Levenshtein distance ("did you mean ...") implementation finding the closest string from a list when `Arguments`, plugin name, ... is mistyped --https://github.com/ninja-build/ninja/blob/ca041d88f4d610332aa48c801342edfafb622ccb/src/edit_distance.cc#L20-L69
    
# Comparison

- [ ] Check pointer equality before calling into `memcmp()` in `StringView::operator==()`, could save a lot especially when comparing literals that the compiler might have deduplicated
  - Don't do that in `String` tho
- [ ] Would we gain anything by implementing `memcmp()` ourselves?
  - especially for SSO strings that have a fixed size, which could be a single (masked) instruction?
  - by not having to explicitly test for `nullptr` when `size == 0` just to not hit an UB because the standard is stupid and generally disallows passing `nullptr` to any string/memory function even if the size is zero?
- [ ] Case-insensitive comparison -- http://www.phoronix.com/scan.php?page=news_item&px=Glibc-strcasecmp-AVX2-EVEX
  - Possibly useful for extension comparison in `Any*` plugins, OTOH there it's probably faster to normalize the extension first and then do 100 `memcmp()`s
  
# Unicode

- [ ] UTF-16 surrogates to UTF-8 and back, for JSON `\u` parsing
- [ ] SIMD-ified UTF-8 codepoint counting and validation -- https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
- [ ] UTF-8 character counting including counting decomposed characters (such as `a` and `¨` as one), needs some Unicode categorization table
  - does https://github.com/ridiculousfish/widecharwidth work like that or not?

# Number-to-string

For `Utility::Debug`, `Utility::format()` etc. The core should be a direct overhead-less API working on builtin types (writing into a statically-sized `char[]`, e.g.), with convenience wrappers above.

- [ ] integer-to-string conversion instead of *stupidly* using `printf()` -- even a dumb `while` loop would be better
  - https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/
  - https://github.com/jeaiii/itoa (and explanation from https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/), doesn't seem to use any explicit SIMD?
- [ ] separate optimized hex, bin and oct variant
  - https://lemire.me/blog/2021/05/17/converting-binary-integers-to-ascii-characters-apple-m1-vs-amd-zen2/
  - http://0x80.pl/articles/convert-to-hex.html
- [ ] float-to-string conversion instead of abusing `printf()`
  - [ ] Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone? 
    - MSVC `to_chars` used Ryu, not sure if it's still: https://twitter.com/StephanTLavavej/status/992881364142702593
    - Older benchmarks here: https://github.com/miloyip/dtoa-benchmark
  - [ ] Explicit option to write or fail on NaN and Inf (for JSON compat)
  - [ ] Uppercase or lowercase exponent
  - [ ] Option to write hex floats (or a separate function that doesn't need the complex algo?)
- [ ] Any easy gains with a vectorized variant? "Convert 8 numbers to 8 strings at once"
  - Useful for JSON, OBJ, ... export
  
# String-to-number

Because `strto*()` has insane usability issues.

- [ ] string-to-integer conversion instead of *stupidly* using `strtol()` -- even a dumb `while` loop could be better
  - have 8-, 16-, 32- bit variants that check the range on their own
  - remove the stupid handling of `-` for unsigned types (who thought that was a good idea??)
  - https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
  - https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html and the lengthy post linked from there
- [ ] separate optimized hex, bin and oct variant
  - http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html
- [ ] string-to-float conversion instead of using the shitty & slow `strtof()` / `strtod()`
  - [ ] What's the algorithm here? Some pointers could be in comments at https://lemire.me/blog/2020/09/10/parsing-floats-in-c-benchmarking-strtod-vs-from_chars/
  - [ ] Explicit option to parse or fail on NaN and Inf (for JSON compat)
  - [ ] Explicit option to write or ignore hex floats (or a separate function that doesn't need the complex algo?)
  - [ ] Also watch out for accumulated rounding errors when going to float through a double: https://www.exploringbinary.com/double-rounding-errors-in-decimal-to-double-to-float-conversions/
- [ ] Any easy gains with a vectorized variant? "Convert 8 strings to 8 numbers at once"
  - Useful for JSON, OBJ, ... parsing
  
# General printing

- Is some printf compatibility desired? like https://github.com/tfc/pprintpp
- Optimization ideas from Rust: https://internals.rust-lang.org/t/changing-core-fmt-for-speed/5154

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

String API additions and SIMD optimizations #129

SIMD in general

Construction

Searching

Comparison

Unicode

Number-to-string

String-to-number

General printing

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

String API additions and SIMD optimizations #129

Description

SIMD in general

Construction

Searching

Comparison

Unicode

Number-to-string

String-to-number

General printing

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions