Consider making the segmenter access CodePointTrie in a way that omits the out of range check

#1848 suggests that there's a need to further optimize `CodePointTrie` performance in the segmenter. The segmenter gets the perf benefit of #6863 automatically. However, the perf benefit of #6906 needs changes to the calling code, and upon a quick look at the segmenter code, it might be impractical to break the abstraction of the UTF decoder in the segmenter the way the abstraction is broken on the UTF-16 fast path in the normalizer.

However, while looking into this, I noticed that the segmenter always accesses `CodePointTrie` via `get32` instead of `get`. `get32` adds an extra branch compared to `get`, because `get32` guarantees the error value for inputs above the Unicode range.
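To make the difference concrete, here is a minimal sketch using a toy stand-in for the trie. `ToyTrie`, its fields, `lookup_in_range`, and `sample_trie` are hypothetical and do not reflect the real `CodePointTrie` layout; only the shape of the two entry points (`get32` taking `u32`, `get` taking `char`) is taken from the description above.

```rust
/// Toy stand-in for a code point trie (hypothetical layout, for illustration only).
struct ToyTrie {
    data: Vec<u8>,     // values for code points below `high_start`
    high_start: u32,   // code points at or above this map to `default_value`
    default_value: u8,
    error_value: u8,
}

impl ToyTrie {
    /// Accepts any `u32`, so it must branch on the Unicode range
    /// to guarantee the error value for above-range inputs.
    fn get32(&self, cp: u32) -> u8 {
        if cp > 0x10FFFF {
            return self.error_value; // the extra branch
        }
        self.lookup_in_range(cp)
    }

    /// Accepts `char`, whose type already guarantees a value <= 0x10FFFF,
    /// so no range branch is needed.
    fn get(&self, c: char) -> u8 {
        self.lookup_in_range(c as u32)
    }

    /// Shared lookup; the slice access is bounds-checked.
    fn lookup_in_range(&self, cp: u32) -> u8 {
        if cp >= self.high_start {
            return self.default_value;
        }
        self.data.get(cp as usize).copied().unwrap_or(self.error_value)
    }
}

/// Small sample: ASCII maps to 1, everything else to the default 0.
fn sample_trie() -> ToyTrie {
    ToyTrie {
        data: vec![1; 128],
        high_start: 128,
        default_value: 0,
        error_value: 255,
    }
}
```

With this toy, `sample_trie().get('a')` and `sample_trie().get32('a' as u32)` agree, and only `get32` can ever be asked about `0x110000`.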

The iterator item type for the UTF-8 cases is `char` and for Latin1 is `u8`. These could always use `get`. UTF-16 declares `u32` as the item type, though. If the UTF-16 case declared `char` as the item type, the generic code could always use `get` for `CodePointTrie` access.

AFAICT, the UTF-16 iteration code already performs the branches that are required to find out whether a surrogate pair is well-formed, so returning U+FFFD as `char` would not add branches.
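As a rough illustration of why this should not add branches, here is a sketch of a UTF-16 decoding step that yields `char`. This is not the segmenter's actual iterator; `next_char` and `decode_all` are hypothetical names. The point is that the U+FFFD return sits at the end of the surrogate checks that a well-formedness test has to perform anyway.

```rust
/// Decodes one scalar value starting at `*pos`, advancing `*pos`.
/// Unpaired surrogates are reported as U+FFFD.
fn next_char(units: &[u16], pos: &mut usize) -> Option<char> {
    let first = *units.get(*pos)?;
    *pos += 1;
    if !(0xD800..=0xDFFF).contains(&first) {
        // Not a surrogate: the code unit is the scalar value.
        return Some(char::from_u32(first as u32).unwrap_or('\u{FFFD}'));
    }
    if first <= 0xDBFF {
        // High surrogate: check whether a low surrogate follows.
        if let Some(&second) = units.get(*pos) {
            if (0xDC00..=0xDFFF).contains(&second) {
                *pos += 1;
                let cp = 0x10000
                    + (((first as u32) - 0xD800) << 10)
                    + ((second as u32) - 0xDC00);
                return Some(char::from_u32(cp).unwrap_or('\u{FFFD}'));
            }
        }
    }
    // Lone low surrogate, or high surrogate without a following low
    // surrogate. The branches above already established this.
    Some('\u{FFFD}')
}

/// Convenience wrapper that decodes a whole buffer.
fn decode_all(units: &[u16]) -> Vec<char> {
    let mut pos = 0;
    let mut out = Vec::new();
    while let Some(c) = next_char(units, &mut pos) {
        out.push(c);
    }
    out
}
```

Because the item type is `char`, generic code consuming this iterator could call `get` on the trie without a range check.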

If I've misread something and returning U+FFFD from the iterator for unpaired surrogates would pessimize perf, it would be worthwhile to test adding a `CodePointTrie` getter that takes a `u32` in the code point range (scalar value or surrogate) and only guarantees that something memory-safe happens when the argument is above the Unicode range. Since the small lookup code is bounds-checked, passing an above-range argument to it will do something memory-safe. The most likely semantic change is that above-range arguments would be caught by the `high_start` check and would yield the default value instead of the error value.
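A sketch of what such a getter might look like on a toy model follows. `RelaxedTrie`, `get32_assume_in_range`, and `relaxed_sample` are hypothetical names, and the flat-table layout is an assumption made for illustration; the real trie's behavior for above-range input would need to be verified against its actual lookup code.

```rust
/// Toy model (hypothetical layout) used to contrast the two getters.
struct RelaxedTrie {
    data: Vec<u8>,
    high_start: u32,
    default_value: u8,
    error_value: u8,
}

impl RelaxedTrie {
    /// Existing-style getter: explicit range branch, error value above U+10FFFF.
    fn get32(&self, cp: u32) -> u8 {
        if cp > 0x10FFFF {
            return self.error_value;
        }
        self.get32_assume_in_range(cp)
    }

    /// Hypothetical relaxed getter: the caller promises a scalar value or
    /// surrogate. Above-range input is still memory-safe because the table
    /// access is bounds-checked, but it is caught by the `high_start` check
    /// and therefore yields the default value, not the error value.
    fn get32_assume_in_range(&self, cp: u32) -> u8 {
        if cp >= self.high_start {
            return self.default_value;
        }
        self.data.get(cp as usize).copied().unwrap_or(self.error_value)
    }
}

/// Small sample: ASCII maps to 1, everything else to the default 0.
fn relaxed_sample() -> RelaxedTrie {
    RelaxedTrie {
        data: vec![1; 128],
        high_start: 128,
        default_value: 0,
        error_value: 255,
    }
}
```

In this toy, `get32(0x110000)` returns the error value while `get32_assume_in_range(0x110000)` returns the default value, which is the semantic change described above.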

Consider making the segmenter access CodePointTrie in a way that omits the out of range check #6923
