#1848 suggests that there's a need to further optimize `CodePointTrie` performance in the segmenter. The segmenter gets the perf benefit of #6863 automatically. However, the perf benefit of #6906 needs changes to the calling code, and upon a quick look at the segmenter code, it might be impractical to break the abstraction of the UTF decoder in the segmenter the way the abstraction is broken on the UTF-16 fast path in the normalizer.
However, while looking into this, I noticed that the segmenter always accesses `CodePointTrie` via `get32` instead of `get`. `get32` adds an extra branch compared to `get`, because `get32` guarantees the error value for inputs above the Unicode range.
The iterator item type for the UTF-8 cases is `char` and for Latin1 is `u8`. These could always use `get`. UTF-16 declares `u32` as the item type, though. If the UTF-16 case declared `char` as the item type, the generic code could always use `get` for `CodePointTrie` access.
AFAICT, the UTF-16 iteration code already performs the branches that are required to find out whether a surrogate pair is well-formed, so returning U+FFFD as `char` would not add branches.
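As a rough illustration of that point, the sketch below decodes UTF-16 into `char` values and substitutes U+FFFD for unpaired surrogates. `decode_utf16_lossy` is a hypothetical helper written for this issue, not the segmenter's actual iterator; it returns a `Vec<char>` only to keep the example short.

```rust
/// Hypothetical helper: decodes UTF-16 code units into `char`s, yielding
/// U+FFFD for each unpaired surrogate. Not the segmenter's real iterator.
pub fn decode_utf16_lossy(units: &[u16]) -> Vec<char> {
    let mut out = Vec::with_capacity(units.len());
    let mut i = 0;
    while i < units.len() {
        let unit = units[i];
        i += 1;
        // Branch 1: is this code unit a surrogate at all?
        if !(0xD800..=0xDFFF).contains(&unit) {
            // Non-surrogate BMP values are always valid scalar values.
            out.push(char::from_u32(u32::from(unit)).unwrap_or(char::REPLACEMENT_CHARACTER));
            continue;
        }
        // Branch 2: is it a high surrogate followed by a low surrogate?
        if unit <= 0xDBFF {
            if let Some(&next) = units.get(i) {
                if (0xDC00..=0xDFFF).contains(&next) {
                    i += 1;
                    let cp = 0x10000
                        + ((u32::from(unit) - 0xD800) << 10)
                        + (u32::from(next) - 0xDC00);
                    out.push(char::from_u32(cp).unwrap_or(char::REPLACEMENT_CHARACTER));
                    continue;
                }
            }
        }
        // The pairing checks above already ran, so reaching this point
        // costs no additional comparison: emit the replacement character.
        out.push(char::REPLACEMENT_CHARACTER);
    }
    out
}
```

The well-formedness checks are needed whether the fallback is a raw surrogate as `u32` or U+FFFD as `char`; in this sketch only the value pushed on the final line would differ. The standard library's `char::decode_utf16` combined with `unwrap_or(char::REPLACEMENT_CHARACTER)` gives the same results.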
If I've misread something and returning U+FFFD from the iterator for unpaired surrogates would pessimize perf, it would be worthwhile to test adding a `CodePointTrie` getter that took a `u32` that is in the code point range (scalar value or surrogate) and that only guarantees that something memory-safe happens when the argument is above the Unicode range. Since the small lookup code is bound-checked, passing an above-range argument to it will do something memory-safe. The most likely semantic change would be that above-range arguments would be caught by the `high_start` check, and above-range arguments would result in the default value instead of the error value.
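A minimal sketch of what such a relaxed getter might look like, again on a toy structure rather than the real trie. `HighStartTrie` and `get32_in_code_point_range` are hypothetical names, and the behavior of the `high_start` check here simply follows the description above.

```rust
/// Toy structure with a `high_start` cutoff. Hypothetical; it does not
/// mirror the real `CodePointTrie` internals.
pub struct HighStartTrie {
    /// Values for code points below `high_start`.
    pub values: Vec<u8>,
    /// First code point that is not stored individually.
    pub high_start: u32,
    /// Returned for code points at or above `high_start`.
    pub default_value: u8,
    /// Returned by `get32` for inputs above U+10FFFF.
    pub error_value: u8,
}

impl HighStartTrie {
    /// Strict getter: guarantees the error value above the Unicode range.
    pub fn get32(&self, code_point: u32) -> u8 {
        if code_point > 0x10FFFF {
            return self.error_value;
        }
        self.get32_in_code_point_range(code_point)
    }

    /// Hypothetical relaxed getter: the caller promises a scalar value or
    /// surrogate. Above-range input stays memory-safe because it is caught
    /// by the `high_start` check and the bounds-checked table access, but
    /// it yields the default value rather than the error value.
    pub fn get32_in_code_point_range(&self, code_point: u32) -> u8 {
        if code_point >= self.high_start {
            return self.default_value;
        }
        self.values
            .get(code_point as usize)
            .copied()
            .unwrap_or(self.default_value)
    }
}
```

In this sketch an unpaired surrogate such as `0xD800` and an out-of-range value such as `0x110000` both take the `high_start` path, which is the semantic change anticipated above.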