Consider making the segmenter access CodePointTrie in a way that omits the out of range check #6923

@hsivonen

Description

#1848 suggests that there's a need to further optimize CodePointTrie performance in the segmenter. The segmenter gets the perf benefit of #6863 automatically. However, the perf benefit of #6906 needs changes to the calling code, and upon a quick look at the segmenter code, it might be impractical to break the abstraction of the UTF decoder in the segmenter the way the abstraction is broken on the UTF-16 fast path in the normalizer.

However, when looking into this, I noticed that the segmenter always accesses CodePointTrie by get32 instead of get. get32 adds an extra branch compared to get, because get32 guarantees the error value for inputs above the Unicode range.
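To illustrate where the extra branch comes from, here is a minimal sketch using a toy lookup table. `ToyTrie`, its block layout, and `set_block` are hypothetical and are not ICU4X's CodePointTrie implementation; only the names `get` and `get32` mirror the getters discussed above.

```rust
const CODE_POINT_MAX: u32 = 0x10FFFF;

/// Toy stand-in for a code point lookup structure (hypothetical layout:
/// one value per block of 256 code points). Not ICU4X's CodePointTrie.
pub struct ToyTrie {
    blocks: Vec<u8>,
    error_value: u8,
}

impl ToyTrie {
    pub fn new(default_value: u8, error_value: u8) -> Self {
        // 0x1100 blocks of 256 code points cover U+0000..=U+10FFFF.
        ToyTrie { blocks: vec![default_value; 0x1100], error_value }
    }

    /// Sets the value for the whole 256-code-point block containing `cp`.
    pub fn set_block(&mut self, cp: u32, value: u8) {
        self.blocks[(cp >> 8) as usize] = value;
    }

    /// Shared lookup that assumes `cp <= CODE_POINT_MAX`.
    fn lookup(&self, cp: u32) -> u8 {
        self.blocks[(cp >> 8) as usize]
    }

    /// Takes any u32, so it has to branch on the Unicode range first
    /// in order to guarantee the error value for above-range inputs.
    pub fn get32(&self, cp: u32) -> u8 {
        if cp > CODE_POINT_MAX {
            return self.error_value;
        }
        self.lookup(cp)
    }

    /// Takes a char, which the type system already guarantees to be in
    /// range, so the explicit range branch is not needed.
    pub fn get(&self, c: char) -> u8 {
        self.lookup(c as u32)
    }
}
```

In this sketch the only difference between the two getters is the `cp > CODE_POINT_MAX` comparison, which is the branch that a char-typed caller can skip.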

The iterator item type for the UTF-8 cases is char and for Latin1 is u8. These could always use get. UTF-16 declares u32 as the item type, though. If the UTF-16 case declared char as the item type, the generic code could always use get for CodePointTrie access.

AFAICT, the UTF-16 iteration code already performs the branches that are required to find out whether a surrogate pair is well-formed, so returning U+FFFD as char would not add branches.
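As a rough sketch of that claim, the following hand-written UTF-16 decoder yields char and substitutes U+FFFD for unpaired surrogates. `next_char_lossy` and `decode_lossy` are hypothetical helpers written for this illustration, not the segmenter's actual iterator; the point is only that the surrogate checks needed to decode at all are the same checks that detect the error case.

```rust
/// Decodes one code point from `units` starting at `*pos` and advances
/// `*pos` past it. Hypothetical helper, not the segmenter's iterator.
pub fn next_char_lossy(units: &[u16], pos: &mut usize) -> Option<char> {
    let first = *units.get(*pos)?;
    *pos += 1;

    // Branch 1: is this code unit a surrogate at all?
    if !(0xD800..=0xDFFF).contains(&first) {
        // Non-surrogate code units are always scalar values.
        return Some(char::from_u32(u32::from(first)).unwrap_or(char::REPLACEMENT_CHARACTER));
    }

    // Branch 2: a high surrogate must be followed by a low surrogate.
    if first <= 0xDBFF {
        if let Some(&second) = units.get(*pos) {
            if (0xDC00..=0xDFFF).contains(&second) {
                *pos += 1;
                let cp = 0x10000
                    + ((u32::from(first) - 0xD800) << 10)
                    + (u32::from(second) - 0xDC00);
                return Some(char::from_u32(cp).unwrap_or(char::REPLACEMENT_CHARACTER));
            }
        }
    }

    // Unpaired surrogate: already detected by the branches above, so
    // returning U+FFFD here needs no additional check.
    Some(char::REPLACEMENT_CHARACTER)
}

/// Convenience wrapper that decodes a whole slice.
pub fn decode_lossy(units: &[u16]) -> Vec<char> {
    let mut pos = 0;
    let mut out = Vec::new();
    while let Some(c) = next_char_lossy(units, &mut pos) {
        out.push(c);
    }
    out
}
```

With char as the item type, generic code downstream could call the char-taking getter for every input encoding. Whether the real iterator compiles to the same number of branches would still need to be measured.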

If I've misread something and returning U+FFFD from the iterator for unpaired surrogates would pessimize perf, it would be worthwhile to test adding a CodePointTrie getter that takes a u32 in the code point range (scalar value or surrogate) and only guarantees that something memory-safe happens when the argument is above the Unicode range. Since the small lookup code is bounds-checked, passing an above-range argument to it will do something memory-safe. The most likely semantic change would be that above-range arguments would be caught by the high_start check and would therefore result in the default value instead of the error value.
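The semantic change described above can be sketched with another toy structure. `HighStartTrie` and `get32_relaxed` are hypothetical names for this illustration; the flat `data` vector is an assumption and does not reflect the real trie's index and data arrays.

```rust
/// Toy structure with a `high_start` cutoff (hypothetical layout, not
/// ICU4X's CodePointTrie): code points at or above `high_start` share
/// one default value, and lower code points index into `data`.
pub struct HighStartTrie {
    data: Vec<u8>,
    high_start: u32,
    default_value: u8,
    error_value: u8,
}

impl HighStartTrie {
    pub fn new(data: Vec<u8>, default_value: u8, error_value: u8) -> Self {
        let high_start = data.len() as u32;
        HighStartTrie { data, high_start, default_value, error_value }
    }

    /// Strict getter: guarantees the error value above the Unicode range.
    pub fn get32(&self, cp: u32) -> u8 {
        if cp > 0x10FFFF {
            return self.error_value;
        }
        self.get32_relaxed(cp)
    }

    /// Hypothetical relaxed getter: omits the Unicode range check and
    /// only promises memory safety for above-range arguments.
    pub fn get32_relaxed(&self, cp: u32) -> u8 {
        // Above-range arguments are caught here and get the default
        // value rather than the error value.
        if cp >= self.high_start {
            return self.default_value;
        }
        // Bounds-checked access keeps the lookup memory-safe.
        self.data.get(cp as usize).copied().unwrap_or(self.error_value)
    }
}
```

In this toy, both getters agree for every argument up to U+10FFFF and differ only above it, where the relaxed getter returns the default value.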

Metadata

    Labels

    A-performance (Area: Performance (CPU, Memory)), C-segmentation (Component: Segmentation)
