Add iterators over UTF/Latin1 slices that fuse a CodePointTrie lookup #6925

@hsivonen

The special boundary values of CodePointTrie are designed to line up with special boundaries in UTF-8 and, in the case of the fast mode, with the special boundary in UTF-16.

(For both trie types, 1-byte UTF-8, i.e. ASCII, can go directly to the data slice. For the fast trie type, the special boundary lines up with the boundary between 1-code-unit and 2-code-unit UTF-16 sequences and the boundary between 3-code-unit and 4-code-unit UTF-8 sequences. For the small type, the special boundary lines up with the boundary between 2-code-unit and 3-code-unit UTF-8 sequences.)
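The alignment for the fast type can be sketched as below. This is only an illustration, not ICU4X code: the constant `FAST_TYPE_FAST_MAX` and every function name are hypothetical, and the value `0xFFFF` is assumed from the description above (the boundary between 1-code-unit and 2-code-unit UTF-16). The point is that the fast/slow decision can be read off the UTF-8 lead byte or a single UTF-16 code unit, without first assembling the code point.

```rust
/// Assumed inclusive upper bound of the fast-type trie's fast range.
/// Hypothetical name; the value follows from the UTF-16 boundary described above.
const FAST_TYPE_FAST_MAX: u32 = 0xFFFF;

/// Reference check done on the fully decoded code point.
fn in_fast_range(c: char) -> bool {
    (c as u32) <= FAST_TYPE_FAST_MAX
}

/// Same decision from the UTF-8 lead byte alone (well-formed input assumed):
/// lead bytes below 0xF0 start sequences of at most three code units.
fn utf8_lead_is_fast(lead: u8) -> bool {
    lead < 0xF0
}

/// Same decision from one UTF-16 code unit: any non-surrogate unit is a
/// complete code point in the Basic Multilingual Plane.
fn utf16_unit_is_fast(unit: u16) -> bool {
    !(0xD800..=0xDFFF).contains(&unit)
}

/// Returns true when both encoding-level checks agree with the reference check.
fn checks_agree(c: char) -> bool {
    let mut b8 = [0u8; 4];
    let mut b16 = [0u16; 2];
    let lead = c.encode_utf8(&mut b8).as_bytes()[0];
    let unit = c.encode_utf16(&mut b16)[0];
    utf8_lead_is_fast(lead) == in_fast_range(c)
        && utf16_unit_is_fast(unit) == in_fast_range(c)
}
```

The sketch covers only the fast type; the small type would need its own boundary check.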

When we iterate over a slice of textual data by char and then look each char up in the trie, the lookup has to re-execute branches that the UTF decoding already executed.

We should have iterators over str, potentially-not-well-formed UTF-8, UTF-16, and Latin1 that take a reference to a TypedCodePointTrie in addition to a slice and yield pairs of char and TrieValue. (Or perhaps we need a trait that the untyped CodePointTrie also implements.)
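A minimal sketch of what such an iterator over str might look like follows. Everything named here is hypothetical and not part of icu_collections: the trait `FusedTrie` stands in for whatever the typed or untyped trie would expose, `CharsWithTrie` is an invented iterator name, and `PathTag` is a mock that only reports which entry point was used. The sketch assumes the fast trie type, so the split is ASCII, the rest of the BMP, and supplementary code points.

```rust
/// Hypothetical trait: one entry point per range that the decoder already
/// distinguishes, so the lookup does not have to branch on the range again.
trait FusedTrie {
    type Value: Copy;
    fn ascii(&self, byte: u8) -> Self::Value;
    fn bmp(&self, cp: u32) -> Self::Value;
    fn supplementary(&self, cp: u32) -> Self::Value;
}

/// Hypothetical iterator yielding each char together with its trie value.
struct CharsWithTrie<'a, T: FusedTrie> {
    bytes: std::slice::Iter<'a, u8>,
    trie: &'a T,
}

impl<'a, T: FusedTrie> CharsWithTrie<'a, T> {
    fn new(s: &'a str, trie: &'a T) -> Self {
        Self { bytes: s.as_bytes().iter(), trie }
    }
}

impl<'a, T: FusedTrie> Iterator for CharsWithTrie<'a, T> {
    type Item = (char, T::Value);

    fn next(&mut self) -> Option<Self::Item> {
        let trie = self.trie;
        let bytes = &mut self.bytes;
        let lead = u32::from(*bytes.next()?);
        // `str` is well-formed UTF-8, so continuation bytes are always present.
        let mut cont = || bytes.next().map_or(0, |b| u32::from(*b) & 0x3F);
        // Branch once on the lead byte; each arm decodes and looks up.
        let (cp, value) = if lead < 0x80 {
            (lead, trie.ascii(lead as u8))
        } else if lead < 0xE0 {
            let cp = ((lead & 0x1F) << 6) | cont();
            (cp, trie.bmp(cp))
        } else if lead < 0xF0 {
            let cp = ((lead & 0x0F) << 12) | (cont() << 6) | cont();
            (cp, trie.bmp(cp))
        } else {
            let cp = ((lead & 0x07) << 18) | (cont() << 12) | (cont() << 6) | cont();
            (cp, trie.supplementary(cp))
        };
        char::from_u32(cp).map(|c| (c, value))
    }
}

/// Mock trie for illustration: the value records which entry point ran.
struct PathTag;

impl FusedTrie for PathTag {
    type Value = u8;
    fn ascii(&self, _byte: u8) -> u8 { 0 }
    fn bmp(&self, _cp: u32) -> u8 { 1 }
    fn supplementary(&self, _cp: u32) -> u8 { 2 }
}

/// Convenience helper for trying the iterator out.
fn tag_all(s: &str) -> Vec<(char, u8)> {
    CharsWithTrie::new(s, &PathTag).collect()
}
```

A real implementation would replace the three trait methods with the trie's actual index computations; the sketch only shows where the decoder's existing branches would hand off to them.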

For str and Latin1, these belong in the ICU4X repo itself. For potentially-not-well-formed UTF-8 and UTF-16, perhaps the utf8_iter and utf16_iter crates should get a feature that brings in icu_collections as an optional dependency to enable fusing the trie lookup into the UTF decoding.

Metadata

    Labels

    A-performance (Area: Performance (CPU, Memory)), C-collator (Component: Collation, normalization), C-segmentation (Component: Segmentation), C-unicode (Component: Props, sets, tries)
