Add iterators over UTF/Latin1 slices that fuse a CodePointTrie lookup

The special boundary values of `CodePointTrie` are designed to line up with special boundaries in UTF-8 and, in the case of the fast mode, with the special boundary in UTF-16.

(For both trie types, 1-byte UTF-8, i.e. ASCII, can go directly to the data slice. For the fast trie type, the special boundary lines up with the boundary between 1-code-unit and 2-code-unit UTF-16 sequences and the boundary between 3-code-unit and 4-code-unit UTF-8 sequences. For the small type, the special boundary lines up with the boundary between 2-code-unit and 3-code-unit UTF-8 sequences.)

When we iterate over a slice of textual data by `char` and then look each `char` up in the trie, the lookup has to re-execute range-check branches that UTF decoding has already executed.

We should have iterators over `str`, potentially-not-well-formed UTF-8, UTF-16, and Latin1 that take a reference to a `TypedCodePointTrie` in addition to a slice and that yield pairs of `char` and `TrieValue`. (Or perhaps we need a trait that the untyped `CodePointTrie` also implements.)
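To make the idea concrete, here is a minimal, self-contained sketch of what such a fused iterator over `str` could look like. Everything in it is hypothetical and only for illustration: the `FusedLookup` trait, its three methods, `CharsWithTrie`, and `ToyTrie` are not existing `icu_collections` APIs, and `ToyTrie` returns fixed stand-in values instead of reading real trie data. The sketch models the fast trie type as described above: the branch on the UTF-8 lead byte selects both the decode path and the lookup path, so the trie code never has to re-test which range the code point falls in.

```rust
/// Hypothetical trait (not an `icu_collections` API): one lookup entry point
/// per UTF-8 length class, so the iterator never re-tests the code point range.
pub trait FusedLookup {
    type Value: Copy;
    /// Code point known to be ASCII (would index the data slice directly).
    fn ascii(&self, byte: u8) -> Self::Value;
    /// Code point known to come from a 2- or 3-byte sequence (U+0080..=U+FFFF).
    fn bmp_non_ascii(&self, c: u32) -> Self::Value;
    /// Code point known to come from a 4-byte sequence (above U+FFFF).
    fn supplementary(&self, c: u32) -> Self::Value;
}

/// Iterator over a `str` that yields each `char` together with its trie value.
pub struct CharsWithTrie<'a, T: FusedLookup> {
    bytes: &'a [u8],
    trie: &'a T,
}

impl<'a, T: FusedLookup> CharsWithTrie<'a, T> {
    pub fn new(text: &'a str, trie: &'a T) -> Self {
        Self { bytes: text.as_bytes(), trie }
    }
}

impl<'a, T: FusedLookup> Iterator for CharsWithTrie<'a, T> {
    type Item = (char, T::Value);

    fn next(&mut self) -> Option<Self::Item> {
        let (&lead, rest) = self.bytes.split_first()?;
        if lead < 0x80 {
            // 1-byte sequence: the ASCII branch doubles as the trie fast path.
            self.bytes = rest;
            return Some((char::from(lead), self.trie.ascii(lead)));
        }
        // The input is a `str`, so the lead byte alone determines the length.
        let len = if lead < 0xE0 { 2 } else if lead < 0xF0 { 3 } else { 4 };
        let (seq, tail) = self.bytes.split_at(len);
        self.bytes = tail;
        // Payload bits of the lead byte, then six bits per continuation byte.
        let mut c = u32::from(lead) & (0x7F >> len);
        for &b in &seq[1..] {
            c = (c << 6) | (u32::from(b) & 0x3F);
        }
        // The length branch taken above also selects the lookup path.
        let value = if len == 4 {
            self.trie.supplementary(c)
        } else {
            self.trie.bmp_non_ascii(c)
        };
        let ch = char::from_u32(c).expect("input came from a `str`");
        Some((ch, value))
    }
}

/// Toy stand-in for a trie: returns fixed values per path, not real data.
pub struct ToyTrie;

impl FusedLookup for ToyTrie {
    type Value = u8;
    fn ascii(&self, byte: u8) -> u8 {
        if byte.is_ascii_alphabetic() { 1 } else { 0 }
    }
    fn bmp_non_ascii(&self, _c: u32) -> u8 {
        2
    }
    fn supplementary(&self, _c: u32) -> u8 {
        3
    }
}

/// Usage helper: collect the `(char, value)` pairs for a string.
pub fn fused_pairs(text: &str) -> Vec<(char, u8)> {
    CharsWithTrie::new(text, &ToyTrie).collect()
}
```

A real implementation would replace the three `ToyTrie` methods with the corresponding index and data accesses of the trie, and the variants for potentially-not-well-formed UTF-8 and UTF-16 would additionally need an error path that yields U+FFFD.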

For `str` and Latin1, these belong in the ICU4X repo itself. For potentially-not-well-formed UTF-8 and UTF-16, perhaps the `utf8_iter` and `utf16_iter` crates should get a feature that brings in `icu_collections` as an optional dependency to enable fusing the trie lookup into the UTF decoding.

Add iterators over UTF/Latin1 slices that fuse a CodePointTrie lookup #6925
