Description
The special boundary values of `CodePointTrie` are designed to line up with special boundaries in UTF-8 and, in the case of the fast mode, with the special boundary in UTF-16.
(For both trie types, 1-byte UTF-8, i.e. ASCII, can go directly to the data slice. For the fast trie type, the special boundary lines up with the boundary between 1-code-unit and 2-code-unit UTF-16 sequences and the boundary between 3-code-unit and 4-code-unit UTF-8 sequences. For the small type, the special boundary lines up with the boundary between 2-code-unit and 3-code-unit UTF-8 sequences.)
When we iterate over a slice of textual data by `char` and then look each `char` up in the trie, the lookup has to re-execute branches that UTF decoding has already executed.
We should have iterators over `str`, potentially-not-well-formed UTF-8, UTF-16, and Latin1 that take a reference to a `TypedCodePointTrie` in addition to a slice, and that yield pairs of `char` and `TrieValue`. (Or perhaps we need a trait that the untyped `CodePointTrie` also implements.)
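To make the proposed shape concrete, here is a minimal sketch of such an iterator over `str`. Everything in it is hypothetical and not part of `icu_collections`: the `TrieLookup` trait stands in for whatever trait typed and untyped tries might share, `ToyTrie` is a flat table rather than a real `CodePointTrie`, and `chars_with_trie` is an invented name. The point is only that the ASCII branch taken during UTF-8 decoding also selects the direct data lookup, so the range check is not repeated. A real implementation would presumably dispatch the 2-, 3-, and 4-byte cases to the matching trie paths as well.

```rust
/// Hypothetical trait; a stand-in for something both trie flavors could implement.
pub trait TrieLookup {
    type Value: Copy;
    /// ASCII lookup: in a real trie this would index the data slice directly.
    fn get_ascii(&self, b: u8) -> Self::Value;
    /// General lookup for any code point.
    fn get(&self, cp: u32) -> Self::Value;
}

/// Toy "trie": a flat ASCII table plus one default value for everything else.
pub struct ToyTrie {
    ascii: [u8; 128],
    default: u8,
}

impl ToyTrie {
    pub fn new(default: u8) -> Self {
        ToyTrie { ascii: [default; 128], default }
    }

    pub fn set_ascii(&mut self, c: char, v: u8) {
        assert!(c.is_ascii());
        self.ascii[c as usize] = v;
    }
}

impl TrieLookup for ToyTrie {
    type Value = u8;

    fn get_ascii(&self, b: u8) -> u8 {
        self.ascii[b as usize]
    }

    fn get(&self, cp: u32) -> u8 {
        if cp < 0x80 { self.ascii[cp as usize] } else { self.default }
    }
}

/// Iterator that decodes UTF-8 and looks up the trie value in the same pass.
pub struct CharsWithTrie<'a, T: TrieLookup> {
    bytes: &'a [u8],
    trie: &'a T,
}

/// Hypothetical constructor name.
pub fn chars_with_trie<'a, T: TrieLookup>(s: &'a str, trie: &'a T) -> CharsWithTrie<'a, T> {
    CharsWithTrie { bytes: s.as_bytes(), trie }
}

impl<'a, T: TrieLookup> Iterator for CharsWithTrie<'a, T> {
    type Item = (char, T::Value);

    fn next(&mut self) -> Option<Self::Item> {
        let (&first, rest) = self.bytes.split_first()?;
        if first < 0x80 {
            // One branch serves both the decoder and the trie: ASCII goes
            // straight to the data lookup without a second range check.
            self.bytes = rest;
            return Some((first as char, self.trie.get_ascii(first)));
        }
        // The input came from a `str`, so it is well-formed UTF-8 and the
        // lead byte alone determines the sequence length.
        let (len, init) = match first {
            0xC0..=0xDF => (2, (first & 0x1F) as u32),
            0xE0..=0xEF => (3, (first & 0x0F) as u32),
            _ => (4, (first & 0x07) as u32),
        };
        let mut cp = init;
        for &b in &self.bytes[1..len] {
            cp = (cp << 6) | (b & 0x3F) as u32;
        }
        self.bytes = &self.bytes[len..];
        // Always a valid scalar value for well-formed input; the fallback
        // only keeps the sketch free of `unsafe`.
        let c = char::from_u32(cp).unwrap_or('\u{FFFD}');
        Some((c, self.trie.get(cp)))
    }
}
```

A caller would write `for (c, value) in chars_with_trie(text, &trie) { ... }` instead of iterating `text.chars()` and calling the trie separately for each `char`.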
For `str` and Latin1, these belong in the ICU4X repo itself. For potentially-not-well-formed UTF-8 and UTF-16, perhaps the `utf8_iter` and `utf16_iter` crates should get a feature that brings in `icu_collections` as an optional dependency to enable fusing the trie lookup into the UTF decoding.