Commit c11db10

utf8n_to_uvchr_msgs: More shortcut of UTF-8 invariants
This adds an explicit test for UTF-8 invariants (ASCII-range characters) ahead of the DFA table lookup, so the table is not consulted at all for them. Besides the lookup, this also removes the calculation of the returned length and a potential jump for these characters. The cost is one extra conditional that is wasted on non-ASCII input. I consider this trade-off to be a wash, but it enables future simplifications.
1 parent 4921e06 commit c11db10

1 file changed

+11
-8
lines changed

inline.h

Lines changed: 11 additions & 8 deletions
@@ -3024,14 +3024,19 @@ Perl_utf8n_to_uvchr_msgs(const U8 *s,
                                             flags, errors, msgs);
 #endif
 
-    type = PL_strict_utf8_dfa_tab[*s];
+    /* UTF-8 invariants are returned unchanged. The code below is quite
+     * capable of handling this, but this shortcuts this very common case
+     * */
+    if (UTF8_IS_INVARIANT(*s)) {
+        if (retlen) {
+            *retlen = 1;
+        }
 
-    /* The table is structured so that 'type' is 0 iff the input byte is
-     * represented identically regardless of the UTF-8ness of the string */
-    if (type == 0) { /* UTF-8 invariants are returned unchanged */
-        uv = *s;
+        return *s;
     }
-    else {
+
+    type = PL_strict_utf8_dfa_tab[*s];
+
         UV state = PL_strict_utf8_dfa_tab[256 + type];
         uv = (0xff >> type) & NATIVE_UTF8_TO_I8(*s);
 
@@ -3049,8 +3054,6 @@ Perl_utf8n_to_uvchr_msgs(const U8 *s,
         /* Here is potentially problematic. Use the full mechanism */
         return _utf8n_to_uvchr_msgs_helper(s0, curlen, retlen, flags,
                                            errors, msgs);
-    }
-
   success:
     if (retlen) {
         *retlen = s - s0 + 1;
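
The shape of this change can be illustrated outside of Perl with a minimal, self-contained sketch. It is not the code in inline.h: `is_invariant`, `decode_multibyte` and `decode_one` are hypothetical names, the invariant test assumes an ASCII platform (high bit clear), and the slow path is a deliberately simplified decoder that does not reject overlong sequences or surrogates the way Perl's strict DFA does. The point is only that an invariant byte returns early, with the length set to 1, before any multi-byte machinery is touched.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UTF8_IS_INVARIANT, assuming an ASCII platform:
 * a byte is invariant when its high bit is clear. */
static int is_invariant(uint8_t b)
{
    return (b & 0x80) == 0;
}

/* Hypothetical slow path. Decodes a 2- to 4-byte sequence, returning U+FFFD
 * and a length of 1 if the sequence is truncated or has a bad continuation
 * byte. Simplified: overlongs and surrogates are NOT rejected. */
static uint32_t decode_multibyte(const uint8_t *s, size_t len, size_t *retlen)
{
    size_t need;
    uint32_t cp;

    if      ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else                            { need = 0; cp = 0; }

    if (need == 0 || need > len) {
        goto bad;
    }
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            goto bad;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (retlen) {
        *retlen = need;
    }
    return cp;

  bad:
    if (retlen) {
        *retlen = 1;
    }
    return 0xFFFD;
}

/* Decode the first character of s. Invariants take the early return and
 * never reach the slow path; everything else falls through to it. */
uint32_t decode_one(const uint8_t *s, size_t len, size_t *retlen)
{
    if (len == 0) {
        if (retlen) {
            *retlen = 0;
        }
        return 0xFFFD;
    }

    if (is_invariant(s[0])) {
        if (retlen) {
            *retlen = 1;
        }
        return s[0];
    }

    return decode_multibyte(s, len, retlen);
}
```

Under these assumptions, calling `decode_one` on the single byte 0x41 returns 0x41 with a length of 1 via the early return, while the two bytes 0xC3 0xA9 go through the slow path and yield 0xE9 with a length of 2. As in the commit, the early return also means the remaining code no longer needs to sit inside an `else` block.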
