Commit 83c2748

bytes_from_utf8: Copy initial invariants as-is
The approach used in this commit is already in place in several other places in core. When dealing with UTF-8, the first part of a string may well contain only invariants: characters whose representation is the same whether or not the string is encoded as UTF-8. There is a function that finds the first position in a string that is not an invariant. It works on a whole word at a time instead of per-byte, which can speed up the scan by up to a factor of 8 on platforms with 8-byte words. In this case, calling that function tells us how much of the string we can copy with memcpy() before having to switch to looking at individual bytes.
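As a rough illustration of the word-at-a-time scan described above, here is a minimal sketch. It is not Perl's implementation: `first_variant_offset` is a hypothetical name, and the code assumes an ASCII platform (where a variant is any byte with its high bit set) and the availability of `uint64_t`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of Perl's API.  Returns the offset of the
 * first byte in s[0..len) whose high bit is set, or len if there is none.
 * Assumes an ASCII platform, where such bytes are exactly the variants. */
static size_t first_variant_offset(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Test 8 bytes per iteration.  memcpy() into a local word sidesteps
     * alignment and aliasing problems. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);
        if (word & high_bits)
            break;              /* some byte in this word is a variant */
        i += sizeof word;
    }

    /* Pinpoint the variant within the word, or finish the short tail,
     * one byte at a time. */
    while (i < len && s[i] < 0x80)
        i++;

    return i;
}
```

Everything before the returned offset can be copied verbatim; only the remainder needs per-byte inspection.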
1 parent 5a979ea commit 83c2748

1 file changed: +11 -1 lines changed

utf8.c

Lines changed: 11 additions & 1 deletion
@@ -2679,12 +2679,22 @@ Perl_bytes_from_utf8_loc(const U8 *s, STRLEN *lenp, bool *is_utf8p, const U8** f
     }
 
     const U8 * const s0 = s;
-    const U8 * send = s + *lenp;
+    const U8 * const send = s + *lenp;
+    const U8 * first_variant;
+
+    /* The initial portion of 's' that consists of invariants can be Copied
+     * as-is.  If it is entirely invariant, the whole thing can be Copied. */
+    if (is_utf8_invariant_string_loc(s, *lenp, &first_variant)) {
+        first_variant = send;
+    }
 
     U8 *d;
     Newx(d, (*lenp) + 1, U8);
+    Copy(s, d, first_variant - s, U8);
 
     U8 *converted_start = d;
+    d += first_variant - s;
+    s = first_variant;
 
     while (s < send) {
         U8 c = *s++;
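
For readers without the rest of utf8.c at hand, the shape of the change can be sketched in standalone C. This is a simplified stand-in, not the patched function: `downgrade_utf8` is a hypothetical name, `malloc()`/`memcpy()` stand in for `Newx()`/`Copy()`, the input is assumed to be well-formed UTF-8 on an ASCII platform, and the failure handling (returning NULL) is an assumption made for brevity.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the patched function.  Converts
 * UTF-8 to single bytes.  Returns a malloc'd, NUL-terminated buffer and
 * stores its length in *out_len; returns NULL if a character does not fit
 * in one byte or if allocation fails. */
static unsigned char *downgrade_utf8(const unsigned char *s, size_t len,
                                     size_t *out_len)
{
    const unsigned char * const send = s + len;

    /* Locate the end of the invariant prefix.  A byte loop is used here
     * for brevity; the commit relies on a word-at-a-time scan instead. */
    const unsigned char *first_variant = s;
    while (first_variant < send && *first_variant < 0x80)
        first_variant++;

    unsigned char *d0 = malloc(len + 1);
    if (d0 == NULL)
        return NULL;

    /* Copy the invariant prefix in bulk, then skip past it. */
    size_t prefix = (size_t)(first_variant - s);
    memcpy(d0, s, prefix);
    unsigned char *d = d0 + prefix;
    s = first_variant;

    /* Only the remainder needs to be examined byte by byte. */
    while (s < send) {
        unsigned char c = *s++;
        if (c >= 0x80) {
            /* U+0080..U+00FF are encoded as 0xC2 or 0xC3 followed by one
             * continuation byte; anything else cannot be downgraded. */
            if ((c != 0xC2 && c != 0xC3) || s >= send
                || (*s & 0xC0) != 0x80)
            {
                free(d0);
                return NULL;
            }
            c = (unsigned char)(((c & 0x03) << 6) | (*s++ & 0x3F));
        }
        *d++ = c;
    }

    *d = '\0';
    *out_len = (size_t)(d - d0);
    return d0;
}
```

For an input that is entirely invariant, the prefix covers the whole string and the per-byte loop never runs, which is the case the commit's comment calls out.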
