Commit 5718b26

Improve UTF-8 overflow/overlong handling
Perl's extended UTF-8 is capable of representing code points up to 2**72 (2**65 on EBCDIC). These won't fit in a 64-bit word, and hence overflow. (And much more so on 32-bit machines.)
A start byte of \xFF is required for code points starting with 2**36; \xFE for those starting with 2**31, and so on. But it turns out that a sequence beginning with \xFE can express all code points 0..2**36-1, and \xFF sequences can express everything 0..2**72-1. When a sequence represents a code point that can be expressed by a shorter sequence, it is called an overlong, and using those is expressly forbidden by the Unicode standard due to spoofing attacks that have occurred. So, a \xFE start byte should only be used for code points in the range 2**31..2**36-1; and \xFF only for 2**36..2**65-1.
But Perl must also handle input that doesn't match these expectations. We have tried to determine all the malformations that apply to a given sequence and return them to the caller when requested. The interplay between overflow and overlong is somewhat tricky, and the new tests that are to be added in the next commit showed that we haven't been doing it completely right.
Prior to this commit, the checks for both overlong and overflow had three states: yes, no, and maybe, the last meaning that the sequence being examined was shorter than a full character, and that some possible completions of it would result in yes and some in no. This commit retains the tripartite state when examining a sequence for being overlong, but adds a fourth state for overflow: the input overflows unless the sequence is overlong, and there aren't enough bytes to determine the latter for sure. Since overlongs are rare, this state means that the input almost certainly overflows.
Prior to this commit, I had tried to cope with some of this by an extra parameter to the find-if-overflow function, but this fourth state removes the need for that. The caller learns which state the input is in, and then chooses how to handle it, without needing the parameter. The tests in utf8decode.t also had to be changed, as this new code picks up some overflows on 32-bit machines that were previously not caught.
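
As a rough illustration of the four overflow states described in the commit message, the following C sketch models how a result might be derived and how callers could consume it. It is not the code from utf8.c: `classify()`, its parameters, and the two helper predicates are hypothetical names introduced here, and the sketch assumes the byte comparison and the overlong check have already been performed elsewhere.

```c
#include <assert.h>
#include <stdbool.h>

/* The four states, ordered so that '>=' comparisons are meaningful. */
enum overflow_state {
    NO_OVERFLOW                = 0,  /* definitely doesn't overflow */
    COULD_OVERFLOW             = 1,  /* too few bytes to tell */
    ALMOST_CERTAINLY_OVERFLOWS = 2,  /* overflows unless it is overlong */
    OVERFLOWS                  = 3   /* definitely overflows */
};

/* Hypothetical, simplified model of the decision (an assumption, not
 * Perl's implementation).
 *   cmp_with_max:  <0 if the input bytes sort below the highest
 *                  representable sequence, 0 if equal so far, >0 if above
 *   complete:      whether the input is a whole character
 *   overlong:      1 = overlong, 0 = not overlong, -1 = can't tell yet
 *   overlong_value_fits: consulted only when overlong is 1; whether the
 *                  value the overlong actually encodes fits in a word */
static enum overflow_state
classify(int cmp_with_max, bool complete, int overlong,
         bool overlong_value_fits)
{
    assert(overlong >= -1 && overlong <= 1);

    if (cmp_with_max < 0) {
        return NO_OVERFLOW;
    }
    if (cmp_with_max == 0) {
        return complete ? NO_OVERFLOW : COULD_OVERFLOW;
    }

    /* From here on, a well-formed sequence would overflow. */
    if (overlong == 0) {
        return OVERFLOWS;
    }
    if (overlong < 0) {
        return ALMOST_CERTAINLY_OVERFLOWS;
    }
    return overlong_value_fits ? NO_OVERFLOW : OVERFLOWS;
}

/* A caller that wants to flag probable overflow uses '>='. */
static bool
treat_as_overflow(enum overflow_state st)
{
    return st >= ALMOST_CERTAINLY_OVERFLOWS;
}

/* A caller that only rejects certain overflow tests for equality. */
static bool
definitely_overflows(enum overflow_state st)
{
    return st == OVERFLOWS;
}
```

Both styles of check appear at the call sites changed in utf8.c: one compares the result with `== OVERFLOWS`, the others with `>= ALMOST_CERTAINLY_OVERFLOWS`, so the caller chooses the policy rather than passing a flag into the function.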
1 parent b015ed3 commit 5718b26

4 files changed: +42 -58 lines changed

embed.fnc

Lines changed: 1 addition & 2 deletions
@@ -5941,8 +5941,7 @@ RS |UV |check_locale_boundary_crossing \
         |NN STRLEN *lenp
 RTi |int |does_utf8_overflow \
         |NN const U8 * const s \
-        |NN const U8 *e \
-        |const bool consider_overlongs
+        |NN const U8 *e
 RTi |int |isFF_overlong |NN const U8 * const s \
         |const STRLEN len
 Ri |bool |is_utf8_common |NN const U8 * const p \

proto.h

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

t/op/utf8decode.t

Lines changed: 3 additions & 3 deletions
@@ -189,9 +189,9 @@ __DATA__
 3.4 Concatenation of incomplete sequences
 3.4.1 N15 - 30 c0:e0:80:f0:80:80:f8:80:80:80:fc:80:80:80:80:df:ef:bf:f7:bf:bf:fb:bf:bf:bf:fd:bf:bf:bf:bf - unexpected non-continuation byte 0xe0, immediately after start byte 0xc0
 3.5 Impossible bytes (but not with Perl's extended UTF-8)
-3.5.1 n - 1 fe - 1 byte available, need 7
-3.5.2 n - 1 ff - 1 byte available, need 13
-3.5.3 N7 - 4 fe:fe:ff:ff - byte 0xfe
+3.5.1 N2,1 - 1 fe - 1 byte available, need 7
+3.5.2 N2,1 - 1 ff - 1 byte available, need 13
+3.5.3 N11,7 - 4 fe:fe:ff:ff - byte 0xfe
 4 Overlong sequences
 4.1 Examples of an overlong ASCII character
 4.1.1 n - 2 c0:af - overlong
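
The `c0:af` case in test 4.1.1 can be checked by hand. What follows is a minimal sketch, assuming standard UTF-8 on an ASCII platform and looking only at two-byte sequences; `two_byte_overlong()` is a hypothetical helper written for this illustration, not a function from the Perl source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Perl).  Returns 1 if the two-byte
 * sequence at 's' is overlong, 0 if it is not, and -1 if 's' does not
 * hold a complete two-byte sequence.  On a 0 or 1 return, the decoded
 * code point is stored in '*cp'. */
static int
two_byte_overlong(const uint8_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);

    /* Need a 110xxxxx start byte followed by a 10xxxxxx continuation */
    if (len < 2 || (s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80) {
        return -1;
    }

    *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);

    /* Anything below 0x80 fits in a single byte, so using two bytes
     * for it is overlong */
    return (*cp < 0x80) ? 1 : 0;
}
```

With this helper, `c0:af` decodes to 0x2F (`/`), which a single byte can already express, so it is reported as overlong; `c3:a9` decodes to 0xE9 and is not.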

utf8.c

Lines changed: 37 additions & 52 deletions
@@ -597,31 +597,31 @@ S_isFF_overlong(const U8 * const s, const STRLEN len)
 #endif
 
 PERL_STATIC_INLINE int
-S_does_utf8_overflow(const U8 * const s,
-                     const U8 * e,
-                     const bool consider_overlongs)
+S_does_utf8_overflow(const U8 * const s, const U8 * e)
 {
     PERL_ARGS_ASSERT_DOES_UTF8_OVERFLOW;
 
     /* Returns an int indicating whether or not the UTF-8 sequence from 's' to
      * 'e' - 1 would overflow an IV on this platform; that is if it represents
-     * a code point larger than the highest representable code point. It
-     * returns 1 if it does overflow; 0 if it doesn't, and -1 if there isn't
-     * enough information to tell. This last return value can happen if the
-     * sequence is incomplete, missing some trailing bytes that would form a
-     * complete character. If there are enough bytes to make a definitive
-     * decision, this function does so.
-     *
-     * If 'consider_overlongs' is TRUE, the function checks for the possibility
-     * that the sequence is an overlong that doesn't overflow. Otherwise, it
-     * assumes the sequence is not an overlong. This can give different
-     * results only on ASCII 32-bit platforms.
-     *
-     * (For ASCII platforms, we could use memcmp() because we don't have to
-     * convert each byte to I8, but it's very rare input indeed that would
-     * approach overflow, so the loop below will likely only get executed once.)
-     *
-     */
+     * a code point larger than the highest representable code point. The
+     * possible returns are: */
+#define NO_OVERFLOW 0 /* Definitely doesn't overflow */
+
+/* There aren't enough examinable bytes available to be sure. This can happen
+ * if the sequence is incomplete, missing some trailing bytes that would form a
+ * complete character. */
+#define COULD_OVERFLOW 1
+
+/* This overflows if not also overlong, and like COULD_OVERFLOW, there aren't
+ * enough available bytes to be sure, but since overlongs are very rarely
+ * encountered, for most purposes consider it to overflow */
+#define ALMOST_CERTAINLY_OVERFLOWS 2
+
+#define OVERFLOWS 3 /* Definitely overflows */
+
+/* Note that the values are ordered so that you can use '>=' in checking
+ * the return value. */
+
     const STRLEN len = e - s;
     const U8 *x;
     const U8 * y = (const U8 *) HIGHEST_REPRESENTABLE_UTF;
@@ -634,13 +634,13 @@ S_does_utf8_overflow(const U8 * const s,
          * bytes larger than those omitted bytes, and therefore 'x' can't
          * overflow */
         if (*y == '\0') {
-            return 0;
+            return NO_OVERFLOW;
         }
 
         /* If this byte is less than the corresponding highest non-overflowing
          * UTF-8, the sequence doesn't overflow */
         if (NATIVE_UTF8_TO_I8(*x) < *y) {
-            return 0;
+            return NO_OVERFLOW;
         }
 
         if (UNLIKELY(NATIVE_UTF8_TO_I8(*x) > *y)) {
@@ -651,30 +651,20 @@ S_does_utf8_overflow(const U8 * const s,
     /* Got to the end, and all bytes are the same. If the input is a whole
      * character, it doesn't overflow. And if it is a partial character,
      * there's not enough information to tell */
-    return (len >= STRLENs(HIGHEST_REPRESENTABLE_UTF)) ? 0 : -1;
+    return (len >= STRLENs(HIGHEST_REPRESENTABLE_UTF)) ? NO_OVERFLOW
+                                                       : COULD_OVERFLOW;
 
   overflows_if_not_overlong: ;
 
-    /* Here, a well-formed sequence overflows. If we are assuming
-     * well-formedness, return that it overflows. */
-    if (! consider_overlongs) {
-        return 1;
-    }
-
-    /* Here, it could be the overlong malformation, and might not actually
-     * overflow if you were to calculate it out.
-     *
-     * See if it actually is overlong */
+    /* Here, the sequence overflows if not overlong. Check for that */
     int is_overlong = is_utf8_overlong(s, len);
-
-    /* If it isn't overlong, is well-formed, so overflows */
-    if (is_overlong == 0) {
-        return 1;
+    if (LIKELY(is_overlong == 0)) {
+        return OVERFLOWS;
     }
 
     /* Not long enough to determine */
     if (is_overlong < 0) {
-        return -1;
+        return ALMOST_CERTAINLY_OVERFLOWS;
     }
 
     /* Here, it appears to overflow, but it is also overlong. That overlong
@@ -705,7 +695,7 @@ S_does_utf8_overflow(const U8 * const s,
      * UTF_CONTINUATION_BYTE_INFO_BITS each. If that number of bits doesn't
      * exceed the word size, it can't overflow. */
 
-    return 0;
+    return NO_OVERFLOW;
 
 #else

@@ -717,7 +707,7 @@ S_does_utf8_overflow(const U8 * const s,
      *
      * That means only the FF start byte can have an overflowing overlong. */
     if (*s < 0xFF) {
-        return 0;
+        return NO_OVERFLOW;
     }
 
     /* The sequence \xff\x80\x80\x80\x80\x80\x80\x82 is an overlong that
/* The sequence \xff\x80\x80\x80\x80\x80\x80\x82 is an overlong that
@@ -726,12 +716,14 @@ S_does_utf8_overflow(const U8 * const s,
726716
# define OVERFLOWS_MIN_STRING "\xff\x80\x80\x80\x80\x80\x80\x82"
727717

728718
if (e - s < (Ptrdiff_t) STRLENs(OVERFLOWS_MIN_STRING)) {
729-
return -1; /* Not enough info to be sure */
719+
return ALMOST_CERTAINLY_OVERFLOWS; /* Not enough info to be sure */
730720
}
731721

732722
# define strnGE(s1,s2,l) (strncmp(s1,s2,l) >= 0)
733723

734-
return (strnGE((const char *) s, OVERFLOWS_MIN_STRING, STRLENs(OVERFLOWS_MIN_STRING)));
724+
return (strnGE((const char *) s, OVERFLOWS_MIN_STRING, STRLENs(OVERFLOWS_MIN_STRING)))
725+
? OVERFLOWS
726+
: NO_OVERFLOW;
735727

736728
#endif
737729

@@ -897,9 +889,7 @@ Perl_is_utf8_FF_helper_(const U8 * const s0, const U8 * const e,
         s++;
     }
 
-    if (0 < does_utf8_overflow(s0, e,
-                               FALSE /* Don't consider_overlongs */
-                               )) {
+    if (does_utf8_overflow(s0, e) == OVERFLOWS) {
         return 0;
     }

@@ -1569,10 +1559,7 @@ Perl__utf8n_to_uvchr_msgs_helper(const U8 *s,
 
     /* Check for overflow. The algorithm requires us to not look past the end
      * of the current character, even if partial, so the upper limit is 's' */
-    if (UNLIKELY(0 < does_utf8_overflow(s0, s,
-                                        1 /* Do consider overlongs */
-                                        )))
-    {
+    if (UNLIKELY(does_utf8_overflow(s0, s) >= ALMOST_CERTAINLY_OVERFLOWS)) {
         possible_problems |= UTF8_GOT_OVERFLOW;
         uv = UNICODE_REPLACEMENT;
     }
@@ -4126,9 +4113,7 @@ Perl_check_utf8_print(pTHX_ const U8* s, const STRLEN len)
         if (UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*s))) {
             if (UNLIKELY(UTF8_IS_SUPER(s, e))) {
                 if (   ckWARN_d(WARN_NON_UNICODE)
-                    || UNLIKELY(0 < does_utf8_overflow(s, s + len,
-                                          0 /* Don't consider overlongs */
-                                          )))
+                    || UNLIKELY(does_utf8_overflow(s, s + len) >= ALMOST_CERTAINLY_OVERFLOWS))
                 {
                     /* A side effect of this function will be to warn */
                     (void) utf8n_to_uvchr(s, e - s, NULL, UTF8_WARN_SUPER);
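
The byte-by-byte comparison against the highest representable sequence, visible in the `S_does_utf8_overflow` hunks, can be sketched in isolation. This is only an approximation under stated assumptions: `compare_prefix_to_limit()` and the `prefix_cmp` enum are hypothetical names, the limit used here (the six-byte encoding of 0x7FFFFFFF) is chosen for illustration and is not necessarily the value of `HIGHEST_REPRESENTABLE_UTF` on any platform, and the EBCDIC conversion done by `NATIVE_UTF8_TO_I8` is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative limit (an assumption): 0x7FFFFFFF in six-byte UTF-8. */
static const uint8_t LIMIT[] = { 0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF };

enum prefix_cmp {
    BELOW_LIMIT,   /* some byte sorts below the limit: cannot overflow */
    UNDETERMINED,  /* partial input matching the limit so far */
    AT_LIMIT,      /* complete input equal to the limit: fits */
    ABOVE_LIMIT    /* some byte sorts above: overflows if not overlong */
};

/* Hypothetical helper: compare the (possibly partial) sequence of 'len'
 * bytes at 's' with LIMIT, stopping at the first differing byte. */
static enum prefix_cmp
compare_prefix_to_limit(const uint8_t *s, size_t len)
{
    assert(s != NULL);

    size_t n = (len < sizeof LIMIT) ? len : sizeof LIMIT;

    for (size_t i = 0; i < n; i++) {
        if (s[i] < LIMIT[i]) {
            return BELOW_LIMIT;
        }
        if (s[i] > LIMIT[i]) {
            return ABOVE_LIMIT;
        }
    }

    /* Every examined byte matched the limit. */
    return (len >= sizeof LIMIT) ? AT_LIMIT : UNDETERMINED;
}
```

Under this illustrative limit, a lone `fe` byte already compares above the limit on its first byte, so whether it overflows then depends only on the overlong question, which is the case the new fourth state is meant to describe.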
