Commit 15d8147

Add utf8_to_uv_or_die()
This new function dies if the UTF-8 input to it is malformed. There are quite a few places in the core where we expect the input to be well-formed, and just assume that it is. This function is a drop-in replacement for those, so we no longer blindly continue if the assumption is wrong. There are also a number of places that don't make that assumption, but check the input and die immediately if it is malformed. This function replaces those too, along with the code needed to test the return value and die.
1 parent c60ca96 commit 15d8147

6 files changed, +54 -12 lines changed

embed.fnc

Lines changed: 4 additions & 0 deletions
@@ -3780,6 +3780,10 @@ CTp |bool |utf8_to_uv_msgs_helper_ \
     |U32 flags \
     |NULLOK U32 *errors \
     |NULLOK AV **msgs
+ATdip |UV |utf8_to_uv_or_die \
+    |NN const U8 * const s \
+    |NN const U8 *e \
+    |NULLOK Size_t *advance_p
 CDbdp |UV |utf8_to_uvuni |NN const U8 *s \
     |NULLOK STRLEN *retlen
 : Used in perly.y

embed.h

Lines changed: 1 addition & 0 deletions
@@ -870,6 +870,7 @@
 # define utf8_to_uv_flags Perl_utf8_to_uv_flags
 # define utf8_to_uv_msgs Perl_utf8_to_uv_msgs
 # define utf8_to_uv_msgs_helper_ Perl_utf8_to_uv_msgs_helper_
+# define utf8_to_uv_or_die Perl_utf8_to_uv_or_die
 # define utf8n_to_uvchr Perl_utf8n_to_uvchr
 # define utf8n_to_uvchr_error Perl_utf8n_to_uvchr_error
 # define utf8n_to_uvchr_msgs Perl_utf8n_to_uvchr_msgs

inline.h

Lines changed: 10 additions & 0 deletions
@@ -3138,6 +3138,16 @@ Perl_utf8_to_uv_msgs(const U8 * const s0,
     return utf8_to_uv_msgs_helper_(s0, e, cp_p, advance_p, flags, errors, msgs);
 }
 
+PERL_STATIC_INLINE UV
+Perl_utf8_to_uv_or_die(const U8 *s, const U8 *e, STRLEN *advance_p)
+{
+    PERL_ARGS_ASSERT_UTF8_TO_UV_OR_DIE;
+
+    UV cp;
+    (void) utf8_to_uv_flags(s, e, &cp, advance_p, UTF8_DIE_IF_MALFORMED);
+    return cp;
+}
+
 PERL_STATIC_INLINE UV
 Perl_utf8n_to_uvchr_msgs(const U8 * const s0,
                          STRLEN curlen,

pod/perldelta.pod

Lines changed: 4 additions & 3 deletions
@@ -436,9 +436,10 @@ New API functions are introduced to convert strings encoded in UTF-8 to
 their ordinal code point equivalent. These are safe to use by default,
 and generally more convenient to use than the existing ones.
 
-L<perlapi/C<utf8_to_uv>> replaces L<perlapi/C<utf8_to_uvchr>> (which is
-retained for backwards compatibility), but you should convert to use the
-new form, as likely you aren't using the old one safely.
+L<perlapi/C<utf8_to_uv>> and L<perlapi/C<utf8_to_uv_or_die>> replace
+L<perlapi/C<utf8_to_uvchr>> (which is retained for backwards
+compatibility), but you should convert to use the new forms, as likely
+you aren't using the old one safely.
 
 To convert in the opposite direction, you can now use
 L<perlapi/C<uv_to_utf8>>. This is not a new function, but a new synonym

proto.h

Lines changed: 5 additions & 0 deletions
Some generated files are not rendered by default.

utf8.c

Lines changed: 30 additions & 9 deletions
@@ -1003,6 +1003,7 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
 =for apidoc_item extended_utf8_to_uv
 =for apidoc_item strict_utf8_to_uv
 =for apidoc_item c9strict_utf8_to_uv
+=for apidoc_item utf8_to_uv_or_die
 =for apidoc_item utf8_to_uvchr_buf
 =for apidoc_item utf8_to_uvchr
 
@@ -1099,6 +1100,11 @@ sequence. You can use that function or C<L</utf8_to_uv_flags>> to exert more
 control over the input that is considered acceptable, and the warnings that are
 raised.
 
+C<utf8_to_uv_or_die> has a simpler interface, for use when any errors are
+fatal. It returns the code point instead of using an output parameter, and
+throws an exception with any errors found where the other functions here would
+have returned false.
+
 Often, C<s> is an arbitrarily long string containing the UTF-8 representations
 of many code points in a row, and these functions are called in the course of
 parsing C<s> to find all those code points.
@@ -1107,8 +1113,8 @@ If your code doesn't know how to deal with illegal input, as would be typical
 of a low level routine, the loop could look like:
 
     while (s < e) {
-        UV cp;
         Size_t advance;
+        UV cp;
         (void) utf8_to_uv(s, e, &cp, &advance);
         <handle 'cp'>
         s += advance;
@@ -1118,11 +1124,24 @@ A REPLACEMENT CHARACTER will be inserted everywhere that malformed input
 occurs. Obviously, we aren't expecting such outcomes, but your code will be
 protected from attacks and many harmful effects that could otherwise occur.
 
+If the situation is such that it would be a bug for the input to be invalid, a
+somewhat simpler loop suffices:
+
+    while (s < e) {
+        Size_t advance;
+        UV cp = utf8_to_uv_or_die(s, e, &advance);
+        <handle 'cp'>
+        s += advance;
+    }
+
+This will throw an exception on invalid input, so your code doesn't have to
+concern itself with that possibility.
+
 If you do have a plan for handling malformed input, you could instead write:
 
     while (s < e) {
-        UV cp;
         Size_t advance;
+        UV cp;
 
         if (UNLIKELY(! utf8_to_uv(s, e, &cp, &advance)) {
             <bail out or convert to handleable>
@@ -1142,9 +1161,10 @@ attacks against such code; and it is extra work always, as the functions have
 already done the equivalent work and return the correct value in C<advance>,
 regardless of whether the input is well-formed or not.
 
-You must always pass a non-NULL pointer into which to store the (first) code
-point C<s> represents. If you don't care about this value, you should be using
-one of the C<L</isUTF8_CHAR>> functions instead.
+Except with C<utf8_to_uv_or_die>, you must always pass a non-NULL pointer into
+which to store the (first) code point C<s> represents. If you don't care about
+this value, you should be using one of the C<L</isUTF8_CHAR>> functions
+instead.
 
 =item C<utf8_to_uvchr> forms
 
@@ -1274,8 +1294,8 @@ This flag is ignored if C<UTF8_CHECK_ONLY> is also set.
 =item C<UTF8_WARN_SURROGATE>
 
 These reject and/or warn about UTF-8 sequences that represent surrogate
-characters. The warning categories C<utf8> and C<super> control if warnings
-are actually raised.
+characters. The warning categories C<utf8> and C<non_unicode> control if
+warnings are actually raised.
 
 =item C<UTF8_DISALLOW_NONCHAR>
 
@@ -1290,7 +1310,7 @@ are actually raised.
 =item C<UTF8_WARN_SUPER>
 
 These reject and/or warn about UTF-8 sequences that represent code points
-above 0x10FFFF. The warning categories C<utf8> and C<super> control if
+above 0x10FFFF. The warning categories C<utf8> and C<non_unicode> control if
 warnings are actually raised.
 
 =item C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>
@@ -1324,7 +1344,8 @@ These reject and/or warn on encountering sequences that require Perl's
 extension to UTF-8 to represent them. These are all for code points above
 0x10FFFF, so these sequences are a subset of the ones controlled by SUPER or
 either of the illegal interchange sets of flags. The warning categories
-C<utf8>, C<super>, and C<portable> control if warnings are actually raised.
+C<utf8>, C<non_unicode>, and C<portable> control if warnings are actually
+raised.
 
 Perl predates Unicode, and earlier standards allowed for code points up through
 0x7FFF_FFFF (2**31 - 1). Perl, of course, would like you to be able to
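
The "or die" pattern this commit introduces can be tried outside the Perl core with a small stand-alone sketch. Everything named toy_* below is hypothetical and is not part of the Perl API: toy_utf8_to_uv() is a simplified decoder standing in for utf8_to_uv() (it handles only 1 to 4 byte sequences and does not implement Perl's extended UTF-8 or its warning flags), and abort() stands in for the exception Perl would throw. The point is only the shape of the wrapper: the checked form returns a bool and fills an output parameter, while the "or die" form returns the code point directly and never returns on malformed input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for utf8_to_uv(). Decodes one code point from
 * [s, e). On malformed input it returns false, leaves U+FFFD in *cp, and
 * still sets *advance (>= 1) so a caller's loop always makes progress. */
static bool
toy_utf8_to_uv(const uint8_t *s, const uint8_t *e,
               uint32_t *cp, size_t *advance)
{
    /* Smallest code point that legitimately needs a sequence of each length */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    assert(s < e);              /* precondition: at least one byte */
    *cp = 0xFFFD;               /* REPLACEMENT CHARACTER until proven valid */
    *advance = 1;

    if (s[0] < 0x80)                { *cp = s[0]; return true; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return false;          /* stray continuation or invalid start byte */

    for (i = 1; i < len; i++) {
        if (s + i >= e || (s[i] & 0xC0) != 0x80) {
            *advance = i;       /* skip only the bytes examined so far */
            return false;       /* truncated, or a non-continuation byte */
        }
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *advance = len;
    if (v < min_for_len[len] || v > 0x10FFFF)
        return false;           /* overlong encoding, or above Unicode */
    *cp = v;
    return true;
}

/* Hypothetical analogue of utf8_to_uv_or_die(): returns the code point
 * directly; advance_p may be NULL. abort() stands in for Perl's exception. */
static uint32_t
toy_utf8_to_uv_or_die(const uint8_t *s, const uint8_t *e, size_t *advance_p)
{
    uint32_t cp;
    size_t advance;

    if (!toy_utf8_to_uv(s, e, &cp, &advance)) {
        fprintf(stderr, "Malformed UTF-8 input\n");
        abort();
    }
    if (advance_p)
        *advance_p = advance;
    return cp;
}

/* The simpler loop from the documentation above: no error branch needed. */
static unsigned long
toy_sum_code_points(const uint8_t *s, const uint8_t *e)
{
    unsigned long sum = 0;

    while (s < e) {
        size_t advance;
        uint32_t cp = toy_utf8_to_uv_or_die(s, e, &advance);
        sum += cp;              /* <handle 'cp'> */
        s += advance;
    }
    return sum;
}
```

A caller that has a plan for malformed input would call toy_utf8_to_uv() and branch on its return value; a caller for which malformed input would be a bug uses toy_utf8_to_uv_or_die() and drops the branch, which is the trade-off the commit message describes.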
