Commit 0aa3d05

perlapi: Clarifications to utf8_to_uv entries
1 parent 12f6bd0 commit 0aa3d05

1 file changed: +49 -38 lines changed
utf8.c

Lines changed: 49 additions & 38 deletions
@@ -1041,24 +1041,31 @@ There are two sets of these functions:
 =item C<utf8_to_uv> forms
 
 Almost all code should use only C<utf8_to_uv>, C<extended_utf8_to_uv>,
-C<strict_utf8_to_uv>, or C<c9strict_utf8_to_uv>. The other functions are
-either the problematic old form, or are for specialized uses.
-
-These four functions each return C<true> if the sequence of bytes starting at
-C<s> form a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point.
-If so, C<*cp> will be set to the native code point value it represents, and
-C<*advance> will be set to its length, in bytes.
-
-Otherwise, each function returns C<false> and sets C<*cp> to the Unicode
-REPLACEMENT CHARACTER, and C<*advance> to the next position along C<s>, where
-the next possible UTF-8 character could begin. Failing to use this position as
-the next starting point during parsing of strings has led to successful
-attacks by crafted inputs.
+C<strict_utf8_to_uv>, C<c9strict_utf8_to_uv>, or C<utf8_to_uv_or_die>. The
+other functions are either the problematic old form, or are for specialized
+uses.
+
+C<utf8_to_uv_or_die> has a simpler interface than the other four, for use when
+any errors encountered should be fatal. It throws an exception with any errors
+found, otherwise it returns the code point the input sequence represents.
+
+The other four functions each return C<true> if the sequence of bytes starting
+at C<s> form a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point;
+or false otherwise. They take an extra parameter, the address of an IV,
+C<&cp>. C<*cp> will be set to the native code point value the sequence
+represents, and C<*advance> will be set to its length, in bytes.
+
+If the functions returns C<false>, C<*cp> is set to the Unicode REPLACEMENT
+CHARACTER, and C<*advance> to the next position along C<s>, where the next
+possible UTF-8 character could begin. Failing to use this position as the next
+starting point during parsing of strings has led to successful attacks by
+crafted inputs.
 
 The functions only examine as many bytes along C<s> as are needed to form a
-complete UTF-8 representation of a single code point, but they never examine
-the byte at C<e>, or beyond. They return false if the code point requires more
-than S<C<e - s>> bytes to represent.
+complete UTF-8 representation of a single code point; they never examine the
+byte at C<e>, or beyond. They return false (or die in the case of
+C<utf8_to_uv_or_die>) if the code point requires more than S<C<e - s>> bytes to
+represent.
 
 The functions differ only in what flavor of UTF-8 they accept. All reject
 syntactically invalid UTF-8.
@@ -1070,17 +1077,19 @@ syntactically invalid UTF-8.
 additionally rejects any UTF-8 that translates into a code point that isn't
 specified by Unicode to be freely exchangeable, namely the surrogate characters
 and non-character code points (besides non-Unicode code points, any above
-0x10FFFF). It does not raise a warning when rejecting.
+0x10FFFF). It does not raise a warning when rejecting these.
 
 =item * C<c9strict_utf8_to_uv>
 
 instead uses the exchangeable definition given by Unicode's Corregendum #9,
 which accepts non-character code points while still rejecting surrogates. It
-does not raise a warning when rejecting.
+does not raise a warning when rejecting these.
 
 =item * C<utf8_to_uv>
 
-accepts all syntactically valid UTF-8, as extended by Perl to allow 64-bit code
+=item * C<utf8_to_uv_or die>
+
+accept all syntactically valid UTF-8, as extended by Perl to allow 64-bit code
 points to be encoded.
 
 C<extended_utf8_to_uv> is merely a synonym for C<utf8_to_uv>. Use this form
@@ -1100,11 +1109,6 @@ sequence. You can use that function or C<L</utf8_to_uv_flags>> to exert more
 control over the input that is considered acceptable, and the warnings that are
 raised.
 
-C<utf8_to_uv_or_die> has a simpler interface, for use when any errors are
-fatal. It returns the code point instead of using an output parameter, and
-throws an exception with any errors found where the other functions here would
-have returned false.
-
 Often, C<s> is an arbitrarily long string containing the UTF-8 representations
 of many code points in a row, and these functions are called in the course of
 parsing C<s> to find all those code points.
@@ -1416,10 +1420,10 @@ bit set for each malformation the function found; 0 if none. The C<ALLOW>-type
 flags are ignored when determining the content of this variable. That is, even
 if you "allow" a particular malformation, if it is encountered, the
 corresponding bit will be set to notify you that one was encountered.
-The bits for malformations that are accepted by default aren't set unless the
-flags passed to the function indicate that they should be rejected or warned
-about when encountering them. These malformations are explicitly noted in the
-list below along with the controlling flags.
+However, the bits for conditions that are accepted by default aren't set
+unless the flags passed to the function indicate that they should be
+rejected or warned about when encountering them. These are explicitly
+noted in the list below along with the controlling flags.
 
 The bits returned in C<errors> and their meanings are:
 
@@ -1532,15 +1536,16 @@ be rejected or warned about.
 If you don't care about the system's messages text nor warning categories, you
 can customize error handling by calling one of the C<_error> functions, using
 either of the flags C<UTF8_ALLOW_ANY> or C<UTF8_CHECK_ONLY> to suppress any
-warnings, and then examine the C<*errors> return.
+warnings, and then examine the C<*errors> return. If you don't use those
+flags, warnings will be raised as usual.
 
-But if you do care, use one of the functions with C<_msgs> in their names.
-These allow you to completely customize error handling by suppressing any
-warnings that would otherwise be raised; instead returning all needed
+But if you do care, instead use one of the functions with C<_msgs> in their
+names. These allow you to completely customize error handling by suppressing
+any warnings that would otherwise be raised; instead returning all relevant
 information in a structure specified by an extra parameter, C<msgs>, a pointer
 to a variable which has been declared to be an C<AV*>, and into which the
-function creates a new AV to store information, described below, about all
-the malformations that were encountered.
+function creates a new AV to store information, described below, about all the
+malformations that were encountered.
 
 If the flag C<UTF8_CHECK_ONLY> is passed, this parameter is ignored.
 Otherwise, when this parameter is set, the flags C<UTF8_DIE_IF_MALFORMED> and
@@ -1549,7 +1554,7 @@ C<UTF8_FORCE_WARN_IF_MALFORMED> are ignored.
 What is considered a malformation is affected by C<flags>, the same as
 described in C<L</utf8_to_uv_flags>>. No array element is generated for
 malformations that are "allowed" by the input flags, in contrast to the
-C<_error> functions.
+bitmap returned in a non-NULL C<*errors>.
 
 Each element of the C<msgs> AV array is an anonymous hash with the following
 three key-value pairs:
@@ -1558,12 +1563,18 @@ three key-value pairs:
 
 =item C<text>
 
-A C<SVpv> containing the text of any warning message that would have ordinarily
-been generated. The function suppresses raising this warning itself.
+A C<SVpv> containing the text of the message about the problematic input.
+This text is identical to any warning that otherwise would have been raised if
+the appropriate warning categories were enabled.
 
 =item C<warn_categories>
 
-The warning category (or categories) for the message, packed into a C<SVuv>.
+This is 0 if the C<flags> parameter to the function would ordinarily not have
+caused the message to be output as a warning; otherwise it is the warning
+category (or categories) that would have been used to generate a warning for
+C<text>, packed into a C<SVuv>. For example, if C<flags> contains
+C<UTF8_DISALLOW_SURROGATE>, but not C<UTF8_WARN_SURROGATE>, this would be 0 if
+the input was a surrogate.
 
 =item C<flag>
 
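
Illustrative note (not part of the commit): the first hunk describes a calling contract in which the function returns false on malformed input, stores the REPLACEMENT CHARACTER in C<*cp>, and reports through C<*advance> where parsing should resume. The sketch below mimics that contract in plain C so it compiles without the Perl headers. The names sketch_utf8_to_uv and sketch_count_code_points are hypothetical stand-ins, not Perl API. The decoder handles only standard 1 to 4 byte UTF-8 (no UTF-EBCDIC, no Perl extended code points) and rejects surrogates and code points above 0x10FFFF. It treats *advance as a byte count to add to s, which is an assumption about how a caller would apply it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_REPLACEMENT_CHARACTER 0xFFFDUL

/* Hypothetical stand-in that follows the documented contract: returns true
 * and sets *cp / *advance for a well-formed sequence; otherwise returns
 * false, sets *cp to U+FFFD, and sets *advance to the number of bytes to
 * skip. It never reads the byte at e or beyond. */
static bool sketch_utf8_to_uv(const unsigned char *s, const unsigned char *e,
                              unsigned long *cp, size_t *advance)
{
    assert(s < e);                      /* caller must supply at least one byte */

    size_t avail = (size_t)(e - s);
    size_t need;
    unsigned long value, min;
    unsigned char b = s[0];

    if (b < 0x80) {                     /* ASCII: one byte */
        *cp = b;
        *advance = 1;
        return true;
    } else if ((b & 0xE0) == 0xC0) {
        need = 2; value = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
        need = 3; value = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
        need = 4; value = b & 0x07; min = 0x10000;
    } else {                            /* stray continuation or invalid start byte */
        *cp = SKETCH_REPLACEMENT_CHARACTER;
        *advance = 1;
        return false;
    }

    for (size_t i = 1; i < need; i++) {
        /* Stop before e, and stop at the first non-continuation byte so the
         * caller resumes exactly there. */
        if (i >= avail || (s[i] & 0xC0) != 0x80) {
            *cp = SKETCH_REPLACEMENT_CHARACTER;
            *advance = i;
            return false;
        }
        value = (value << 6) | (unsigned long)(s[i] & 0x3F);
    }

    /* Reject overlongs, surrogates, and anything above Unicode's maximum. */
    if (value < min || value > 0x10FFFFUL
        || (value >= 0xD800UL && value <= 0xDFFFUL)) {
        *cp = SKETCH_REPLACEMENT_CHARACTER;
        *advance = need;
        return false;
    }

    *cp = value;
    *advance = need;
    return true;
}

/* Walks a buffer one code point at a time. On failure it still moves s by
 * *advance rather than guessing a length from the first byte. */
static size_t sketch_count_code_points(const unsigned char *s, size_t len,
                                       size_t *malformed)
{
    const unsigned char *e = s + len;
    size_t count = 0;

    *malformed = 0;
    while (s < e) {
        unsigned long cp;
        size_t advance;

        if (!sketch_utf8_to_uv(s, e, &cp, &advance)) {
            (*malformed)++;             /* cp now holds U+FFFD */
        }
        count++;
        s += advance;
    }
    return count;
}
```

With the six bytes 0x41 0xC3 0xA9 0xE2 0x82 0x41, this loop counts four items, one of them malformed: the truncated three-byte sequence 0xE2 0x82 is consumed as a single replacement character, and the trailing 0x41 is still decoded as its own code point. Advancing by the reported amount on failure is what keeps the parser from swallowing that trailing byte.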
