@@ -1041,24 +1041,31 @@ There are two sets of these functions:
1041
1041
=item C<utf8_to_uv> forms
1042
1042
1043
1043
Almost all code should use only C<utf8_to_uv>, C<extended_utf8_to_uv>,
1044
- C<strict_utf8_to_uv>, or C<c9strict_utf8_to_uv>. The other functions are
1045
- either the problematic old form, or are for specialized uses.
1046
-
1047
- These four functions each return C<true> if the sequence of bytes starting at
1048
- C<s> form a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point.
1049
- If so, C<*cp> will be set to the native code point value it represents, and
1050
- C<*advance> will be set to its length, in bytes.
1051
-
1052
- Otherwise, each function returns C<false> and sets C<*cp> to the Unicode
1053
- REPLACEMENT CHARACTER, and C<*advance> to the next position along C<s>, where
1054
- the next possible UTF-8 character could begin. Failing to use this position as
1055
- the next starting point during parsing of strings has led to successful
1056
- attacks by crafted inputs.
1044
+ C<strict_utf8_to_uv>, C<c9strict_utf8_to_uv>, or C<utf8_to_uv_or_die>. The
1045
+ other functions are either the problematic old form, or are for specialized
1046
+ uses.
1047
+
1048
+ C<utf8_to_uv_or_die> has a simpler interface than the other four, for use when
1049
+ any errors encountered should be fatal. It throws an exception with any errors
1050
+ found; otherwise it returns the code point the input sequence represents.
1051
+
1052
+ The other four functions each return C<true> if the sequence of bytes starting
1053
+ at C<s> forms a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point;
1054
+ or false otherwise. They take an extra parameter, the address of an IV,
1055
+ C<&cp>. On success, C<*cp> will be set to the native code point value the sequence
1056
+ represents, and C<*advance> will be set to its length, in bytes.
1057
+
1058
+ If the function returns C<false>, C<*cp> is set to the Unicode REPLACEMENT
1059
+ CHARACTER, and C<*advance> to the next position along C<s>, where the next
1060
+ possible UTF-8 character could begin. Failing to use this position as the next
1061
+ starting point during parsing of strings has led to successful attacks by
1062
+ crafted inputs.
1057
1063
1058
1064
The functions only examine as many bytes along C<s> as are needed to form a
1059
- complete UTF-8 representation of a single code point, but they never examine
1060
- the byte at C<e>, or beyond. They return false if the code point requires more
1061
- than S<C<e - s>> bytes to represent.
1065
+ complete UTF-8 representation of a single code point; they never examine the
1066
+ byte at C<e>, or beyond. They return false (or die in the case of
1067
+ C<utf8_to_uv_or_die>) if the code point requires more than S<C<e - s>> bytes to
1068
+ represent.
1062
1069
1063
1070
The functions differ only in what flavor of UTF-8 they accept. All reject
1064
1071
syntactically invalid UTF-8.
@@ -1070,17 +1077,19 @@ syntactically invalid UTF-8.
1070
1077
additionally rejects any UTF-8 that translates into a code point that isn't
1071
1078
specified by Unicode to be freely exchangeable, namely the surrogate characters
1072
1079
and non-character code points (besides non-Unicode code points, any above
1073
- 0x10FFFF). It does not raise a warning when rejecting.
1080
+ 0x10FFFF). It does not raise a warning when rejecting these.
1074
1081
1075
1082
=item * C<c9strict_utf8_to_uv>
1076
1083
1077
1084
instead uses the exchangeable definition given by Unicode's Corrigendum #9,
1078
1085
which accepts non-character code points while still rejecting surrogates. It
1079
- does not raise a warning when rejecting.
1086
+ does not raise a warning when rejecting these.
1080
1087
1081
1088
=item * C<utf8_to_uv>
1082
1089
1083
- accepts all syntactically valid UTF-8, as extended by Perl to allow 64-bit code
1090
+ =item * C<utf8_to_uv_or_die>
1091
+
1092
+ accept all syntactically valid UTF-8, as extended by Perl to allow 64-bit code
1084
1093
points to be encoded.
1085
1094
1086
1095
C<extended_utf8_to_uv> is merely a synonym for C<utf8_to_uv>. Use this form
@@ -1100,11 +1109,6 @@ sequence. You can use that function or C<L</utf8_to_uv_flags>> to exert more
1100
1109
control over the input that is considered acceptable, and the warnings that are
1101
1110
raised.
1102
1111
1103
- C<utf8_to_uv_or_die> has a simpler interface, for use when any errors are
1104
- fatal. It returns the code point instead of using an output parameter, and
1105
- throws an exception with any errors found where the other functions here would
1106
- have returned false.
1107
-
1108
1112
Often, C<s> is an arbitrarily long string containing the UTF-8 representations
1109
1113
of many code points in a row, and these functions are called in the course of
1110
1114
parsing C<s> to find all those code points.
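
The parsing loop described above can be sketched in plain C. This is only an illustration under stated assumptions: C<demo_utf8_to_uv> and C<demo_count> are hypothetical stand-ins written for this example, not the Perl API, and the decoder is deliberately simplified (ASCII platforms only, sequences of at most 4 bytes, no surrogate or range checks). What it is meant to show is the contract: on failure the caller still gets the REPLACEMENT CHARACTER and an C<advance> that points at the next place a character could begin, and the loop always resumes from there.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_REPLACEMENT 0xFFFDUL

/* Smallest code point that legitimately needs a sequence of each length;
 * anything smaller encoded at that length is an overlong. */
static const unsigned long demo_min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };

/* Hypothetical stand-in, NOT the Perl API.  Decodes one code point from the
 * bytes in [s, e), never reading the byte at e or beyond. */
static bool demo_utf8_to_uv(const unsigned char *s, const unsigned char *e,
                            unsigned long *cp, size_t *advance)
{
    size_t need, i;
    unsigned long v;

    *cp = DEMO_REPLACEMENT;              /* assume failure until proven OK */
    if (s >= e) { *advance = 0; return false; }

    if (s[0] < 0x80) { *cp = s[0]; *advance = 1; return true; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; }
    else { *advance = 1; return false; } /* stray continuation or bad byte */

    for (i = 1; i < need; i++) {
        /* Truncated by e, or the next byte is not a continuation byte:
         * the next character could begin at s + i. */
        if (s + i >= e || (s[i] & 0xC0) != 0x80) {
            *advance = i;
            return false;
        }
        v = (v << 6) | (unsigned long)(s[i] & 0x3F);
    }

    *advance = need;
    if (v < demo_min_for_len[need]) return false;   /* overlong */
    *cp = v;
    return true;
}

/* Walks a buffer, always resuming at s + advance, including after failures.
 * Returns the number of decode steps; *n_bad counts the failed ones. */
static size_t demo_count(const unsigned char *s, size_t len, size_t *n_bad)
{
    const unsigned char *e = s + len;
    size_t n = 0;

    *n_bad = 0;
    while (s < e) {
        unsigned long cp;
        size_t advance;

        if (!demo_utf8_to_uv(s, e, &cp, &advance))
            (*n_bad)++;
        n++;
        s += advance;   /* at least 1 whenever s < e, so the loop ends */
    }
    return n;
}
```

Calling C<demo_count> on the bytes C<0xE2 'a'> takes two steps: the incomplete sequence fails with an advance of 1, and the C<'a'> that follows is then decoded normally rather than being skipped.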
@@ -1416,10 +1420,10 @@ bit set for each malformation the function found; 0 if none. The C<ALLOW>-type
1416
1420
flags are ignored when determining the content of this variable. That is, even
1417
1421
if you "allow" a particular malformation, if it is encountered, the
1418
1422
corresponding bit will be set to notify you that one was encountered.
1419
- The bits for malformations that are accepted by default aren't set unless the
1420
- flags passed to the function indicate that they should be rejected or warned
1421
- about when encountering them. These malformations are explicitly noted in the
1422
- list below along with the controlling flags.
1423
+ However, the bits for conditions that are accepted by default aren't set
1424
+ unless the flags passed to the function indicate that they should be
1425
+ rejected or warned about when encountering them. These are explicitly
1426
+ noted in the list below along with the controlling flags.
1423
1427
1424
1428
The bits returned in C<errors> and their meanings are:
1425
1429
@@ -1532,15 +1536,16 @@ be rejected or warned about.
1532
1536
If you don't care about the system's message text or warning categories, you
1533
1537
can customize error handling by calling one of the C<_error> functions, using
1534
1538
either of the flags C<UTF8_ALLOW_ANY> or C<UTF8_CHECK_ONLY> to suppress any
1535
- warnings, and then examine the C<*errors> return.
1539
+ warnings, and then examine the C<*errors> return. If you don't use those
1540
+ flags, warnings will be raised as usual.
1536
1541
1537
- But if you do care, use one of the functions with C<_msgs> in their names.
1538
- These allow you to completely customize error handling by suppressing any
1539
- warnings that would otherwise be raised; instead returning all needed
1542
+ But if you do care, instead use one of the functions with C<_msgs> in their
1543
+ names. These allow you to completely customize error handling by suppressing
1544
+ any warnings that would otherwise be raised; instead returning all relevant
1540
1545
information in a structure specified by an extra parameter, C<msgs>, a pointer
1541
1546
to a variable which has been declared to be an C<AV*>, and into which the
1542
- function creates a new AV to store information, described below, about all
1543
- the malformations that were encountered.
1547
+ function creates a new AV to store information, described below, about all the
1548
+ malformations that were encountered.
1544
1549
1545
1550
If the flag C<UTF8_CHECK_ONLY> is passed, this parameter is ignored.
1546
1551
Otherwise, when this parameter is set, the flags C<UTF8_DIE_IF_MALFORMED> and
@@ -1549,7 +1554,7 @@ C<UTF8_FORCE_WARN_IF_MALFORMED> are ignored.
1549
1554
What is considered a malformation is affected by C<flags>, the same as
1550
1555
described in C<L</utf8_to_uv_flags>>. No array element is generated for
1551
1556
malformations that are "allowed" by the input flags, in contrast to the
1552
- C<_error> functions .
1557
+ bitmap returned in a non-NULL C<*errors>.
1553
1558
1554
1559
Each element of the C<msgs> AV array is an anonymous hash with the following
1555
1560
three key-value pairs:
@@ -1558,12 +1563,18 @@ three key-value pairs:
1558
1563
1559
1564
=item C<text>
1560
1565
1561
- A C<SVpv> containing the text of any warning message that would have ordinarily
1562
- been generated. The function suppresses raising this warning itself.
1566
+ A C<SVpv> containing the text of the message about the problematic input.
1567
+ This text is identical to any warning that otherwise would have been raised if
1568
+ the appropriate warning categories were enabled.
1563
1569
1564
1570
=item C<warn_categories>
1565
1571
1566
- The warning category (or categories) for the message, packed into a C<SVuv>.
1572
+ This is 0 if the C<flags> parameter to the function would ordinarily not have
1573
+ caused the message to be output as a warning; otherwise it is the warning
1574
+ category (or categories) that would have been used to generate a warning for
1575
+ C<text>, packed into a C<SVuv>. For example, if C<flags> contains
1576
+ C<UTF8_DISALLOW_SURROGATE>, but not C<UTF8_WARN_SURROGATE>, this would be 0 if
1577
+ the input was a surrogate.
1567
1578
1568
1579
=item C<flag>
1569
1580