@@ -1003,6 +1003,7 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
  =for apidoc_item extended_utf8_to_uv
  =for apidoc_item strict_utf8_to_uv
  =for apidoc_item c9strict_utf8_to_uv
+ =for apidoc_item utf8_to_uv_or_die
  =for apidoc_item utf8_to_uvchr_buf
  =for apidoc_item utf8_to_uvchr

@@ -1099,6 +1100,11 @@ sequence. You can use that function or C<L</utf8_to_uv_flags>> to exert more
  control over the input that is considered acceptable, and the warnings that are
  raised.

+ C<utf8_to_uv_or_die> has a simpler interface, for use when any errors are
+ fatal. It returns the code point instead of using an output parameter, and
+ throws an exception with any errors found where the other functions here would
+ have returned false.
+
  Often, C<s> is an arbitrarily long string containing the UTF-8 representations
  of many code points in a row, and these functions are called in the course of
  parsing C<s> to find all those code points.
@@ -1107,8 +1113,8 @@ If your code doesn't know how to deal with illegal input, as would be typical
  of a low level routine, the loop could look like:

      while (s < e) {
-         UV cp;
          Size_t advance;
+         UV cp;
          (void) utf8_to_uv(s, e, &cp, &advance);
          <handle 'cp'>
          s += advance;
@@ -1118,11 +1124,24 @@ A REPLACEMENT CHARACTER will be inserted everywhere that malformed input
  occurs. Obviously, we aren't expecting such outcomes, but your code will be
  protected from attacks and many harmful effects that could otherwise occur.

+ If the situation is such that it would be a bug for the input to be invalid, a
+ somewhat simpler loop suffices:
+
+     while (s < e) {
+         Size_t advance;
+         UV cp = utf8_to_uv_or_die(s, e, &advance);
+         <handle 'cp'>
+         s += advance;
+     }
+
+ This will throw an exception on invalid input, so your code doesn't have to
+ concern itself with that possibility.
+
  If you do have a plan for handling malformed input, you could instead write:

      while (s < e) {
-         UV cp;
          Size_t advance;
+         UV cp;

          if (UNLIKELY(! utf8_to_uv(s, e, &cp, &advance))) {
              <bail out or convert to handleable>
@@ -1142,9 +1161,10 @@ attacks against such code; and it is extra work always, as the functions have
  already done the equivalent work and return the correct value in C<advance>,
  regardless of whether the input is well-formed or not.

- You must always pass a non-NULL pointer into which to store the (first) code
- point C<s> represents. If you don't care about this value, you should be using
- one of the C<L</isUTF8_CHAR>> functions instead.
+ Except with C<utf8_to_uv_or_die>, you must always pass a non-NULL pointer into
+ which to store the (first) code point C<s> represents. If you don't care about
+ this value, you should be using one of the C<L</isUTF8_CHAR>> functions
+ instead.

  =item C<utf8_to_uvchr> forms

@@ -1274,8 +1294,8 @@ This flag is ignored if C<UTF8_CHECK_ONLY> is also set.
  =item C<UTF8_WARN_SURROGATE>

  These reject and/or warn about UTF-8 sequences that represent surrogate
- characters. The warning categories C<utf8> and C<super> control if warnings
- are actually raised.
+ characters. The warning categories C<utf8> and C<non_unicode> control if
+ warnings are actually raised.

  =item C<UTF8_DISALLOW_NONCHAR>

@@ -1290,7 +1310,7 @@ are actually raised.
  =item C<UTF8_WARN_SUPER>

  These reject and/or warn about UTF-8 sequences that represent code points
- above 0x10FFFF. The warning categories C<utf8> and C<super> control if
+ above 0x10FFFF. The warning categories C<utf8> and C<non_unicode> control if
  warnings are actually raised.

  =item C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>
@@ -1324,7 +1344,8 @@ These reject and/or warn on encountering sequences that require Perl's
  extension to UTF-8 to represent them. These are all for code points above
  0x10FFFF, so these sequences are a subset of the ones controlled by SUPER or
  either of the illegal interchange sets of flags. The warning categories
- C<utf8>, C<super>, and C<portable> control if warnings are actually raised.
+ C<utf8>, C<non_unicode>, and C<portable> control if warnings are actually
+ raised.

  Perl predates Unicode, and earlier standards allowed for code points up through
  0x7FFF_FFFF (2**31 - 1). Perl, of course, would like you to be able to
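
The two loop shapes this patch documents can be tried outside the Perl core with a minimal standalone C sketch like the one below. Everything prefixed `demo_` is hypothetical and not part of Perl's API. `demo_utf8_to_uv` is a simplified strict decoder standing in for `utf8_to_uv` (unlike the real function, it rejects surrogates and code points above 0x10FFFF). `demo_utf8_to_uv_or_die` calls `exit()` where the real `utf8_to_uv_or_die` would throw a Perl exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical stand-in for utf8_to_uv(): decode the first code point in
 * [s, e).  On success, store it in *cp and return true.  On malformed
 * input, store U+FFFD in *cp and return false.  Whenever s < e, *advance
 * is set to at least 1, so a caller's loop always makes progress. */
static bool
demo_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                uint32_t *cp, size_t *advance)
{
    /* Smallest code point that legitimately needs each sequence length. */
    static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    assert(s <= e);
    *cp = REPLACEMENT_CHARACTER;
    *advance = 1;
    if (s == e) { *advance = 0; return false; }       /* empty input */
    if (s[0] < 0x80) { *cp = s[0]; return true; }     /* ASCII */

    if (s[0] >= 0xC2 && s[0] <= 0xDF)      { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0)        { len = 3; v = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; v = s[0] & 0x07; }
    else return false;   /* stray continuation or invalid start byte */

    for (i = 1; i < len; i++) {
        if ((size_t)(e - s) <= i || (s[i] & 0xC0) != 0x80) {
            *advance = i;          /* skip only the bytes examined so far */
            return false;
        }
        v = (v << 6) | (s[i] & 0x3F);
    }
    *advance = len;
    if (v < min_for_len[len] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return false;              /* overlong, too large, or surrogate */
    *cp = v;
    return true;
}

/* Hypothetical stand-in for utf8_to_uv_or_die(): return the code point
 * directly and terminate the program on any malformation. */
static uint32_t
demo_utf8_to_uv_or_die(const uint8_t *s, const uint8_t *e, size_t *advance)
{
    uint32_t cp;
    if (!demo_utf8_to_uv(s, e, &cp, advance)) {
        fprintf(stderr, "malformed UTF-8\n");
        exit(EXIT_FAILURE);
    }
    return cp;
}

/* Loop shape 1: ignore the return value; malformed input becomes U+FFFD.
 * Counts how many decoded code points are U+FFFD. */
static size_t
count_replacements(const uint8_t *s, const uint8_t *e)
{
    size_t n = 0;
    while (s < e) {
        size_t advance;
        uint32_t cp;
        (void) demo_utf8_to_uv(s, e, &cp, &advance);
        if (cp == REPLACEMENT_CHARACTER) n++;
        s += advance;
    }
    return n;
}

/* Loop shape 2: invalid input is treated as a bug, so no error branch. */
static size_t
count_code_points_or_die(const uint8_t *s, const uint8_t *e)
{
    size_t n = 0;
    while (s < e) {
        size_t advance;
        uint32_t cp = demo_utf8_to_uv_or_die(s, e, &advance);
        (void) cp;                 /* <handle 'cp'> would go here */
        n++;
        s += advance;
    }
    return n;
}
```

In this sketch, the bytes `61 C3 62 80` make `count_replacements` report two replacement characters (the truncated `C3` sequence and the stray `80`), whereas `count_code_points_or_die` would terminate on the same input. In both loops the pointer moves forward only by `advance`, which is the point the surrounding documentation makes about never computing the skip distance yourself.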