@@ -1045,13 +1045,15 @@ these. Private use characters and those code points yet to be assigned to a
 particular character are never considered problematic. Additionally, most of
 the functions accept non-Unicode code points, those starting at 0x110000.
 
+There are two sets of these functions:
+
 =over 4
 
 =item C<utf8_to_uv> forms
 
 Almost all code should use only C<utf8_to_uv>, C<extended_utf8_to_uv>,
 C<strict_utf8_to_uv>, or C<c9strict_utf8_to_uv>. The other functions are
-either the problematic old form, or are for highly specialized uses.
+either the problematic old form, or are for specialized uses.
 
 These four functions each return C<true> if the sequence of bytes starting at
 C<s> form a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point.
@@ -1087,16 +1089,17 @@ instead uses the exchangeable definition given by Unicode's Corregendum #9,
 which accepts non-character code points while still rejecting surrogates. It
 does not raise a warning when rejecting.
 
-=item * C<extended_utf8_to_uv >
+=item * C<utf8_to_uv >
 
 accepts all syntactically valid UTF-8, as extended by Perl to allow 64-bit code
 points to be encoded.
 
-=back
+C<extended_utf8_to_uv> is merely a synonym for C<utf8_to_uv>. Use this form
+to draw attention to the fact that it accepts any code point. But since
+Perl programs traditionally do this by default, plain C<utf8_to_uv> is the form
+most often used.
 
-C<utf8_to_uv> is merely a synonym for C<extended_utf8_to_uv>, whose name
-explicitly indicates that it accepts Perl-extended UTF-8. Perl programs
-traditionally handle this by default.
+=back
 
 Whenever syntactically invalid input is rejected, an explanatory warning
 message is raised, unless C<utf8> warnings (or the appropriate subcategory) are
@@ -1234,18 +1237,20 @@ unlikely to be needed except for specialized purposes.
 C<utf8n_to_uvchr> is more like an extension of C<utf8_to_uvchr_buf>, but
 with fewer quirks, and a different method of specifying the bytes in C<s> it is
 allowed to examine. It has a C<curlen> parameter instead of an C<e> parameter,
-so the furthest byte in C<s> it can look at is S<C<s + curlen>>. Its return
-value is, like C<utf8_to_uvchr_buf>, ambiguous with respect to the NUL and
-REPLACEMENT characters, but the value of C<*retlen> can be relied on (except
-with the C<UTF8_CHECK_ONLY> flag described below) to know where the next
-possible character along C<s> starts, removing that quirk. Hence, you always
-should use C<*retlen> to determine where the next character in C<s> starts.
+so the furthest byte in C<s> it can look at is S<C<s + curlen - 1>>. Its
+return value is, like C<utf8_to_uvchr_buf>, ambiguous with respect to the NUL
+and REPLACEMENT characters, but the value of C<*retlen> can be relied on
+(except with the C<UTF8_CHECK_ONLY> flag described below) to know where the
+next possible character along C<s> starts, removing that quirk. Hence, you
+always should use C<*retlen> to determine where the next character in C<s>
+starts.
 
 These functions have an additional parameter, C<flags>, besides the ones in
 C<utf8_to_uv> and C<utf8_to_uvchr_buf>, which can be used to broaden or
 restrict what is acceptable UTF-8. C<flags> has the same meaning and behavior
 in both functions. When C<flags> is 0, these functions accept any
-syntactically valid Perl-extended-UTF-8 sequence.
+syntactically valid Perl-extended-UTF-8 sequence that doesn't overflow the
+platform's word size.
 
 There are flags that apply to accepting particular sequences, and flags that
 apply to raising warnings about encountering sequences. Each type is
@@ -1254,15 +1259,14 @@ or both reject and warn. Rejecting means that the sequence gets translated
 into the Unicode REPLACEMENT CHARACTER instead of what it was meant to
 represent.
 
-Even if a flag is passed that indicates warnings are desired; no warning will be
-raised if C<'utf8'> warnings (or the appropriate subcategory) are disabled at
-the point of the call.
+Unless otherwise stated below, warnings are subject to the C<utf8> warnings
+category being on.
 
 =over 4
 
 =item C<UTF8_CHECK_ONLY>
 
-This also suppresses any warnings. And it changes what is stored into
+This suppresses any warnings. And it changes what is stored into
 C<*retlen> with the C<uvchr> family of functions (for the worse). It is not
 likely to be of use to you. You can use C<UTF8_ALLOW_ANY> (described below) to
 also turn off warnings, and that flag doesn't adversely affect C<*retlen>.
@@ -1271,22 +1275,25 @@ also turn off warnings, and that flag doesn't adversely affect C<*retlen>.
 
 =item C<UTF8_WARN_SURROGATE>
 
-These disallow and/or warn about UTF-8 sequences that represent surrogate
-characters.
+These reject and/or warn about UTF-8 sequences that represent surrogate
+characters. The warning categories C<utf8> and C<super> control if warnings
+are actually raised.
 
 =item C<UTF8_DISALLOW_NONCHAR>
 
 =item C<UTF8_WARN_NONCHAR>
 
-These disallow and/or warn about UTF-8 sequences that represent non-character
-code points.
+These reject and/or warn about UTF-8 sequences that represent non-character
+code points. The warning categories C<utf8> and C<nonchar> control if warnings
+are actually raised.
 
 =item C<UTF8_DISALLOW_SUPER>
 
 =item C<UTF8_WARN_SUPER>
 
-These disallow and/or warn about UTF-8 sequences that represent code points
-above 0x10FFFF.
+These reject and/or warn about UTF-8 sequences that represent code points
+above 0x10FFFF. The warning categories C<utf8> and C<super> control if
+warnings are actually raised.
 
 =item C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>
 
@@ -1315,10 +1322,11 @@ L<perlunicode/Noncharacter code points>.
 
 =item C<UTF8_WARN_PERL_EXTENDED>
 
-These disallow and/or warn on encountering sequences that require Perl's
+These reject and/or warn on encountering sequences that require Perl's
 extension to UTF-8 to represent them. These are all for code points above
 0x10FFFF, so these sequences are a subset of the ones controlled by SUPER or
-either of the illegal interchange sets of flags.
+either of the illegal interchange sets of flags. The warning categories
+C<utf8>, C<super>, and C<portable> control if warnings are actually raised.
 
 Perl predates Unicode, and earlier standards allowed for code points up through
 0x7FFF_FFFF (2**31 - 1). Perl, of course, would like you to be able to
@@ -1354,8 +1362,8 @@ regardless of any of the flags.
 
 The only such flag that you would ever have any reason to use is
 C<UTF8_ALLOW_ANY> which applies to any of the syntactic malformations and
-overflow, except for empty input. The other flags are shown in the C<_GOT_>
-bits list in C<L</utf8_to_uv_msgs>>.
+overflow, except for empty input. The other flags are analogous to ones in
+the C<_GOT_> bits list in C<L</utf8_to_uv_msgs>>.
 
 =back
 
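The code point classes that the SURROGATE, NONCHAR, SUPER, and PERL_EXTENDED flag pairs in the documentation above distinguish can be sketched in standalone C. This is only an illustration of the ranges involved, not Perl's implementation: it calls no Perl API, the names `cp_class` and `classify_cp` are hypothetical, and treating everything above 0x7FFF_FFFF as needing Perl's extension is an assumption that holds for ASCII platforms only.

```c
#include <stdint.h>

/* Hypothetical classification of a code point into the classes that the
 * UTF8_DISALLOW_* / UTF8_WARN_* flag pairs are described as controlling. */
typedef enum {
    CP_ORDINARY,       /* none of the classes below */
    CP_SURROGATE,      /* U+D800 .. U+DFFF */
    CP_NONCHAR,        /* Unicode non-character code points */
    CP_SUPER,          /* above 0x10FFFF, but within 0x7FFF_FFFF */
    CP_PERL_EXTENDED   /* above 0x7FFF_FFFF (assumption: ASCII platform) */
} cp_class;

cp_class classify_cp(uint64_t cp)
{
    /* Perl-extended code points are a subset of the above-Unicode ones,
     * so test for them first and report the narrower class. */
    if (cp > UINT64_C(0x7FFFFFFF))
        return CP_PERL_EXTENDED;

    if (cp > UINT64_C(0x10FFFF))
        return CP_SUPER;

    if (cp >= 0xD800 && cp <= 0xDFFF)
        return CP_SURROGATE;

    /* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
     * every plane (U+xxFFFE and U+xxFFFF). */
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return CP_NONCHAR;

    return CP_ORDINARY;
}
```

A caller could use such a classifier to decide, after decoding, whether a code point falls in a class it wants to reject. The real functions make that decision during decoding, according to the flags passed.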