Commit 6e88530

perlapi: Fixups of pod for utf8_to_uv family
1 parent df4d5e8 commit 6e88530

1 file changed: +35 -27 lines changed

utf8.c

Lines changed: 35 additions & 27 deletions
@@ -1045,13 +1045,15 @@ these. Private use characters and those code points yet to be assigned to a
 particular character are never considered problematic. Additionally, most of
 the functions accept non-Unicode code points, those starting at 0x110000.
 
+There are two sets of these functions:
+
 =over 4
 
 =item C<utf8_to_uv> forms
 
 Almost all code should use only C<utf8_to_uv>, C<extended_utf8_to_uv>,
 C<strict_utf8_to_uv>, or C<c9strict_utf8_to_uv>. The other functions are
-either the problematic old form, or are for highly specialized uses.
+either the problematic old form, or are for specialized uses.
 
 These four functions each return C<true> if the sequence of bytes starting at
 C<s> form a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point.
@@ -1087,16 +1089,17 @@ instead uses the exchangeable definition given by Unicode's Corregendum #9,
 which accepts non-character code points while still rejecting surrogates. It
 does not raise a warning when rejecting.
 
-=item * C<extended_utf8_to_uv>
+=item * C<utf8_to_uv>
 
 accepts all syntactically valid UTF-8, as extended by Perl to allow 64-bit code
 points to be encoded.
 
-=back
+C<extended_utf8_to_uv> is merely a synonym for C<utf8_to_uv>. Use this form
+to draw attention to the fact that it accepts any code point. But since
+Perl programs traditionally do this by default, plain C<utf8_to_uv> is the form
+most often used.
 
-C<utf8_to_uv> is merely a synonym for C<extended_utf8_to_uv>, whose name
-explicitly indicates that it accepts Perl-extended UTF-8. Perl programs
-traditionally handle this by default.
+=back
 
 Whenever syntactically invalid input is rejected, an explanatory warning
 message is raised, unless C<utf8> warnings (or the appropriate subcategory) are
@@ -1234,18 +1237,20 @@ unlikely to be needed except for specialized purposes.
 C<utf8n_to_uvchr> is more like an extension of C<utf8_to_uvchr_buf>, but
 with fewer quirks, and a different method of specifying the bytes in C<s> it is
 allowed to examine. It has a C<curlen> parameter instead of an C<e> parameter,
-so the furthest byte in C<s> it can look at is S<C<s + curlen>>. Its return
-value is, like C<utf8_to_uvchr_buf>, ambiguous with respect to the NUL and
-REPLACEMENT characters, but the value of C<*retlen> can be relied on (except
-with the C<UTF8_CHECK_ONLY> flag described below) to know where the next
-possible character along C<s> starts, removing that quirk. Hence, you always
-should use C<*retlen> to determine where the next character in C<s> starts.
+so the furthest byte in C<s> it can look at is S<C<s + curlen - 1>>. Its
+return value is, like C<utf8_to_uvchr_buf>, ambiguous with respect to the NUL
+and REPLACEMENT characters, but the value of C<*retlen> can be relied on
+(except with the C<UTF8_CHECK_ONLY> flag described below) to know where the
+next possible character along C<s> starts, removing that quirk. Hence, you
+always should use C<*retlen> to determine where the next character in C<s>
+starts.
 
 These functions have an additional parameter, C<flags>, besides the ones in
 C<utf8_to_uv> and C<utf8_to_uvchr_buf>, which can be used to broaden or
 restrict what is acceptable UTF-8. C<flags> has the same meaning and behavior
 in both functions. When C<flags> is 0, these functions accept any
-syntactically valid Perl-extended-UTF-8 sequence.
+syntactically valid Perl-extended-UTF-8 sequence that doesn't overflow the
+platform's word size.
 
 There are flags that apply to accepting particular sequences, and flags that
 apply to raising warnings about encountering sequences. Each type is
@@ -1254,15 +1259,14 @@ or both reject and warn. Rejecting means that the sequence gets translated
 into the Unicode REPLACEMENT CHARACTER instead of what it was meant to
 represent.
 
-Even if a flag is passed that indicates warnings are desired; no warning will be
-raised if C<'utf8'> warnings (or the appropriate subcategory) are disabled at
-the point of the call.
+Unless otherwise stated below, warnings are subject to the C<utf8> warnings
+category being on.
 
 =over 4
 
 =item C<UTF8_CHECK_ONLY>
 
-This also suppresses any warnings. And it changes what is stored into
+This suppresses any warnings. And it changes what is stored into
 C<*retlen> with the C<uvchr> family of functions (for the worse). It is not
 likely to be of use to you. You can use C<UTF8_ALLOW_ANY> (described below) to
 also turn off warnings, and that flag doesn't adversely affect C<*retlen>.
@@ -1271,22 +1275,25 @@ also turn off warnings, and that flag doesn't adversely affect C<*retlen>.
 
 =item C<UTF8_WARN_SURROGATE>
 
-These disallow and/or warn about UTF-8 sequences that represent surrogate
-characters.
+These reject and/or warn about UTF-8 sequences that represent surrogate
+characters. The warning categories C<utf8> and C<super> control if warnings
+are actually raised.
 
 =item C<UTF8_DISALLOW_NONCHAR>
 
 =item C<UTF8_WARN_NONCHAR>
 
-These disallow and/or warn about UTF-8 sequences that represent non-character
-code points.
+These reject and/or warn about UTF-8 sequences that represent non-character
+code points. The warning categories C<utf8> and C<nonchar> control if warnings
+are actually raised.
 
 =item C<UTF8_DISALLOW_SUPER>
 
 =item C<UTF8_WARN_SUPER>
 
-These disallow and/or warn about UTF-8 sequences that represent code points
-above 0x10FFFF.
+These reject and/or warn about UTF-8 sequences that represent code points
+above 0x10FFFF. The warning categories C<utf8> and C<super> control if
+warnings are actually raised.
 
 =item C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>
 
@@ -1315,10 +1322,11 @@ L<perlunicode/Noncharacter code points>.
 
 =item C<UTF8_WARN_PERL_EXTENDED>
 
-These disallow and/or warn on encountering sequences that require Perl's
+These reject and/or warn on encountering sequences that require Perl's
 extension to UTF-8 to represent them. These are all for code points above
 0x10FFFF, so these sequences are a subset of the ones controlled by SUPER or
-either of the illegal interchange sets of flags.
+either of the illegal interchange sets of flags. The warning categories
+C<utf8>, C<super>, and C<portable> control if warnings are actually raised.
 
 Perl predates Unicode, and earlier standards allowed for code points up through
 0x7FFF_FFFF (2**31 - 1). Perl, of course, would like you to be able to
@@ -1354,8 +1362,8 @@ regardless of any of the flags.
 
 The only such flag that you would ever have any reason to use is
 C<UTF8_ALLOW_ANY> which applies to any of the syntactic malformations and
-overflow, except for empty input. The other flags are shown in the C<_GOT_>
-bits list in C<L</utf8_to_uv_msgs>>.
+overflow, except for empty input. The other flags are analogous to ones in
+the C<_GOT_> bits list in C<L</utf8_to_uv_msgs>>.
 
 =back
 
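
Illustrative note on the C<utf8n_to_uvchr> hunk: the corrected text says the last byte that may be examined is `s + curlen - 1`, and that callers should advance by `*retlen` because the return value alone is ambiguous. The following is a minimal standalone sketch of those two points, not Perl's implementation. `toy_utf8n_decode` and `TOY_REPLACEMENT` are hypothetical names invented here; the sketch handles only standard 1 to 4 byte UTF-8 and ignores flags, warnings, and Perl-extended sequences.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_REPLACEMENT 0xFFFDu /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical helper, NOT part of the Perl API.  Decodes one code point
 * from the curlen bytes starting at s, never reading past s[curlen - 1].
 * Returns the code point, or TOY_REPLACEMENT for malformed input.  Either
 * way, *retlen receives the number of bytes to skip before the next
 * character (0 only when the input is empty). */
static uint32_t
toy_utf8n_decode(const unsigned char *s, size_t curlen, size_t *retlen)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t need, i;

    if (curlen == 0) { *retlen = 0; return TOY_REPLACEMENT; }

    if      (s[0] < 0x80)           { *retlen = 1; return s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else                            { *retlen = 1; return TOY_REPLACEMENT; }

    for (i = 1; i < need; i++) {
        /* Checking i against curlen first keeps every read at or
         * before s[curlen - 1]. */
        if (i >= curlen || (s[i] & 0xC0) != 0x80) {
            *retlen = i;            /* resume at the offending position */
            return TOY_REPLACEMENT;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *retlen = need;
    if (cp < min_cp[need])          /* overlong encoding */
        return TOY_REPLACEMENT;
    return cp;
}
```

In this sketch, the bytes `E2 82 AC` with `curlen` 3 decode to 0x20AC with a skip length of 3. The same bytes with `curlen` 2 return the replacement value with a skip length of 2, which is the same return value a genuine U+FFFD (`EF BF BD`) produces, so only the skip length tells the caller where to resume.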
