In <sref ref="[re.grammar]"/> paragraph 2:
</p>
<blockquote><p>
- <tt>basic_regex</tt> member functions shall not call any locale dependent C or C++ API, including the formatted
+ <tt>basic_regex</tt> member functions shall not call any locale dependent C or C++ API, including the formatted
string input functions. Instead they shall call the appropriate traits member function to achieve the required effect.
</p></blockquote>
<p>
- Yet, the required interface for a regular expression traits class (<sref ref="[re.req]"/>) does not appear to have
- any reliable method for determining whether a character as encoded for the locale associated with the traits
- instance is the same as a character represented by a <em>UnicodeEscapeSequence</em>, e.g., assuming a sane
+ Yet, the required interface for a regular expression traits class (<sref ref="[re.req]"/>) does not appear to have
+ any reliable method for determining whether a character as encoded for the locale associated with the traits
+ instance is the same as a character represented by a <em>UnicodeEscapeSequence</em>, e.g., assuming a sane
<tt>ru_RU.koi8r</tt> locale:
</p>
<blockquote><pre>
@@ -30,7 +30,7 @@ instance is the same as a character represented by a <em>UnicodeEscapeSequence</
const char data[] = "\xB3";
const char matchCyrillicCaptialLetterYo[] = R"(\u0401)";

- int main(void)
+ int main(void)
{
try {
std::regex myRegex;
@@ -57,6 +57,24 @@ The implementation I tried prints:
<p>
Which means that the character class matching worked, but not the matching to the <em>UnicodeEscapeSequence</em>.
</p>
+
+ <note>2024-10-03; Jonathan comments</note>
+ <p>
+ <code>std::basic_regex&lt;charT&gt;</code> only properly supports
+ matching single code units that fit in `charT`.
+ There's nothing in the spec that supports matching code points that
+ require multiple code units, let alone checking whether a character
+ in an arbitrary encoding corresponds to any given Unicode code point.
+ <sref ref="[re.grammar]"/> paragraph 12 appears to be an attempt to
+ allow implementations to fail to match here, but is insufficient.
+ When <code>is_unsigned_v&lt;char&gt;</code> is true, the CV of the
+ <i>UnicodeEscapeSequence</i> `"\u0080"` is not greater than `CHAR_MAX`,
+ but that doesn't help because U+0080 is encoded as two bytes in UTF-8.
+ Being able to represent `0x80` as `char` does not mean the CV can be
+ matched as a single `char`.
+ The API is unsuitable for Unicode-aware strings.
+ </p>
+
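The point about U+0080 can be illustrated with a minimal sketch. It is not part of the issue text. `encode_utf8` and `matches_n_code_units` are hypothetical helpers introduced only for this illustration, and the sketch assumes `char` strings holding UTF-8 with no line terminators in the subject string.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not a standard library function): encode one
// Unicode scalar value as a sequence of UTF-8 code units.
std::string encode_utf8(char32_t cp)
{
    // Reject values that are not Unicode scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: "." consumes exactly one code unit of the subject
// string, so a pattern of n dots matches a string (free of line
// terminators) only if it is exactly n code units long.
bool matches_n_code_units(const std::string& s, std::size_t n)
{
    const std::regex re(std::string(n, '.'));
    return std::regex_match(s, re);
}
```

Under these assumptions, `encode_utf8(0x80)` yields the two code units `0xC2 0x80`, so `matches_n_code_units(encode_utf8(0x80), 1)` is false while the same call with `2` is true. A single `char` holding `0x80` is therefore a different string from the UTF-8 encoding of U+0080, whether or not `0x80` is representable in `char`.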
</discussion>

<resolution>