Commit c9b0d4c

Add comment to 2546
1 parent cd90a92 commit c9b0d4c

File tree

1 file changed

+22
-5
lines changed

xml/issue2546.xml

Lines changed: 22 additions & 5 deletions
@@ -13,13 +13,13 @@
 In <sref ref="[re.grammar]"/> paragraph 2:
 </p>
 <blockquote><p>
-<tt>basic_regex</tt> member functions shall not call any locale dependent C or C++ API, including the formatted
+<tt>basic_regex</tt> member functions shall not call any locale dependent C or C++ API, including the formatted
 string input functions. Instead they shall call the appropriate traits member function to achieve the required effect.
 </p></blockquote>
 <p>
-Yet, the required interface for a regular expression traits class (<sref ref="[re.req]"/>) does not appear to have
-any reliable method for determining whether a character as encoded for the locale associated with the traits
-instance is the same as a character represented by a <em>UnicodeEscapeSequence</em>, e.g., assuming a sane
+Yet, the required interface for a regular expression traits class (<sref ref="[re.req]"/>) does not appear to have
+any reliable method for determining whether a character as encoded for the locale associated with the traits
+instance is the same as a character represented by a <em>UnicodeEscapeSequence</em>, e.g., assuming a sane
 <tt>ru_RU.koi8r</tt> locale:
 </p>
 <blockquote><pre>
@@ -30,7 +30,7 @@ instance is the same as a character represented by a <em>UnicodeEscapeSequence</
 const char data[] = "\xB3";
 const char matchCyrillicCaptialLetterYo[] = R"(\u0401)";
 
-int main(void)
+int main(void)
 {
 try {
 std::regex myRegex;
@@ -57,6 +57,23 @@ The implementation I tried prints:
 <p>
 Which means that the character class matching worked, but not the matching to the <em>UnicodeEscapeSequence</em>.
 </p>
+
+<note>2024-10-03; Jonathan comments</note>
+<p>
+<code>std::basic_regex&lt;charT&gt;</code> only properly supports
+matching single code units that fit in `charT`.
+There's nothing in the spec that supports matching code points that
+require multiple code units.
+<sref ref="[re.grammar]"/> paragraph 12 appears to be an attempt to
+allow implementations to fail to match here, but is insufficient.
+When <code>is_unsigned_v&lt;char&gt;</code> is true, the CV of the
+<i>UnicodeEscapeSequence</i> `"\u0080"` is not greater than `CHAR_MAX`,
+but that doesn't help because U+0080 is encoded as two bytes in UTF-8.
+Being able to represent `0x80` as `char` does not mean the CV can be
+matched as a single `char`.
+The API is just not suitable for Unicode-aware strings.
+</p>
+
 </discussion>
 
 <resolution>
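
The point in the added comment, that a code point value fitting in `char` does not make it matchable as a single `char`, can be illustrated with a small sketch. The helpers `encode_utf8` and `is_single_utf8_code_unit` below are hypothetical names introduced only for this illustration; they are not part of `<regex>` or of the issue text. The sketch assumes the subject string is UTF-8 and that the input is a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical helper (not a standard library function): encode one Unicode
// code point as UTF-8. Assumes cp is a valid scalar value (at most 0x10FFFF
// and not a surrogate); no validation is performed.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // One code unit: the only range where the code point value and the byte agree.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two code units: lead byte 110xxxxx, continuation byte 10xxxxxx.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three code units.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four code units.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: true only if the code point occupies exactly one
// UTF-8 code unit, i.e. it could be compared against a single char.
bool is_single_utf8_code_unit(char32_t cp)
{
    return encode_utf8(cp).size() == 1;
}
```

Under these assumptions, `encode_utf8(0x80)` yields the two bytes `0xC2 0x80`, so a comparison against the single `char` value `0x80` cannot match even where `CHAR_MAX` is at least `0x80`. Likewise `encode_utf8(0x0401)` yields `0xD0 0x81`, whereas the KOI8-R example in the issue stores the same letter as the single byte `0xB3`.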