Commit 304e974

[Clang] Correctly handle $, @, and ` when represented as UCN
This covers
 * P2558R2 (C++, wg21.link/P2558)
 * N2701 (C, https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2701.htm)
 * N3124 (C, https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3124.pdf)

This patch
 * Disallows representing $, @, or ` as a UCN outside character and string literals in all language modes. This did not work properly (see GH62133) and is made ill-formed in C++ and C by P2558 and N3124 respectively.
 * Allows a UCN for any character in C2x, in string and character literals.

Fixes llvm#62133

Reviewed By: #clang-language-wg, tahonermann

Differential Revision: https://reviews.llvm.org/D153621
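As a rough illustration of the user-visible behaviour described above, the following sketch mirrors the literal assertions this commit adds to clang/test/Lexer/char-literal.cpp. It assumes C++11 or later and an ASCII-compatible execution character set; `kPunct` and `checkPunctuationAtRuntime` are hypothetical names introduced only for this example.

```cpp
#include <cassert>

// $, @ and ` written as UCNs inside character literals are accepted.
static_assert('\u0024' == '$', "U+0024 is DOLLAR SIGN");
static_assert('\u0040' == '@', "U+0040 is COMMERCIAL AT");
static_assert('\u0060' == '`', "U+0060 is GRAVE ACCENT");

// The same escapes inside a string literal (hypothetical name kPunct).
constexpr char kPunct[] = "\u0024\u0040\u0060";
static_assert(sizeof(kPunct) == 4, "three characters plus the terminator");

// Outside a literal the spelling is rejected after this change, e.g.
//   int GH62133_a\u0024;  // error: character '$' cannot be specified by a
//                         //        universal character name

// Hypothetical helper so the values can also be checked at run time.
inline void checkPunctuationAtRuntime() {
  assert(kPunct[0] == '$');
  assert(kPunct[1] == '@');
  assert(kPunct[2] == '`');
}
```

The commented-out declaration corresponds to the `GH62133_a` case in clang/test/Preprocessor/ucn-allowed-chars.c, which is now expected to produce an error.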
1 parent 20ae2d2 commit 304e974

8 files changed

+105
-41
lines changed

clang/docs/ReleaseNotes.rst

Lines changed: 6 additions & 0 deletions
@@ -205,6 +205,9 @@ C2x Feature Support
 
     bool b = nullptr; // Was incorrectly rejected by Clang, is now accepted.
 
+- Implemented `WG14 N3124 <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3124.pdf>`_,
+  which allows any universal character name to appear in character and string literals.
+
 
 Non-comprehensive list of changes in this release
 -------------------------------------------------
@@ -585,6 +588,9 @@ Bug Fixes in This Version
 - Correctly diagnose jumps into statement expressions.
   This ensures the behavior of Clang is consistent with GCC.
   (`#63682 <https://github.com/llvm/llvm-project/issues/63682>`_)
+  (`#38717 <https://github.com/llvm/llvm-project/issues/38717>`_).
+- Fix an assertion when using ``\u0024`` (``$``) as an identifier, by disallowing
+  that construct (`#62133 <https://github.com/llvm/llvm-project/issues/62133>`_).
 
 Bug Fixes to Compiler Builtins
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

clang/include/clang/Basic/DiagnosticLexKinds.td

Lines changed: 8 additions & 0 deletions
@@ -197,6 +197,14 @@ def warn_cxx98_compat_literal_ucn_escape_basic_scs : Warning<
 def warn_cxx98_compat_literal_ucn_control_character : Warning<
   "universal character name referring to a control character "
   "is incompatible with C++98">, InGroup<CXX98Compat>, DefaultIgnore;
+def warn_c2x_compat_literal_ucn_escape_basic_scs : Warning<
+  "specifying character '%0' with a universal character name is "
+  "incompatible with C standards before C2x">,
+  InGroup<CPre2xCompat>, DefaultIgnore;
+def warn_c2x_compat_literal_ucn_control_character : Warning<
+  "universal character name referring to a control character "
+  "is incompatible with C standards before C2x">,
+  InGroup<CPre2xCompat>, DefaultIgnore;
 def warn_ucn_not_valid_in_c89 : Warning<
   "universal character names are only valid in C99 or C++; "
   "treating as '\\' followed by identifier">, InGroup<Unicode>;

clang/lib/Lex/Lexer.cpp

Lines changed: 8 additions & 6 deletions
@@ -3484,9 +3484,14 @@ uint32_t Lexer::tryReadUCN(const char *&StartPtr, const char *SlashLoc,
   if (LangOpts.AsmPreprocessor)
     return CodePoint;
 
-  // C99 6.4.3p2: A universal character name shall not specify a character whose
-  // short identifier is less than 00A0 other than 0024 ($), 0040 (@), or
-  // 0060 (`), nor one in the range D800 through DFFF inclusive.)
+  // C2x 6.4.3p2: A universal character name shall not designate a code point
+  // where the hexadecimal value is:
+  // - in the range D800 through DFFF inclusive; or
+  // - greater than 10FFFF.
+  // A universal-character-name outside the c-char-sequence of a character
+  // constant, or the s-char-sequence of a string-literal shall not designate
+  // a control character or a character in the basic character set.
+
   // C++11 [lex.charset]p2: If the hexadecimal value for a
   // universal-character-name corresponds to a surrogate code point (in the
   // range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally,
@@ -3496,9 +3501,6 @@ uint32_t Lexer::tryReadUCN(const char *&StartPtr, const char *SlashLoc,
   // ranges 0x00-0x1F or 0x7F-0x9F, both inclusive) or to a character in the
   // basic source character set, the program is ill-formed.
   if (CodePoint < 0xA0) {
-    if (CodePoint == 0x24 || CodePoint == 0x40 || CodePoint == 0x60)
-      return CodePoint;
-
     // We don't use isLexingRawMode() here because we need to warn about bad
     // UCNs even when skipping preprocessing tokens in a #if block.
     if (Result && PP) {
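The rule quoted in the new comment can be sketched as a small predicate. This is a simplified illustration, not Clang's implementation: `isUcnAllowedOutsideLiteral` and `selfTestOutsideLiteral` are hypothetical names, and the sketch approximates "control character or basic character set" with the same `< 0xA0` threshold the lexer code uses. It also ignores the separate question of whether a code point is valid in an identifier.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of Clang: returns true if `cp` may be spelled
// as a UCN outside character and string literals. After this change there is
// no longer an exception for 0x24 ($), 0x40 (@) or 0x60 (`).
constexpr bool isUcnAllowedOutsideLiteral(std::uint32_t cp) {
  return cp >= 0xA0                         // not a control or basic character
         && !(cp >= 0xD800 && cp <= 0xDFFF) // not a surrogate code point
         && cp <= 0x10FFFF;                 // within the Unicode code space
}

// Small self-check covering the cases discussed in the diff.
inline void selfTestOutsideLiteral() {
  assert(!isUcnAllowedOutsideLiteral(0x24));     // '$' is now rejected
  assert(!isUcnAllowedOutsideLiteral(0x40));     // '@' is now rejected
  assert(!isUcnAllowedOutsideLiteral(0x60));     // '`' is now rejected
  assert(isUcnAllowedOutsideLiteral(0x00E9));    // U+00E9 passes this check
  assert(!isUcnAllowedOutsideLiteral(0xD800));   // surrogate
  assert(!isUcnAllowedOutsideLiteral(0x110000)); // beyond U+10FFFF
}
```

Before the change, the three removed lines made the `< 0xA0` branch return early for `$`, `@` and `` ` ``, which is the path that led to the assertion reported in GH62133.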

clang/lib/Lex/LiteralSupport.cpp

Lines changed: 14 additions & 8 deletions
@@ -640,22 +640,28 @@ static bool ProcessUCNEscape(const char *ThisTokBegin, const char *&ThisTokBuf,
     return false;
   }
 
-  // C++11 allows UCNs that refer to control characters and basic source
-  // characters inside character and string literals
+  // C2x and C++11 allow UCNs that refer to control characters
+  // and basic source characters inside character and string literals
   if (UcnVal < 0xa0 &&
-      (UcnVal != 0x24 && UcnVal != 0x40 && UcnVal != 0x60)) { // $, @, `
-    bool IsError = (!Features.CPlusPlus11 || !in_char_string_literal);
+      // $, @, ` are allowed in all language modes
+      (UcnVal != 0x24 && UcnVal != 0x40 && UcnVal != 0x60)) {
+    bool IsError =
+        (!(Features.CPlusPlus11 || Features.C2x) || !in_char_string_literal);
     if (Diags) {
       char BasicSCSChar = UcnVal;
       if (UcnVal >= 0x20 && UcnVal < 0x7f)
         Diag(Diags, Features, Loc, ThisTokBegin, UcnBegin, ThisTokBuf,
-             IsError ? diag::err_ucn_escape_basic_scs :
-             diag::warn_cxx98_compat_literal_ucn_escape_basic_scs)
+             IsError ? diag::err_ucn_escape_basic_scs
+             : Features.CPlusPlus
+                 ? diag::warn_cxx98_compat_literal_ucn_escape_basic_scs
+                 : diag::warn_c2x_compat_literal_ucn_escape_basic_scs)
             << StringRef(&BasicSCSChar, 1);
       else
         Diag(Diags, Features, Loc, ThisTokBegin, UcnBegin, ThisTokBuf,
-             IsError ? diag::err_ucn_control_character :
-             diag::warn_cxx98_compat_literal_ucn_control_character);
+             IsError ? diag::err_ucn_control_character
+             : Features.CPlusPlus
+                 ? diag::warn_cxx98_compat_literal_ucn_control_character
+                 : diag::warn_c2x_compat_literal_ucn_control_character);
     }
     if (IsError)
       return false;
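The nested conditionals above are easier to follow when written as a standalone function. The sketch below is an assumption-laden restatement of the diagnostic selection only: `UcnDiag`, `LangFeatures` and `selectLiteralUcnDiag` are hypothetical names that merely echo the diagnostic IDs in the diff, and the real code emits diagnostics rather than returning a value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the diagnostic IDs used in the diff.
enum class UcnDiag {
  None,
  ErrEscapeBasicScs,
  ErrControlCharacter,
  WarnCxx98CompatBasicScs,
  WarnCxx98CompatControl,
  WarnC2xCompatBasicScs,
  WarnC2xCompatControl
};

// Hypothetical subset of the language options consulted by the code.
struct LangFeatures {
  bool CPlusPlus;
  bool CPlusPlus11;
  bool C2x;
};

inline UcnDiag selectLiteralUcnDiag(std::uint32_t ucnVal,
                                    const LangFeatures &f,
                                    bool inCharStringLiteral) {
  // $, @ and ` never reach the diagnostic path here.
  const bool isExempt = ucnVal == 0x24 || ucnVal == 0x40 || ucnVal == 0x60;
  if (ucnVal >= 0xa0 || isExempt)
    return UcnDiag::None;

  // Only C++11 and C2x accept these UCNs, and only inside literals.
  const bool isError = !(f.CPlusPlus11 || f.C2x) || !inCharStringLiteral;
  // Printable basic characters versus control characters.
  const bool isBasic = ucnVal >= 0x20 && ucnVal < 0x7f;

  if (isError)
    return isBasic ? UcnDiag::ErrEscapeBasicScs : UcnDiag::ErrControlCharacter;
  if (f.CPlusPlus)
    return isBasic ? UcnDiag::WarnCxx98CompatBasicScs
                   : UcnDiag::WarnCxx98CompatControl;
  return isBasic ? UcnDiag::WarnC2xCompatBasicScs
                 : UcnDiag::WarnC2xCompatControl;
}

// Self-check for a few language modes.
inline void selfTestLiteralDiag() {
  const LangFeatures c11{false, false, false};
  const LangFeatures c2x{false, false, true};
  const LangFeatures cxx11{true, true, false};
  assert(selectLiteralUcnDiag(0x61, c11, true) == UcnDiag::ErrEscapeBasicScs);
  assert(selectLiteralUcnDiag(0x61, c2x, true) ==
         UcnDiag::WarnC2xCompatBasicScs);
  assert(selectLiteralUcnDiag(0x00, cxx11, true) ==
         UcnDiag::WarnCxx98CompatControl);
  assert(selectLiteralUcnDiag(0x24, c11, true) == UcnDiag::None);
}
```

In this sketch the C11 result for `'\u0061'` lines up with the `c11-error` expectation added to char-literal.cpp, while the C2x result corresponds to the new compatibility warning, which the .td file marks as `DefaultIgnore`.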

clang/test/Lexer/char-literal.cpp

Lines changed: 27 additions & 13 deletions
@@ -1,8 +1,9 @@
+// RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c++03 -Wfour-char-constants -fsyntax-only -verify=cxx03,expected %s
 // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c++11 -Wfour-char-constants -fsyntax-only -verify=cxx,expected %s
 // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c++17 -Wfour-char-constants -fsyntax-only -verify=cxx,expected %s
 // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c++20 -Wfour-char-constants -fsyntax-only -verify=cxx,expected %s
-// RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c11 -x c -Wfour-char-constants -fsyntax-only -verify=c,expected %s
-// RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c2x -x c -Wfour-char-constants -fsyntax-only -verify=c,expected %s
+// RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c11 -x c -Wfour-char-constants -fsyntax-only -verify=c11,expected %s
+// RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c2x -x c -Wfour-char-constants -fsyntax-only -verify=c2x,expected %s
 
 #ifndef __cplusplus
 typedef __WCHAR_TYPE__ wchar_t;
@@ -17,6 +18,7 @@ int c = 'APPS'; // expected-warning {{multi-character character constant}}
 char d = ''; // expected-error {{character too large for enclosing character literal type}}
 char e = '\u2318'; // expected-error {{character too large for enclosing character literal type}}
 
+#if !defined(__cplusplus) || __cplusplus > 201100L
 #ifdef __cplusplus
 auto f = '\xE2\x8C\x98'; // expected-warning {{multi-character character constant}}
 #endif
@@ -44,18 +46,19 @@ char16_t q[2] = u"\U00010000";
 
 // UTF-8 character literal code point ranges.
 #if __cplusplus >= 201703L || __STDC_VERSION__ >= 201710L
-_Static_assert(u8'\U00000000' == 0x00, ""); // c-error {{universal character name refers to a control character}}
-_Static_assert(u8'\U0000007F' == 0x7F, ""); // c-error {{universal character name refers to a control character}}
-_Static_assert(u8'\U00000080', ""); // c-error {{universal character name refers to a control character}}
+_Static_assert(u8'\U00000000' == 0x00, ""); // c11-error {{universal character name refers to a control character}}
+_Static_assert(u8'\U0000007F' == 0x7F, ""); // c11-error {{universal character name refers to a control character}}
+_Static_assert(u8'\U00000080', ""); // c11-error {{universal character name refers to a control character}}
 // cxx-error@-1 {{character too large for enclosing character literal type}}
+// c2x-error@-2 {{character too large for enclosing character literal type}}
 _Static_assert((unsigned char)u8'\xFF' == (unsigned char)0xFF, "");
 #endif
 
 // UTF-8 string literal code point ranges.
-_Static_assert(u8"\U00000000"[0] == 0x00, ""); // c-error {{universal character name refers to a control character}}
-_Static_assert(u8"\U0000007F"[0] == 0x7F, ""); // c-error {{universal character name refers to a control character}}
-_Static_assert((unsigned char)u8"\U00000080"[0] == (unsigned char)0xC2, ""); // c-error {{universal character name refers to a control character}}
-_Static_assert((unsigned char)u8"\U00000080"[1] == (unsigned char)0x80, ""); // c-error {{universal character name refers to a control character}}
+_Static_assert(u8"\U00000000"[0] == 0x00, ""); // c11-error {{universal character name refers to a control character}}
+_Static_assert(u8"\U0000007F"[0] == 0x7F, ""); // c11-error {{universal character name refers to a control character}}
+_Static_assert((unsigned char)u8"\U00000080"[0] == (unsigned char)0xC2, ""); // c11-error {{universal character name refers to a control character}}
+_Static_assert((unsigned char)u8"\U00000080"[1] == (unsigned char)0x80, ""); // c11-error {{universal character name refers to a control character}}
 _Static_assert((unsigned char)u8"\U000007FF"[0] == (unsigned char)0xDF, "");
 _Static_assert((unsigned char)u8"\U000007FF"[1] == (unsigned char)0xBF, "");
 _Static_assert((unsigned char)u8"\U00000800"[0] == (unsigned char)0xE0, "");
@@ -84,14 +87,14 @@ _Static_assert(u8"\U00110000"[0], ""); // expected-error {{invalid universal cha
 #endif
 
 // UTF-16 character literal code point ranges.
-_Static_assert(u'\U00000000' == 0x0000, ""); // c-error {{universal character name refers to a control character}}
+_Static_assert(u'\U00000000' == 0x0000, ""); // c11-error {{universal character name refers to a control character}}
 _Static_assert(u'\U0000D800', ""); // expected-error {{invalid universal character}}
 _Static_assert(u'\U0000DFFF', ""); // expected-error {{invalid universal character}}
 _Static_assert(u'\U0000FFFF' == 0xFFFF, "");
 _Static_assert(u'\U00010000', ""); // expected-error {{character too large for enclosing character literal type}}
 
 // UTF-16 string literal code point ranges.
-_Static_assert(u"\U00000000"[0] == 0x0000, ""); // c-error {{universal character name refers to a control character}}
+_Static_assert(u"\U00000000"[0] == 0x0000, ""); // c11-error {{universal character name refers to a control character}}
 _Static_assert(u"\U0000D800"[0], ""); // expected-error {{invalid universal character}}
 _Static_assert(u"\U0000DFFF"[0], ""); // expected-error {{invalid universal character}}
 _Static_assert(u"\U0000FFFF"[0] == 0xFFFF, "");
@@ -109,13 +112,24 @@ _Static_assert(u"\U00110000"[0], ""); // expected-error {{invalid universal char
 #endif
 
 // UTF-32 character literal code point ranges.
-_Static_assert(U'\U00000000' == 0x00000000, ""); // c-error {{universal character name refers to a control character}}
+_Static_assert(U'\U00000000' == 0x00000000, ""); // c11-error {{universal character name refers to a control character}}
 _Static_assert(U'\U0010FFFF' == 0x0010FFFF, "");
 _Static_assert(U'\U00110000', ""); // expected-error {{invalid universal character}}
 
 // UTF-32 string literal code point ranges.
-_Static_assert(U"\U00000000"[0] == 0x00000000, ""); // c-error {{universal character name refers to a control character}}
+_Static_assert(U"\U00000000"[0] == 0x00000000, ""); // c11-error {{universal character name refers to a control character}}
 _Static_assert(U"\U0000D800"[0], ""); // expected-error {{invalid universal character}}
 _Static_assert(U"\U0000DFFF"[0], ""); // expected-error {{invalid universal character}}
 _Static_assert(U"\U0010FFFF"[0] == 0x0010FFFF, "");
 _Static_assert(U"\U00110000"[0], ""); // expected-error {{invalid universal character}}
+
+#endif // !defined(__cplusplus) || __cplusplus > 201100L
+
+_Static_assert('\u0024' == '$', "");
+_Static_assert('\u0040' == '@', "");
+_Static_assert('\u0060' == '`', "");
+
+_Static_assert('\u0061' == 'a', ""); // c11-error {{character 'a' cannot be specified by a universal character name}} \
+// cxx03-error {{character 'a' cannot be specified by a universal character name}}
+_Static_assert('\u0000' == '\0', ""); // c11-error {{universal character name refers to a control character}} \
+// cxx03-error {{universal character name refers to a control character}}

clang/test/Lexer/utf8-char-literal.cpp

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ char f = u8'ab'; // expected-error {{Unicode character literals may not contain
 #elif __STDC_VERSION__ >= 202000L
 char a = u8'ñ'; // expected-error {{character too large for enclosing character literal type}}
 char b = u8'\x80'; // ok
-char c = u8'\u0080'; // expected-error {{universal character name refers to a control character}}
+char c = u8'\u0000'; // ok
 char d = u8'\u1234'; // expected-error {{character too large for enclosing character literal type}}
 char e = u8''; // expected-error {{character too large for enclosing character literal type}}
 char f = u8'ab'; // expected-error {{Unicode character literals may not contain multiple characters}}

clang/test/Preprocessor/ucn-allowed-chars.c

Lines changed: 36 additions & 10 deletions
@@ -1,4 +1,5 @@
11
// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -verify
2+
// RUN: %clang_cc1 %s -fsyntax-only -std=c2x -Wc99-compat -verify
23
// RUN: %clang_cc1 %s -fsyntax-only -std=c11 -Wc99-compat -verify
34
// RUN: %clang_cc1 %s -fsyntax-only -x c++ -std=c++03 -Wc++11-compat -verify
45
// RUN: %clang_cc1 %s -fsyntax-only -x c++ -std=c++11 -Wc++98-compat -verify
@@ -13,7 +14,6 @@ extern char a\uFFFF; // none
1314

1415

1516

16-
1717
// Identifier initial characters
1818
extern char \u0E50; // C++03, C11, C++11
1919
extern char \u0300; // disallowed in C99/C++03
@@ -38,8 +38,8 @@ extern char \u0D61; // C99, C11, C++03, C++11
3838

3939

4040
#if __cplusplus
41-
// expected-error@9 {{character <U+0384> not allowed in an identifier}}
42-
// expected-error@11 {{character <U+FFFF> not allowed in an identifier}}
41+
// expected-error@10 {{character <U+0384> not allowed in an identifier}}
42+
// expected-error@12 {{character <U+FFFF> not allowed in an identifier}}
4343
// expected-error@18 {{expected unqualified-id}}
4444
# if __cplusplus >= 201103L
4545
// C++11
@@ -53,23 +53,49 @@ extern char \u0D61; // C99, C11, C++03, C++11
5353

5454
# endif
5555
#else
56-
# if __STDC_VERSION__ >= 201112L
56+
# if __STDC_VERSION__ >= 201800L
57+
// C2X
58+
// expected-warning@8 {{using this character in an identifier is incompatible with C99}}
59+
// expected-error@10 {{character <U+0384> not allowed in an identifier}}
60+
// expected-error@12 {{character <U+FFFF> not allowed in an identifier}}
61+
// expected-error@18 {{expected identifier}}
62+
// expected-error@19 {{expected identifier}}
63+
// expected-error@33 {{invalid universal character}}
64+
# elif __STDC_VERSION__ >= 201112L
5765
// C11
58-
// expected-warning@7 {{using this character in an identifier is incompatible with C99}}
59-
// expected-warning@9 {{using this character in an identifier is incompatible with C99}}
60-
// expected-error@11 {{character <U+FFFF> not allowed in an identifier}}
66+
// expected-warning@8 {{using this character in an identifier is incompatible with C99}}
67+
// expected-warning@10 {{using this character in an identifier is incompatible with C99}}
68+
// expected-error@12 {{character <U+FFFF> not allowed in an identifier}}
6169
// expected-warning@18 {{starting an identifier with this character is incompatible with C99}}
6270
// expected-error@19 {{expected identifier}}
6371
// expected-error@33 {{invalid universal character}}
6472

6573
# else
6674
// C99
67-
// expected-error@7 {{not allowed in an identifier}}
68-
// expected-error@9 {{not allowed in an identifier}}
69-
// expected-error@11 {{not allowed in an identifier}}
75+
// expected-error@8 {{not allowed in an identifier}}
76+
// expected-error@10 {{not allowed in an identifier}}
77+
// expected-error@12 {{not allowed in an identifier}}
7078
// expected-error@18 {{expected identifier}}
7179
// expected-error@19 {{expected identifier}}
7280
// expected-error@33 {{invalid universal character}}
7381

7482
# endif
7583
#endif
84+
85+
#define AAA\u0024 // expected-error {{character '$' cannot be specified by a universal character name}} \
86+
// expected-warning {{whitespace}}
87+
#define AAB\u0040 // expected-error {{character '@' cannot be specified by a universal character name}} \
88+
// expected-warning {{whitespace}}
89+
#define AAC\u0060 // expected-error {{character '`' cannot be specified by a universal character name}} \
90+
// expected-warning {{whitespace}}
91+
92+
#define ABA \u0024 // expected-error {{character '$' cannot be specified by a universal character name}}
93+
#define ABB \u0040 // expected-error {{character '@' cannot be specified by a universal character name}}
94+
#define ABC \u0060 // expected-error {{character '`' cannot be specified by a universal character name}}
95+
96+
int GH62133_a\u0024; // expected-error {{character '$' cannot be specified by a universal character name}} \
97+
// expected-error {{}}
98+
int GH62133_b\u0040; // expected-error {{character '@' cannot be specified by a universal character name}} \
99+
// expected-error {{}}
100+
int GH62133_c\u0060; // expected-error {{character '`' cannot be specified by a universal character name}} \
101+
// expected-error {{}}

clang/test/Preprocessor/ucn-pp-identifier.c

Lines changed: 5 additions & 3 deletions
@@ -1,4 +1,5 @@
 // RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify=expected,ext -Wundef -DTRIGRAPHS=1
+// RUN: %clang_cc1 %s -fsyntax-only -std=c2x -pedantic -verify=expected,ext -Wundef -DTRIGRAPHS=1
 // RUN: %clang_cc1 %s -fsyntax-only -x c++ -pedantic -verify=expected,ext -Wundef -fno-trigraphs
 // RUN: %clang_cc1 %s -fsyntax-only -x c++ -std=c++23 -pedantic -ftrigraphs -DTRIGRAPHS=1 -verify=expected,cxx23 -Wundef -Wpre-c++23-compat
 // RUN: %clang_cc1 %s -fsyntax-only -x c++ -pedantic -verify=expected,ext -Wundef -ftrigraphs -DTRIGRAPHS=1
@@ -40,7 +41,8 @@
 // ext-warning {{extension}} cxx23-warning {{before C++23}}
 #define \N{WASTEBASKET} // expected-error {{macro name must be an identifier}} \
 // ext-warning {{extension}} cxx23-warning {{before C++23}}
-#define a\u0024
+#define a\u0024a // expected-error {{character '$' cannot be specified by a universal character name}} \
+// expected-warning {{requires whitespace after the macro name}}
 
 #if \u0110 // expected-warning {{is not defined, evaluates to 0}}
 #endif
@@ -112,7 +114,7 @@ C 1
 #define capital_u_\U00FC
 // expected-warning@-1 {{incomplete universal character name}} expected-note@-1 {{did you mean to use '\u'?}} expected-warning@-1 {{whitespace}}
 // CHECK: note: did you mean to use '\u'?
-// CHECK-NEXT: {{^ 112 | #define capital_u_\U00FC}}
+// CHECK-NEXT: {{^ .* | #define capital_u_\U00FC}}
 // CHECK-NEXT: {{^ | \^}}
 // CHECK-NEXT: {{^ | u}}
 
@@ -155,5 +157,5 @@ int a\N{LATIN CAPITAL LETTER A WITH GRAVE??>; // expected-warning {{trigraph con
 int a\N{LATIN CAPITAL LETTER A WITH GRAVE??>;
 // expected-warning@-1 {{trigraph ignored}}\
 // expected-warning@-1 {{incomplete}}\
-// expected-error@-1 {{expected ';' after top level declarator}}
+// expected-error@-1 {{expected unqualified-id}}
 #endif
#endif
