Commit 32c8d67

Update RFC.
1 parent 3892774 commit 32c8d67

1 file changed

+36
-20
lines changed

text/3349-mixed-utf8-literals.md

Lines changed: 36 additions & 20 deletions
@@ -6,9 +6,9 @@
66
# Summary
77
[summary]: #summary
88

9-
Allow the exact same characters and escape codes in `"…"` and `b"…"` literals.
9+
Relax the restrictions on which escape codes are allowed in string, char, and byte literals.
1010

11-
That is:
11+
Most importantly, this means we accept the exact same characters and escape codes in `"…"` and `b"…"` literals. That is:
1212

1313
- Allow unicode characters, including `\u{…}` escape codes, in byte string literals. E.g. `b"hello\xff我叫\u{1F980}"`
1414
- Also allow non-ASCII `\x…` escape codes in regular string literals, as long as they are valid UTF-8. E.g. `"\xf0\x9f\xa6\x80"`
@@ -46,17 +46,46 @@ for different literal types. We'd only require regular string literals to be val
4646
# Guide-level explanation
4747
[guide-level-explanation]: #guide-level-explanation
4848

49-
Regular string literals (`""`) must be valid UTF-8.
49+
Regular string literals (`""` and `r""`) must be valid UTF-8.
5050
For example, valid strings are `"abc"`, `"🦀"`, `"\u{1F980}"` and `"\xf0\x9f\xa6\x80"`.
51-
`"\x80"` is not valid, however, as that is not valid UTF-8.
51+
`"\xff"` is not valid, however, as that is not valid UTF-8.
5252

53-
Byte string literals (`b""`) may include non-ascii characters and unicode escape codes (`\u{…}`), which will be encoded as UTF-8.
53+
Byte string literals (`b""` and `br""`) may include non-ascii characters and unicode escape codes (`\u{…}`), which will be encoded as UTF-8.
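
The rules above can be approximated with APIs that already exist today. The following is only a sketch: `is_valid_str_contents` and `mixed_byte_string` are hypothetical helper names, not part of the RFC, and the byte string is assembled by hand because the proposed literal syntax is assumed not to be accepted by current compilers.

```rust
use std::str;

/// The bytes that the proposed string literal "\xf0\x9f\xa6\x80" would contain,
/// written here as a byte string literal, which already accepts `\x80`..`\xff`.
const CRAB_BYTES: &[u8] = b"\xf0\x9f\xa6\x80";

/// Hypothetical helper: would these bytes be acceptable as the contents of a
/// regular string literal, i.e. are they valid UTF-8?
fn is_valid_str_contents(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Hypothetical helper: assembles the bytes that the RFC's example
/// b"hello\xff我叫\u{1F980}" would contain under the proposal.
fn mixed_byte_string() -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"hello"); // plain ASCII
    out.push(0xff); // `\xff` stays a single raw byte
    out.extend_from_slice("我叫".as_bytes()); // non-ASCII characters, encoded as UTF-8
    out.extend_from_slice("\u{1F980}".as_bytes()); // `\u{…}` escape, encoded as UTF-8
    out
}

fn main() {
    // The four `\x` escapes spell out the UTF-8 encoding of U+1F980.
    assert_eq!(str::from_utf8(CRAB_BYTES), Ok("🦀"));
    assert_eq!("\u{1F980}", "🦀");

    // A lone 0xff byte is never valid UTF-8, so "\xff" would be rejected.
    assert!(!is_valid_str_contents(b"\xff"));

    // The mixed byte string is allowed to be invalid UTF-8 as a whole.
    let bytes = mixed_byte_string();
    assert!(!is_valid_str_contents(&bytes));
    println!("{:?}", bytes);
}
```

The asymmetry is the point: the same escapes are accepted in both kinds of literal, but only the regular string has to pass the UTF-8 check afterwards.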
54+
55+
The `char` type does not store UTF-8, so while `'\u{1F980}'` is valid, trying to encode it in UTF-8 as in `'\xf0\x9f\xa6\x80'` is not accepted.
56+
In a char literal (`''`), `\x` may only be used for values 0 through 0x7F.
57+
58+
Similarly, in a byte literal (`b''`), `\u` may only be used for values 0 through 0x7F, since those are the only code points that are unambiguously represented as a single byte.
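
A small sketch of why the single-character literals are treated differently, using only behaviour that exists today. `byte_from_unicode_escape` is a hypothetical helper that models what a `\u{…}` escape inside `b''` would evaluate to under this proposal; it is not an API from the RFC.

```rust
/// Hypothetical helper: the value a byte literal such as b'\u{30}' would have
/// under the proposal, or `None` if the code point does not fit in one byte
/// unambiguously (anything above 0x7F).
fn byte_from_unicode_escape(c: char) -> Option<u8> {
    if c.is_ascii() {
        Some(c as u8)
    } else {
        None
    }
}

fn main() {
    // A `char` holds one code point, not a UTF-8 byte sequence.
    let crab = '\u{1F980}';
    assert_eq!(crab as u32, 0x1F980);
    assert_eq!(std::mem::size_of::<char>(), 4);

    // Encoding that code point as UTF-8 takes four bytes, which is why
    // '\xf0\x9f\xa6\x80' cannot be a single char literal.
    assert_eq!(crab.len_utf8(), 4);

    // `\x` escapes up to 0x7F already work in char literals.
    assert_eq!('\x41', 'A');

    // ASCII code points map to exactly one byte; everything else does not.
    assert_eq!(byte_from_unicode_escape('\u{30}'), Some(b'0'));
    assert_eq!(byte_from_unicode_escape('\u{80}'), None);
}
```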
5459

5560
# Reference-level explanation
5661
[reference-level-explanation]: #reference-level-explanation
5762

58-
The tokenizer should accept all known escape codes in both `""` and `b""` literals.
59-
Only a regular string literal is checked to be valid UTF-8 afterwards.
63+
The ["characters and strings" section in the Rust Reference](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings)
64+
is updated with the following table:
65+
66+
| | Example | Characters | Escapes | Validation |
67+
|--|--|--|--|--|
68+
| Character | `'H'` | All Unicode | ASCII, unicode | Valid unicode code point |
69+
| String | `"hello"` | All Unicode | ASCII, high byte, unicode | Valid UTF-8 |
70+
| Raw string | `r#"hello"#` | All Unicode | - | Valid UTF-8 |
71+
| Byte | `b'H'` | All ASCII | ASCII, high byte | - |
72+
| Byte string | `b"hello"` | All Unicode | ASCII, high byte, unicode | - |
73+
| Raw byte string | `br#"hello"#` | All Unicode | - | - |
74+
75+
With the following definitions for the escape codes:
76+
77+
- ASCII: `\'`, `\"`, `\n`, `\r`, `\t`, `\\`, `\0`, `\u{0}` through `\u{7F}`, `\x00` through `\x7F`
78+
- Unicode: `\u{80}` and beyond.
79+
- High byte: `\x80` through `\xFF`
80+
81+
Compared to before, the tokenizer should start accepting:
82+
- unicode characters in `b""` and `br""` literals (which will be encoded as UTF-8),
83+
- all `\x` escapes in `""` literals,
84+
- all `\u` escapes in `b""` literals (which will be encoded as UTF-8), and
85+
- ASCII `\u` escapes in `b''` literals.
86+
87+
Regular string literals (`""`) are checked to be valid UTF-8 afterwards.
88+
(Either during tokenization, or at a later point in time. See future possibilities.)
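
The table and escape-code definitions above can be read as a small decision function. The sketch below is one possible encoding and not the compiler's actual tokenizer; the type and function names (`LiteralKind`, `EscapeClass`, `classify_x`, `classify_u`, `escape_allowed`, `requires_utf8_validation`) are all hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LiteralKind {
    Char,
    Str,
    RawStr,
    Byte,
    ByteStr,
    RawByteStr,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum EscapeClass {
    Ascii,    // \n, \t, ..., \x00..=\x7F, \u{0}..=\u{7F}
    HighByte, // \x80..=\xFF
    Unicode,  // \u{80} and beyond
}

/// Classifies a `\xNN` escape by its value.
fn classify_x(value: u8) -> EscapeClass {
    if value <= 0x7F {
        EscapeClass::Ascii
    } else {
        EscapeClass::HighByte
    }
}

/// Classifies a `\u{…}` escape by its code point.
fn classify_u(c: char) -> EscapeClass {
    if c.is_ascii() {
        EscapeClass::Ascii
    } else {
        EscapeClass::Unicode
    }
}

/// The "Escapes" column of the table: is this class of escape accepted
/// in this kind of literal?
fn escape_allowed(kind: LiteralKind, class: EscapeClass) -> bool {
    use EscapeClass::*;
    use LiteralKind::*;
    match (kind, class) {
        (RawStr | RawByteStr, _) => false, // raw literals have no escapes
        (Char, Ascii | Unicode) => true,
        (Char, HighByte) => false,
        (Str | ByteStr, _) => true,
        (Byte, Ascii | HighByte) => true,
        (Byte, Unicode) => false,
    }
}

/// The "Validation" column, restricted to the UTF-8 check.
fn requires_utf8_validation(kind: LiteralKind) -> bool {
    matches!(kind, LiteralKind::Str | LiteralKind::RawStr)
}

fn main() {
    // b'\u{30}' is newly accepted, b'\u{80}' is not.
    assert!(escape_allowed(LiteralKind::Byte, classify_u('\u{30}')));
    assert!(!escape_allowed(LiteralKind::Byte, classify_u('\u{80}')));
    // "\xff" passes tokenization but still needs the UTF-8 check afterwards.
    assert!(escape_allowed(LiteralKind::Str, classify_x(0xff)));
    assert!(requires_utf8_validation(LiteralKind::Str));
    assert!(!requires_utf8_validation(LiteralKind::ByteStr));
}
```

Keeping acceptance of escapes separate from UTF-8 validation mirrors the note above that the check may happen during tokenization or at a later point.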
6089

6190
# Drawbacks
6291
[drawbacks]: #drawbacks
@@ -87,19 +116,6 @@ However, for regular string literals that will result in an error in nearly all
87116

88117
(I don't care. I guess we should do whatever is easiest to implement.)
89118

90-
- How about single byte and character literals?
91-
92-
- Should `b'\u{30}'` work? (It's a unicode escape code, but it's still just one byte in UTF-8.)
93-
94-
I think yes. I see no reason to disallow it.
95-
96-
- Should `'\xf0\x9f\xa6\x80'` work? (It's multiple escape codes, but it's still just one character in UTF-8.)
97-
98-
Probably not, since a `char` is not UTF-8 encoded; it's a single UTF-32 codepoint.
99-
_Decoding_ UTF-8 from `\x` escape codes back into UTF-32 would be a bit surprising.
100-
101-
(But note that `'\x41'` already works, for single byte UTF-8 characters, aka ASCII.)
102-
103119
# Future possibilities
104120
[future-possibilities]: #future-possibilities
105121
