|
6 | 6 | # Summary
|
7 | 7 | [summary]: #summary
|
8 | 8 |
|
9 |
| -Allow the exact same characters and escape codes in `"…"` and `b"…"` literals. |
| 9 | +Relax the restrictions on which escape codes are allowed in string, char, and byte literals. |
10 | 10 |
|
11 |
| -That is: |
| 11 | +Most importantly, this means we accept the exact same characters and escape codes in `"…"` and `b"…"` literals. That is: |
12 | 12 |
|
13 | 13 | - Allow unicode characters, including `\u{…}` escape codes, in byte string literals. E.g. `b"hello\xff我叫\u{1F980}"`
|
14 | 14 | - Also allow non-ASCII `\x…` escape codes in regular string literals, as long as they are valid UTF-8. E.g. `"\xf0\x9f\xa6\x80"`
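
As a rough illustration of the first bullet, the sketch below builds, in Rust that compiles today, the bytes that the proposed literal `b"hello\xff我叫\u{1F980}"` would presumably denote. The helper name `proposed_byte_string` is hypothetical and not part of the RFC; it only assumes what the guide-level explanation states, namely that non-ASCII characters and `\u{…}` escapes in byte strings are encoded as UTF-8.

```rust
/// Hypothetical helper (not part of the RFC): assembles the bytes that the
/// proposed literal `b"hello\xff我叫\u{1F980}"` would stand for.
fn proposed_byte_string() -> Vec<u8> {
    let mut bytes = Vec::new();
    // Plain ASCII, already allowed in byte strings today.
    bytes.extend_from_slice(b"hello");
    // `\xff`: a single raw byte, already allowed in byte strings today.
    bytes.push(0xff);
    // Non-ASCII characters, assumed to be stored as their UTF-8 encoding.
    bytes.extend_from_slice("我叫".as_bytes());
    // `\u{1F980}`, assumed to be stored as its UTF-8 encoding.
    bytes.extend_from_slice("\u{1F980}".as_bytes());
    bytes
}
```

The result is 16 bytes long and, because of the lone `0xff` byte, is not valid UTF-8 as a whole, which is acceptable for a byte string.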
|
@@ -46,17 +46,46 @@ for different literal types. We'd only require regular string literals to be val
|
46 | 46 | # Guide-level explanation
|
47 | 47 | [guide-level-explanation]: #guide-level-explanation
|
48 | 48 |
|
49 |
| -Regular string literals (`""`) must be valid UTF-8. |
| 49 | +Regular string literals (`""` and `r""`) must be valid UTF-8. |
50 | 50 | For example, valid strings are `"abc"`, `"🦀"`, `"\u{1F980}"` and `"\xf0\x9f\xa6\x80"`.
|
51 |
| -`"\x80"` is not valid, however, as that is not valid UTF-8. |
| 51 | +`"\xff"` is not valid, however, as the byte `0xFF` never occurs in valid UTF-8. |
52 | 52 |
|
53 |
| -Byte string literals (`b""`) may include non-ascii characters and unicode escape codes (`\u{…}`), which will be encoded as UTF-8. |
| 53 | +Byte string literals (`b""` and `br""`) may include non-ASCII characters and Unicode escape codes (`\u{…}`), which will be encoded as UTF-8. |
| 54 | + |
| 55 | +The `char` type does not store UTF-8, so while `'\u{1F980}'` is valid, trying to encode it in UTF-8 as in `'\xf0\x9f\xa6\x80'` is not accepted. |
| 56 | +In a char literal (`''`), `\x` may only be used for values 0 through 0x7F. |
| 57 | + |
| 58 | +Similarly, in a byte literal (`b''`), `\u` may only be used for values 0 through 0x7F, since those are the only code points that are unambiguously represented as a single byte. |
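
The rules in this section can be approximated with a few checks in current Rust. This is only a sketch: the three helper names are hypothetical, and the string check assumes the escapes have already been expanded into a byte sequence before validation.

```rust
/// Hypothetical helper: the UTF-8 check a regular string literal would need
/// once its escapes have been expanded into bytes.
fn validate_str_literal(expanded: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(expanded)
}

/// Hypothetical helper: whether a `\x` escape with this value would be
/// accepted in a char literal (`''`), i.e. values 0 through 0x7F.
fn x_escape_ok_in_char(value: u8) -> bool {
    value <= 0x7F
}

/// Hypothetical helper: whether a `\u{…}` escape with this code point would be
/// accepted in a byte literal (`b''`), i.e. values 0 through 0x7F.
fn u_escape_ok_in_byte(code_point: u32) -> bool {
    code_point <= 0x7F
}
```

For instance, `validate_str_literal(b"\xf0\x9f\xa6\x80")` yields `Ok("🦀")`, while `validate_str_literal(b"\xff")` is an error. Likewise `x_escape_ok_in_char(0x41)` holds but `x_escape_ok_in_char(0xf0)` does not, and `u_escape_ok_in_byte(0x30)` holds but `u_escape_ok_in_byte(0x1F980)` does not.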
54 | 59 |
|
55 | 60 | # Reference-level explanation
|
56 | 61 | [reference-level-explanation]: #reference-level-explanation
|
57 | 62 |
|
58 |
| -The tokenizer should accept all known escape codes in both `""` and `b""` literals. |
59 |
| -Only a regular string literal is checked to be valid UTF-8 afterwards. |
| 63 | +The ["characters and strings" section in the Rust Reference](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings) |
| 64 | +is updated with the following table: |
| 65 | + |
| 66 | + | Example | Characters | Escapes | Validation |
| 67 | +-- | -- | -- | -- | -- |
| 68 | +Character | 'H' | All Unicode | ASCII, unicode | Valid unicode code point |
| 69 | +String | "hello" | All Unicode | ASCII, high byte, unicode | Valid UTF-8 |
| 70 | +Raw string | r#"hello"# | All Unicode | - | Valid UTF-8 |
| 71 | +Byte | b'H' | All ASCII | ASCII, high byte | - |
| 72 | +Byte string | b"hello" | All Unicode | ASCII, high byte, unicode | - |
| 73 | +Raw byte string | br#"hello"# | All Unicode | - | - |
| 74 | + |
| 75 | +With the following definitions for the escape codes: |
| 76 | + |
| 77 | +- ASCII: `\'`, `\"`, `\n`, `\r`, `\t`, `\\`, `\0`, `\u{0}` through `\u{7F}`, `\x00` through `\x7F` |
| 78 | +- Unicode: `\u{80}` and beyond. |
| 79 | +- High byte: `\x80` through `\xFF` |
| 80 | + |
| 81 | +Compared to before, the tokenizer should start accepting: |
| 82 | +- unicode characters in `b""` and `br""` literals (which will be encoded as UTF-8), |
| 83 | +- all `\x` escapes in `""` literals, |
| 84 | +- all `\u` escapes in `b""` literals (which will be encoded as UTF-8), and |
| 85 | +- ASCII `\u` escapes in `b''` literals. |
| 86 | + |
| 87 | +Regular string literals (`""`) are checked to be valid UTF-8 afterwards. |
| 88 | +(Either during tokenization, or at a later point in time. See future possibilities.) |
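
The table and escape-code definitions above can be read as a small decision function. The following is a minimal sketch of that reading, not the tokenizer's actual implementation; all type and function names (`LiteralKind`, `EscapeClass`, `classify_x`, `classify_u`, `accepts`, `needs_utf8_check`) are hypothetical, and the simple escapes such as `\n` are assumed to fall under the ASCII class as listed.

```rust
/// The six literal forms from the table (names are illustrative).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LiteralKind {
    Char,
    Str,
    RawStr,
    Byte,
    ByteStr,
    RawByteStr,
}

/// The three escape classes defined below the table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum EscapeClass {
    Ascii,
    HighByte,
    Unicode,
}

/// Classifies a `\xNN` escape: 0x00..=0x7F is ASCII, 0x80..=0xFF is a high byte.
fn classify_x(value: u8) -> EscapeClass {
    if value <= 0x7F {
        EscapeClass::Ascii
    } else {
        EscapeClass::HighByte
    }
}

/// Classifies a `\u{…}` escape. Returns `None` when the value is not a
/// Unicode scalar value (for example a surrogate).
fn classify_u(code_point: u32) -> Option<EscapeClass> {
    let c = std::char::from_u32(code_point)?;
    Some(if c.is_ascii() {
        EscapeClass::Ascii
    } else {
        EscapeClass::Unicode
    })
}

/// Mirrors the "Escapes" column: which escape classes each literal form accepts.
fn accepts(kind: LiteralKind, class: EscapeClass) -> bool {
    match kind {
        LiteralKind::Char => matches!(class, EscapeClass::Ascii | EscapeClass::Unicode),
        LiteralKind::Byte => matches!(class, EscapeClass::Ascii | EscapeClass::HighByte),
        LiteralKind::Str | LiteralKind::ByteStr => true,
        // Raw literals do not process escapes at all.
        LiteralKind::RawStr | LiteralKind::RawByteStr => false,
    }
}

/// Mirrors the "Validation" column for the forms that must be valid UTF-8.
fn needs_utf8_check(kind: LiteralKind) -> bool {
    matches!(kind, LiteralKind::Str | LiteralKind::RawStr)
}
```

Under this reading, `accepts(LiteralKind::Byte, classify_u(0x30).unwrap())` holds, matching the newly accepted ASCII `\u` escapes in `b''` literals, while `accepts(LiteralKind::Char, classify_x(0xF0))` does not.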
60 | 89 |
|
61 | 90 | # Drawbacks
|
62 | 91 | [drawbacks]: #drawbacks
|
@@ -87,19 +116,6 @@ However, for regular string literals that will result in an error in nearly all
|
87 | 116 |
|
88 | 117 | (I don't care. I guess we should do whatever is easiest to implement.)
|
89 | 118 |
|
90 |
| -- How about single byte and character literals? |
91 |
| - |
92 |
| - - Should `b'\u{30}'` work? (It's a unicode escape code, but it's still just one byte in UTF-8.) |
93 |
| - |
94 |
| - I think yes. I see no reason to disallow it. |
95 |
| - |
96 |
| - - Should `'\xf0\x9f\xa6\x80'` work? (It's multiple escape codes, but it's still just one character in UTF-8.) |
97 |
| - |
98 |
| - Probably not, since a `char` is not UTF-8 encoded; it's a single UTF-32 codepoint. |
99 |
| - _Decoding_ UTF-8 from `\x` escape codes back into UTF-32 would be a bit surprising. |
100 |
| - |
101 |
| - (But note that `'\x41'` already works, for single byte UTF-8 characters, aka ASCII.) |
102 |
| - |
103 | 119 | # Future possibilities
|
104 | 120 | [future-possibilities]: #future-possibilities
|
105 | 121 |
|