Commit 3892774

Move the "validate later" part to future possibilities.
1 parent 679c165 commit 3892774

1 file changed: 14 additions, 22 deletions

text/3349-mixed-utf8-literals.md

@@ -39,39 +39,24 @@ error: unicode escape in byte string
 This can be annoying when working with "conventionally UTF-8" strings, such as with the popular [`bstr` crate](https://docs.rs/bstr/latest/bstr/).
 For example, right now, there is no convenient way to write a literal like `b"hello\xff你好"`.

-Allowing all characters and escape codes in both types of string literals reduces the complexity of the language.
+Allowing all characters and all known escape codes in both types of string literals reduces the complexity of the language.
 We'd no longer have [different escape codes](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings)
 for different literal types. We'd only require regular string literals to be valid UTF-8.

-If we can postpone the UTF-8 validation until the point where tokens are turned into literals, then this not only simplifies the job of the tokenizer,
-but allows macros to take string literals with invalid UTF-8 (through `$_:tt` or `TokenTree`).
-That can be useful for macros like `cstr!("…")` and `wide!("…")`, etc., which currently unnecessarily result in errors for non-UTF-8 data:
-
-```
-error: out of range hex escape
---> src/main.rs:3:13
-|
-3 | cstr!("¿\xff");
-| ^^^^ must be a character in the range [\x00-\x7f]
-```
-
 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation

-Regular string literals (`""`) must be valid UTF-8. For example, valid strings are `"abc"`, `"🦀"`, `"\u{1F980}"` and `"\xf0\x9f\xa6\x80"`.
+Regular string literals (`""`) must be valid UTF-8.
+For example, valid strings are `"abc"`, `"🦀"`, `"\u{1F980}"` and `"\xf0\x9f\xa6\x80"`.
 `"\x80"` is not valid, however, as that is not valid UTF-8.

 Byte string literals (`b""`) may include non-ascii characters and unicode escape codes (`\u{…}`), which will be encoded as UTF-8.

 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation

-The tokenizer should accept all escape codes in both `""` and `b""` literals.
-Only a regular string literal is checked for invalid UTF-8, but only at the point where the token is converted to a string literal AST node.
-
-Just like how `$_:tt` accepts a thousand-digit integer literal but `$_:literal` does not,
-a `$_:tt` should accept `"\x80"`, but `$_:literal` should not.
-Similar, proc macros should be able to consume invalid UTF-8 string literals as `TokenTree`.
+The tokenizer should accept all known escape codes in both `""` and `b""` literals.
+Only a regular string literal is checked to be valid UTF-8 afterwards.

 # Drawbacks
 [drawbacks]: #drawbacks
@@ -92,8 +77,8 @@ However, for regular string literals that will result in an error in nearly all

 - C and C++ do the same. (Assuming UTF-8 character set.)
 - [The `bstr` crate](https://docs.rs/bstr/latest/bstr/)
-- Python and Javascript do it differently: `\xff` mean `\u{ff}`, because their strings behave like UTF-32 or UTF-16 rather than UTF-8.
-(Also, Python's byte strings "accept" `\u` escape codes as just `'\\', 'u'`, without any warning or error.)
+- Python and Javascript do it differently: `\xff` means `\u{ff}`, because their strings behave like UTF-32 or UTF-16 rather than UTF-8.
+(Also, Python's byte strings "accept" `\u` as just `'\\', 'u'`, without any warning or error.)

 # Unresolved questions
 [unresolved-questions]: #unresolved-questions
@@ -113,8 +98,15 @@ However, for regular string literals that will result in an error in nearly all
 Probably not, since a `char` is not UTF-8 encoded; it's a single UTF-32 codepoint.
 _Decoding_ UTF-8 from `\x` escape codes back into UTF-32 would be a bit surprising.

+(But note that `'\x41'` already works, for single byte UTF-8 characters, aka ASCII.)
+
 # Future possibilities
 [future-possibilities]: #future-possibilities

+- Postpone the UTF-8 validation to a later stage, such that macros can accept literals with invalid UTF-8. E.g. `cstr!("\xff")`.
+
+- If we do that, we could also decide to accept _all_ escape codes, even unknown ones, to allow things like `some_macro!("\a\b\c")`.
+(The tokenizer would only need to know about `\"`.)
+
 - Update the `concat!()` macro to accept `b""` strings and also not implicitly convert integers to strings, such that `concat!(b"", $x, b"\0")` becomes usable.
 (This would need to happen over an edition.)
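
To make the literal semantics touched by this change more concrete, here is a minimal sketch that uses only what stable Rust already accepts. It does not use the proposed syntax. The helper names `mixed_literal_bytes` and `is_valid_str_contents` are hypothetical and exist only for illustration. The first builds, by hand, the bytes that a literal like `b"hello\xff你好"` would presumably denote under the RFC. The second approximates the "must be valid UTF-8" check for regular string literals with `std::str::from_utf8`.

```rust
/// Hypothetical helper: assembles by hand the bytes that the proposed
/// literal `b"hello\xff你好"` would presumably denote. The byte string part
/// uses today's syntax; the non-ASCII part is appended as UTF-8.
fn mixed_literal_bytes() -> Vec<u8> {
    let mut bytes = b"hello\xff".to_vec();
    bytes.extend_from_slice("你好".as_bytes());
    bytes
}

/// Hypothetical helper: approximates the rule that the contents of a
/// regular `""` literal must be valid UTF-8.
fn is_valid_str_contents(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

With these helpers, `is_valid_str_contents(&[0xf0, 0x9f, 0xa6, 0x80])` holds (those bytes are the UTF-8 encoding of `"\u{1F980}"`), while `is_valid_str_contents(&[0x80])` does not, matching the `"\x80"` example in the guide-level text. The bytes from `mixed_literal_bytes()` are likewise not valid UTF-8 because of the `0xff` byte, which is why such data only fits in a `b""` literal.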
