Commit 392a290

Add mixed utf8 literals rfc.
1 parent cff401d commit 392a290

1 file changed

+109
-0
lines changed

text/0000-mixed-utf8-literals.md

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
- Feature Name: `mixed_utf8_literals`
- Start Date: 2022-11-15
- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary
[summary]: #summary

Allow the exact same characters and escape codes in `"…"` and `b"…"` literals.

That is:

- Allow Unicode characters, including `\u{…}` escape codes, in byte string literals. E.g. `b"hello\xff我叫\u{1F980}"`
- Allow `\x…` escape codes in regular string literals, as long as the resulting string is valid UTF-8. E.g. `"\xf0\x9f\xa6\x80"`

# Motivation
[motivation]: #motivation

Byte strings (`[u8]`) are a strict superset of regular (UTF-8) strings (`str`),
but Rust's byte string literals are currently not a superset of regular string literals:
they reject non-ASCII characters and `\u{…}` escape codes.

```
error: non-ASCII character in byte constant
 --> src/main.rs:2:16
  |
2 |     b"hello\xff你\u{597d}"
  |                ^^ byte constant must be ASCII
  |

error: unicode escape in byte string
 --> src/main.rs:2:17
  |
2 |     b"hello\xff你\u{597d}"
  |                  ^^^^^^^^ unicode escape in byte string
  |
```

This can be annoying when working with "conventionally UTF-8" strings, such as with the popular [`bstr` crate](https://docs.rs/bstr/latest/bstr/).
For example, right now, there is no convenient way to write a literal like `b"hello\xff你好"`.

Allowing all characters and escape codes in both types of string literals reduces the complexity of the language.
We'd no longer have [different escape codes](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings)
for different literal types. We'd only require regular string literals to be valid UTF-8.

If we can postpone the UTF-8 validation until the point where tokens are turned into literals, then this not only simplifies the job of the tokenizer,
but also allows macros to take string literals with invalid UTF-8 (through `$_:tt` or `TokenTree`).
That can be useful for macros like `cstr!("…")` and `wide!("…")`, which currently produce unnecessary errors for non-UTF-8 data:

```
error: out of range hex escape
 --> src/main.rs:3:13
  |
3 |     cstr!("¿\xff");
  |             ^^^^ must be a character in the range [\x00-\x7f]
```

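For comparison, the two literals this section wishes for can be approximated on current stable Rust roughly as follows. This is only a sketch of possible workarounds; the helper names `hello_bytes` and `inverted_question_cstr` are made up for illustration and are not part of the proposal.

```rust
use std::ffi::CStr;

/// Bytes the proposed literal `b"hello\xff你好"` would denote, built today by
/// concatenating a byte string literal with the UTF-8 bytes of a `str` literal.
/// (Hypothetical helper, for illustration only.)
fn hello_bytes() -> Vec<u8> {
    [b"hello\xff".as_slice(), "你好".as_bytes()].concat()
}

/// Contents that `cstr!("¿\xff")` is meant to carry, written today by spelling
/// out the UTF-8 encoding of '¿' (U+00BF) by hand as `\xc2\xbf`.
/// (Hypothetical helper, for illustration only.)
fn inverted_question_cstr() -> &'static CStr {
    CStr::from_bytes_with_nul(b"\xc2\xbf\xff\0").expect("one trailing NUL, no interior NUL")
}

fn main() {
    let bytes = hello_bytes();
    // 6 bytes from `hello\xff`, plus two characters of 3 bytes each.
    assert_eq!(bytes.len(), 12);
    assert_eq!(&bytes[6..], "你好".as_bytes());
    assert_eq!(inverted_question_cstr().to_bytes(), [0xc2, 0xbf, 0xff]);
}
```

Both workarounds force the author either to split the text or to encode characters by hand, which is the inconvenience described above.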
# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Regular string literals (`""`) must be valid UTF-8. For example, valid strings are `"abc"`, `"🦀"`, `"\u{1F980}"` and `"\xf0\x9f\xa6\x80"`.
`"\x80"` is rejected, however, because a lone `0x80` byte is not valid UTF-8.

Byte string literals (`b""`) may include non-ASCII characters and Unicode escape codes (`\u{…}`), which will be encoded as UTF-8.

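The byte-level facts behind these rules can be checked on current stable Rust with `std::str::from_utf8`. The following is a small sketch; `valid_str_contents` is a hypothetical helper that stands in for the proposed validity check and is not a compiler API.

```rust
use std::str;

/// Returns true if `bytes` would be acceptable as the contents of a regular
/// string literal under the rule above, i.e. if they form valid UTF-8.
/// (Hypothetical helper, for illustration only.)
fn valid_str_contents(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    // The character, its `\u{…}` escape and the four `\x…` bytes all denote
    // the same byte sequence.
    assert_eq!("🦀".as_bytes(), b"\xf0\x9f\xa6\x80");
    assert_eq!("\u{1F980}".as_bytes(), b"\xf0\x9f\xa6\x80");
    assert!(valid_str_contents(b"\xf0\x9f\xa6\x80"));

    // A lone continuation byte is not valid UTF-8.
    assert!(!valid_str_contents(b"\x80"));

    // In a byte string, `\u{597d}` would contribute its UTF-8 encoding.
    assert_eq!("\u{597d}".as_bytes(), [0xe5, 0xa5, 0xbd]);
}
```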
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The tokenizer should accept all escape codes in both `""` and `b""` literals.
Only regular string literals are checked for valid UTF-8, and only at the point where the token is converted to a string literal AST node.

Just like how `$_:tt` accepts a thousand-digit integer literal but `$_:literal` does not,
a `$_:tt` should accept `"\x80"`, but `$_:literal` should not.
Similarly, proc macros should be able to consume invalid UTF-8 string literals as `TokenTree`.

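The split between unescaping and validation can be modelled with a toy two-stage pipeline. This is a sketch under simplifying assumptions, not the compiler's implementation: the functions `parse_hex`, `unescape`, `lower_str` and `lower_byte_str` are hypothetical, only `\xNN` and `\u{…}` escapes are handled, and details such as underscores or digit limits in `\u{…}` are ignored.

```rust
/// Parses a non-empty string of hex digits. (Hypothetical helper.)
fn parse_hex(s: &str) -> Option<u32> {
    if s.is_empty() || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u32::from_str_radix(s, 16).ok()
}

/// Toy "tokenizer" stage: unescape a literal body into raw bytes, accepting
/// `\xNN` and `\u{…}` regardless of literal kind. No UTF-8 check happens here.
fn unescape(body: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Plain characters contribute their UTF-8 encoding.
            out.extend_from_slice(c.encode_utf8(&mut [0; 4]).as_bytes());
            continue;
        }
        match chars.next()? {
            'x' => {
                // Exactly two hex digits give one raw byte.
                let hex: String = chars.by_ref().take(2).collect();
                if hex.len() != 2 {
                    return None;
                }
                out.push(u8::try_from(parse_hex(&hex)?).ok()?);
            }
            'u' => {
                if chars.next()? != '{' {
                    return None;
                }
                let mut hex = String::new();
                loop {
                    match chars.next()? {
                        '}' => break,
                        d => hex.push(d),
                    }
                }
                // The escaped character contributes its UTF-8 encoding.
                let ch = char::from_u32(parse_hex(&hex)?)?;
                out.extend_from_slice(ch.encode_utf8(&mut [0; 4]).as_bytes());
            }
            _ => return None, // other escapes are omitted in this sketch
        }
    }
    Some(out)
}

/// Toy "lowering" stage for a regular string literal: UTF-8 is checked only here.
fn lower_str(body: &str) -> Option<String> {
    String::from_utf8(unescape(body)?).ok()
}

/// Toy "lowering" stage for a byte string literal: the bytes are used as-is.
fn lower_byte_str(body: &str) -> Option<Vec<u8>> {
    unescape(body)
}

fn main() {
    // The body of `"\x80"` unescapes fine, so a token-level consumer can see it...
    assert_eq!(unescape(r"\x80"), Some(vec![0x80]));
    // ...but it is rejected once it is treated as a `str` literal.
    assert_eq!(lower_str(r"\x80"), None);
    assert_eq!(lower_str(r"\xf0\x9f\xa6\x80").as_deref(), Some("🦀"));
    assert_eq!(
        lower_byte_str(r"hello\xff你\u{597d}").unwrap(),
        b"hello\xff\xe4\xbd\xa0\xe5\xa5\xbd"
    );
}
```

In this model, `unescape` plays the role of what a `$_:tt` or `TokenTree` consumer would see, while `lower_str` corresponds to the point where `$_:literal` or the AST conversion would reject invalid UTF-8.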
# Drawbacks
[drawbacks]: #drawbacks

One might unintentionally write `\xf0` instead of `\u{f0}`.
However, for regular string literals that will result in an error in nearly all cases, since `\xf0` is not valid UTF-8 by itself.

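The difference between the two escapes can be seen at the byte level on current stable Rust. This is a minimal sketch; `utf8_of` is a hypothetical helper introduced only for this illustration.

```rust
/// UTF-8 encoding of a single character, as a `\u{…}` escape would contribute it.
/// (Hypothetical helper, for illustration only.)
fn utf8_of(c: char) -> Vec<u8> {
    c.to_string().into_bytes()
}

fn main() {
    // `\u{f0}` names the character U+00F0 ('ð'), which is two bytes in UTF-8.
    assert_eq!(utf8_of('\u{f0}'), [0xc3, 0xb0]);

    // A lone 0xF0 byte only opens a four-byte sequence and is invalid by itself.
    assert!(std::str::from_utf8(b"\xf0").is_err());

    // It is accepted only when the right continuation bytes follow.
    assert_eq!(std::str::from_utf8(b"\xf0\x9f\xa6\x80"), Ok("🦀"));
}
```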
# Alternatives
[alternatives]: #alternatives

- Only extend `b""`, but still don't accept `\x` in regular string literals (`""`).

- Stabilize `concat_bytes!()` and require writing `b"hello\xff你好"` as `concat_bytes!(b"hello\xff", "你好")`.
  (Assuming we extend the macro to accept a mix of byte string literals and regular string literals.)

# Prior art
[prior-art]: #prior-art

- C and C++ do the same. (Assuming a UTF-8 character set.)
- [The `bstr` crate](https://docs.rs/bstr/latest/bstr/)
- Python and JavaScript do it differently: `\xff` means `\u{ff}`, because their strings behave like UTF-32 or UTF-16 rather than UTF-8.
  (Also, Python's byte strings "accept" `\u` escape codes as just `'\\', 'u'`, without any warning or error.)

# Unresolved questions
[unresolved-questions]: #unresolved-questions

- Should `concat!("\xf0\x9f", "\xa6\x80")` work? (The string literals are not valid UTF-8 individually, but are valid UTF-8 after being concatenated.)

  (I don't care. I guess we should do whatever is easiest to implement.)

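The situation in the `concat!` question can be reproduced with plain byte slices on current stable Rust. This sketch only illustrates the byte-level facts and says nothing about how `concat!` itself should behave; `is_utf8` and `joined` are hypothetical helpers.

```rust
use std::str;

/// Whether `bytes` is valid UTF-8. (Hypothetical helper, for illustration only.)
fn is_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// The two halves from the question above, concatenated.
/// (Hypothetical helper, for illustration only.)
fn joined() -> Vec<u8> {
    [b"\xf0\x9f".as_slice(), b"\xa6\x80".as_slice()].concat()
}

fn main() {
    // Neither half is valid UTF-8 on its own...
    assert!(!is_utf8(b"\xf0\x9f"));
    assert!(!is_utf8(b"\xa6\x80"));
    // ...but together they are the four-byte encoding of U+1F980.
    assert_eq!(str::from_utf8(&joined()), Ok("🦀"));
}
```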
# Future possibilities
[future-possibilities]: #future-possibilities

- Update the `concat!()` macro to accept `b""` strings and also not implicitly convert integers to strings, such that `concat!(b"", $x, b"\0")` becomes usable.
  (This would need to happen over an edition.)
