|
| 1 | +- Feature Name: `mixed_utf8_literals` |
| 2 | +- Start Date: 2022-11-15 |
| 3 | +- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000) |
| 4 | +- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000) |
| 5 | + |
| 6 | +# Summary |
| 7 | +[summary]: #summary |
| 8 | + |
| 9 | +Allow the exact same characters and escape codes in `"…"` and `b"…"` literals. |
| 10 | + |
| 11 | +That is: |
| 12 | + |
| 13 | +- Allow Unicode characters, including `\u{…}` escape codes, in byte string literals. E.g. `b"hello\xff我叫\u{1F980}"` |
| 14 | +- Allow `\x…` escape codes in regular string literals, as long as they are valid UTF-8. E.g. `"\xf0\x9f\xa6\x80"` |
| 15 | + |
| 16 | +# Motivation |
| 17 | +[motivation]: #motivation |
| 18 | + |
| 19 | +Byte strings (`[u8]`) are a strict superset of regular (UTF-8) strings (`str`), |
| 20 | +but Rust's byte string literals are currently not a superset of regular string literals: |
| 21 | +they reject non-ASCII characters and `\u{…}` escape codes. |
| 22 | + |
| 23 | +``` |
| 24 | +error: non-ASCII character in byte constant |
| 25 | + --> src/main.rs:2:16 |
| 26 | + | |
| 27 | +2 | b"hello\xff你\u{597d}" |
| 28 | + | ^^ byte constant must be ASCII |
| 29 | + | |
| 30 | + |
| 31 | +error: unicode escape in byte string |
| 32 | + --> src/main.rs:2:17 |
| 33 | + | |
| 34 | +2 | b"hello\xff你\u{597d}" |
| 35 | + | ^^^^^^^^ unicode escape in byte string |
| 36 | + | |
| 37 | +``` |
| 38 | + |
| 39 | +This can be annoying when working with "conventionally UTF-8" strings, such as with the popular [`bstr` crate](https://docs.rs/bstr/latest/bstr/). |
| 40 | +For example, right now, there is no convenient way to write a literal like `b"hello\xff你好"`. |
| 41 | + |
| 42 | +Allowing all characters and escape codes in both types of string literals reduces the complexity of the language. |
| 43 | +We'd no longer have [different escape codes](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings) |
| 44 | +for different literal types. We'd only require regular string literals to be valid UTF-8. |
| 45 | + |
| 46 | +If we can postpone the UTF-8 validation until the point where tokens are turned into literals, then this not only simplifies the job of the tokenizer, |
| 47 | +but also allows macros to take string literals with invalid UTF-8 (through `$_:tt` or `TokenTree`). |
| 48 | +That can be useful for macros like `cstr!("…")` and `wide!("…")`, etc., which currently unnecessarily result in errors for non-UTF-8 data: |
| 49 | + |
| 50 | +``` |
| 51 | +error: out of range hex escape |
| 52 | + --> src/main.rs:3:13 |
| 53 | + | |
| 54 | +3 | cstr!("¿\xff"); |
| 55 | + | ^^^^ must be a character in the range [\x00-\x7f] |
| 56 | +``` |
| 57 | + |
| 58 | +# Guide-level explanation |
| 59 | +[guide-level-explanation]: #guide-level-explanation |
| 60 | + |
| 61 | +Regular string literals (`""`) must be valid UTF-8. For example, valid strings are `"abc"`, `"🦀"`, `"\u{1F980}"` and `"\xf0\x9f\xa6\x80"`. |
| 62 | +`"\x80"` is not valid, however, as that is not valid UTF-8. |
| 63 | + |
| 64 | +Byte string literals (`b""`) may include non-ASCII characters and Unicode escape codes (`\u{…}`), which will be encoded as UTF-8. |
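The rules above can be approximated in today's Rust, which may help when reasoning about what the proposed literals would contain. The following is a minimal sketch, not part of the RFC: the helper names `hello_bytes` and `is_valid_str_contents` are hypothetical, and the code uses only currently accepted syntax (an ASCII-plus-`\x` byte string concatenated with the UTF-8 bytes of a `str`).

```rust
/// Builds the bytes that the proposed literal `b"hello\xff你好"` would denote,
/// using only syntax that is accepted today. (Hypothetical helper.)
fn hello_bytes() -> Vec<u8> {
    // The byte string part carries the raw 0xFF byte;
    // the `str` part contributes its UTF-8 encoding.
    [b"hello\xff".as_slice(), "你好".as_bytes()].concat()
}

/// Returns true if `bytes` would be acceptable as the contents of a regular
/// string literal under this proposal, i.e. if they are valid UTF-8.
/// (Hypothetical helper.)
fn is_valid_str_contents(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Under these assumptions, `is_valid_str_contents(b"\xf0\x9f\xa6\x80")` holds because those four bytes are the UTF-8 encoding of `\u{1F980}`, while `is_valid_str_contents(b"\x80")` does not, since a lone continuation byte is not valid UTF-8.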
| 65 | + |
| 66 | +# Reference-level explanation |
| 67 | +[reference-level-explanation]: #reference-level-explanation |
| 68 | + |
| 69 | +The tokenizer should accept all escape codes in both `""` and `b""` literals. |
| 70 | +Only regular string literals are checked for valid UTF-8, and only at the point where the token is converted to a string literal AST node. |
| 71 | + |
| 72 | +Just like how `$_:tt` accepts a thousand-digit integer literal but `$_:literal` does not, |
| 73 | +a `$_:tt` should accept `"\x80"`, but `$_:literal` should not. |
| 74 | +Similarly, proc macros should be able to consume invalid UTF-8 string literals as `TokenTree`. |
| 75 | + |
| 76 | +# Drawbacks |
| 77 | +[drawbacks]: #drawbacks |
| 78 | + |
| 79 | +One might unintentionally write `\xf0` instead of `\u{f0}`. |
| 80 | +However, for regular string literals that will result in an error in nearly all cases, since that's not valid UTF-8 by itself. |
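To make the difference between the two escapes concrete, here is a small sketch in today's Rust. The helper names are hypothetical and the code only illustrates the byte-level distinction; it does not use the proposed literal syntax.

```rust
/// Returns true if `byte` on its own forms valid UTF-8.
/// (Hypothetical helper.)
fn lone_byte_is_valid_utf8(byte: u8) -> bool {
    std::str::from_utf8(&[byte]).is_ok()
}

/// Returns the UTF-8 encoding of a code point, which is what a `\u{…}`
/// escape contributes to a string. (Hypothetical helper.)
fn encode(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

Here `encode('\u{f0}')` yields the two bytes `0xc3 0xb0`, whereas the single byte `0xf0` is the lead byte of a four-byte sequence and is rejected by `lone_byte_is_valid_utf8`. The mistake would only go unnoticed if `\xf0` happened to be followed by three bytes that complete a valid sequence.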
| 81 | + |
| 82 | +# Alternatives |
| 83 | +[alternatives]: #alternatives |
| 84 | + |
| 85 | +- Only extend `b""`, but still don't accept `\x` in regular string literals (`""`). |
| 86 | + |
| 87 | +- Stabilize `concat_bytes!()` and require writing `b"hello\xff你好"` as `concat_bytes!(b"hello\xff", "你好")`. |
| 88 | + (Assuming we extend the macro to accept a mix of byte string literals and regular string literals.) |
| 89 | + |
| 90 | +# Prior art |
| 91 | +[prior-art]: #prior-art |
| 92 | + |
| 93 | +- C and C++ do the same. (Assuming UTF-8 character set.) |
| 94 | +- [The `bstr` crate](https://docs.rs/bstr/latest/bstr/) |
| 95 | +- Python and JavaScript do it differently: `\xff` means `\u{ff}`, because their strings behave like UTF-32 or UTF-16 rather than UTF-8. |
| 96 | + (Also, Python's byte strings "accept" `\u` escape codes as just `'\\', 'u'`, without any warning or error.) |
| 97 | + |
| 98 | +# Unresolved questions |
| 99 | +[unresolved-questions]: #unresolved-questions |
| 100 | + |
| 101 | +- Should `concat!("\xf0\x9f", "\xa6\x80")` work? (The string literals are not valid UTF-8 individually, but are valid UTF-8 after being concatenated.) |
| 102 | + |
| 103 | + (I don't care. I guess we should do whatever is easiest to implement.) |
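The situation in the question can be reproduced at the byte level in today's Rust. This is only an illustrative sketch with hypothetical helper names; it says nothing about how `concat!()` would or should be implemented.

```rust
/// Returns true if `bytes` is valid UTF-8. (Hypothetical helper.)
fn valid(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Concatenates byte slices, standing in for what `concat!()` would do
/// with the contents of its string literal arguments. (Hypothetical helper.)
fn joined(parts: &[&[u8]]) -> Vec<u8> {
    parts.concat()
}
```

With `a = b"\xf0\x9f"` and `b = b"\xa6\x80"`, both `valid(a)` and `valid(b)` are false, but `valid(&joined(&[a, b]))` is true and the result decodes to `"🦀"`. Whether validation happens per argument or after concatenation is exactly the choice left open here.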
| 104 | + |
| 105 | +# Future possibilities |
| 106 | +[future-possibilities]: #future-possibilities |
| 107 | + |
| 108 | +- Update the `concat!()` macro to accept `b""` strings and also not implicitly convert integers to strings, such that `concat!(b"", $x, b"\0")` becomes usable. |
| 109 | + (This would need to happen over an edition.) |