Commit 5784e2b

Merge remote-tracking branch 'SimonSapin/ascii-literals'
2 parents f1d6906 + 471fbe8 commit 5784e2b

1 file changed

+111
-0
lines changed

active/0000-ascii-literals.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
- Start Date: 2014-05-05
- RFC PR #:
- Rust Issue #:

# Summary

Add ASCII byte literals and ASCII byte string literals to the language,
similar to the existing (Unicode) character and string literals.
Before the RFC process was in place,
this was discussed in [#4334](https://github.com/mozilla/rust/issues/4334).

# Motivation

Programs dealing with text usually should use Unicode,
represented in Rust by the `str` and `char` types.
In some cases, however,
a program may be dealing with bytes that cannot be interpreted as Unicode as a whole,
but that still contain ASCII-compatible parts.

For example, the HTTP protocol was originally defined as Latin-1,
but in practice different pieces of the same request or response
can use different encodings.
The PDF file format is mostly ASCII,
but can contain UTF-16 strings and raw binary data.

There is precedent for this, at least in Python, which has both Unicode and byte strings.

# Drawbacks

The language becomes slightly more complex,
although that complexity should be limited to the parser.

# Detailed design

Using terminology from [the Reference Manual](http://static.rust-lang.org/doc/master/rust.html#character-and-string-literals):

Extend the syntax of expressions and patterns to add
byte literals of type `u8` and
byte string literals of type `&'static [u8]` (or `[u8]`, post-DST).
They are identical to the existing character and string literals, except that:

* They are prefixed with a `b` (for "binary"), to distinguish them.
  This is similar to the `r` prefix for raw strings.
* Unescaped code points in the body must be in the ASCII range: U+0000 to U+007F.
* `'\x5c' 'u' hex_digit 4` and `'\x5c' 'U' hex_digit 8` escapes are not allowed.
* `'\x5c' 'x' hex_digit 2` escapes represent a single byte rather than a code point.
  (They are the only way to express a non-ASCII byte.)

Examples: `b'A' == 65u8`, `b'\t' == 9u8`, `b'\xFF' == 0xFFu8`,
`b"A\t\xFF" == [65u8, 9, 0xFF]`

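The equalities in the examples can be checked with a minimal sketch like the one below. It assumes a current Rust toolchain rather than the 2014 compiler this RFC targets, and the function name `byte_literal_examples_hold` is hypothetical. In current Rust a byte string literal has type `&'static [u8; N]`, which coerces to the `&'static [u8]` proposed here.

```rust
/// Returns true if the byte literals from the examples have the stated values.
/// (Hypothetical helper, written against current Rust syntax.)
pub fn byte_literal_examples_hold() -> bool {
    // A byte literal is a plain `u8`.
    let a: u8 = b'A';
    let tab: u8 = b'\t';
    // A `\x` escape denotes one byte, so values above 0x7F are allowed.
    let high: u8 = b'\xFF';

    // A byte string literal borrows static bytes; the fixed-size array
    // reference coerces to a slice here.
    let s: &'static [u8] = b"A\t\xFF";

    a == 65 && tab == 9 && high == 0xFF && s == &[65u8, 9, 0xFF][..]
}
```

Calling `byte_literal_examples_hold()` should return `true` if the literals behave as described in the list above.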
Assuming `buffer` is of type `&[u8]`:

```rust
match buffer[i] {
    b'a' .. b'z' => { /* ... */ }
    c => { /* ... */ }
}
```

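As a runnable variant of the `match` above, here is a sketch under the assumption of a current Rust compiler, where inclusive range patterns are written `..=` instead of the `..` used in this RFC. The function `classify` and its return strings are hypothetical and only illustrate byte literals in pattern position.

```rust
/// Classifies one byte of a buffer using byte literals in patterns.
/// (Hypothetical helper; panics if `i` is out of bounds, like `buffer[i]`.)
pub fn classify(buffer: &[u8], i: usize) -> &'static str {
    match buffer[i] {
        // Inclusive range pattern over two byte literals.
        b'a'..=b'z' => "lowercase ASCII letter",
        b'0'..=b'9' => "ASCII digit",
        // The binding keeps its `u8` type, so non-ASCII bytes need no cast.
        _c => "other byte",
    }
}
```

For instance, `classify(b"a9\xFF", 2)` takes the last arm, because `0xFF` falls outside both ranges.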
# Alternatives

Status quo: patterns must use numeric literals for ASCII values,
or (for a single byte, not a byte string) cast to `char`:

```rust
match buffer[i] {
    c @ 0x61 .. 0x7A => { /* ... */ }
    c => { /* ... */ }
}
match buffer[i] as char {
    // `c` is of the wrong type!
    c @ 'a' .. 'z' => { /* ... */ }
    c => { /* ... */ }
}
```

Another option is to change the syntax so that macros such as
[`bytes!()`](http://static.rust-lang.org/doc/master/std/macros/builtin/macro.bytes.html)
can be used in patterns, and add a `byte!()` macro:

```rust
match buffer[i] {
    c @ byte!('a') .. byte!('z') => { /* ... */ }
    c => { /* ... */ }
}
```

This RFC was written to align the syntax with Python,
but there could be many variations, such as using a different prefix (maybe `a` for ASCII),
or using a suffix instead (maybe `u8`, as in integer literals).

The code points in a literal's source text could be encoded as UTF-8
rather than being mapped to bytes of the same value,
but assuming UTF-8 is not always appropriate when working with bytes.

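To illustrate why the two mappings differ, the sketch below compares the UTF-8 encoding of U+00E9 with the single byte that a `\xE9` escape denotes. It assumes current Rust syntax, and the function name `utf8_versus_raw_byte` is hypothetical.

```rust
/// Returns the UTF-8 encoding of U+00E9 and the byte denoted by `\xE9`.
/// (Hypothetical helper, written against current Rust syntax.)
pub fn utf8_versus_raw_byte() -> (Vec<u8>, Vec<u8>) {
    // Encoding the code point U+00E9 as UTF-8 yields two bytes: 0xC3 0xA9.
    let utf8 = "\u{E9}".as_bytes().to_vec();
    // In a byte string, `\xE9` is exactly one byte with that value.
    let raw = b"\xE9".to_vec();
    (utf8, raw)
}
```

A protocol that expects Latin-1 would want the one-byte form, which is why a UTF-8 mapping would not fit every use of byte literals.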
See also previous discussion in [#4334](https://github.com/mozilla/rust/issues/4334).

# Unresolved questions

Should there be "raw byte string" literals?
E.g. `pdf_file.write(rb"<< /Title (FizzBuzz \(Part one\)) >>")`

Should control characters (U+0000 to U+001F) be disallowed in syntax?
This should be consistent across all kinds of literals.

Should the `bytes!()` macro be removed in favor of this?
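
On the first question, a rough sketch of what a raw byte string looks like may help. It assumes current Rust, which spells the prefix `br` rather than the `rb` written above. The function name `pdf_title_dict` is hypothetical.

```rust
/// Returns the PDF dictionary from the raw byte string example as bytes.
/// (Hypothetical helper, written against current Rust syntax.)
pub fn pdf_title_dict() -> &'static [u8] {
    // Raw byte string: backslashes are kept literally, so `\(` is two bytes.
    br"<< /Title (FizzBuzz \(Part one\)) >>"
}
```

Without the raw form, each backslash would have to be doubled inside an ordinary byte string.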
