Skip to content

Commit c3b534e

Browse files
committed
Change bare key characters to Letter and Digit
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO. I don't think there are ANY perfect solutions here and that *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and after looking at a bunch of different options I believe this is by far the best path for TOML. Advantages: - This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts. This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later. Being conservative for these type of things is is good! - This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them and the specification even strongly discourages people from using them, which is why this gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1] - It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar doesn't". This shows up in a number of things aside from emojis: a.toml: Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation) Error: line 1: expected '.' or '=', but got ';' instead b.toml: Input: · = 42 # # U+0387 GREEK ANO TELEIA (Other_Punctuation) Error: (none) c.toml: Input: – = 42 # U+2013 EN DASH (Dash_Punctuation) Error: line 1: expected '.' or '=', but got '–' instead d.toml: Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol) Error: (none) e.toml: Input: #x = "commented ... or is it?" # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) Error: (none) "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all. People don't read specifications, nor should they. People try something and sees if it works. Now it seems to work on first approximation, and then (possibly months later) it seems to "break". It should either allow everything or nothing. This in-between is just horrible. From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't". In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrary while allowing other very similar characters. - This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more: '#' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) '"' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation) '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation) '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol) '﹐' U+FE50 SMALL COMMA (Other_Punctuation) '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation) '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol) '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation) '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation) 'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter) '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol) '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation) '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation) Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe. - Maps to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice. That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are: - The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present. However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the code adding multibyte support in the first case will probably be harder, with this range table being a minor part). - There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice. I choose Unicode 9 as everyone supports this; I doubted a long time over it, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things. - ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look like, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational". I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running in to this limitation, and is really something that IETF should address in a new RFC or something "Extra Augmented BNF?" Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds. Fixes toml-lang#954 Fixes toml-lang#966 Fixes toml-lang#979 Ref toml-lang#687 Ref toml-lang#891 Ref toml-lang#941 --- [1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like: [1950] Labour = [13_266_176, 315, 617] Conservative = [12_492_404, 298, 619] Liberal = [ 2_621_487, 9, 475] Sinn_Fein = [ 23_362, 0, 2] That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
1 parent 23c3fb7 commit c3b534e

File tree

2 files changed

+16
-21
lines changed

2 files changed

+16
-21
lines changed

toml.abnf

Lines changed: 7 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -52,16 +52,13 @@ simple-key = quoted-key / unquoted-key
5252

5353
;; Unquoted key
5454

55-
unquoted-key = 1*unquoted-key-char
56-
unquoted-key-char = ALPHA / DIGIT / %x2D / %x5F ; a-z A-Z 0-9 - _
57-
unquoted-key-char =/ %xB2 / %xB3 / %xB9 / %xBC-BE ; superscript digits, fractions
58-
unquoted-key-char =/ %xC0-D6 / %xD8-F6 / %xF8-37D ; non-symbol chars in Latin block
59-
unquoted-key-char =/ %x37F-1FFF ; exclude GREEK QUESTION MARK, which is basically a semi-colon
60-
unquoted-key-char =/ %x200C-200D / %x203F-2040 ; from General Punctuation Block, include the two tie symbols and ZWNJ, ZWJ
61-
unquoted-key-char =/ %x2070-218F / %x2460-24FF ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
62-
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces
63-
unquoted-key-char =/ %xF900-FDCF / %xFDF0-FFFD ; skip D800-DFFF surrogate block, E000-F8FF Private Use area, FDD0-FDEF intended for process-internal use (unicode)
64-
unquoted-key-char =/ %x10000-EFFFF ; all chars outside BMP range, excluding Private Use planes (F0000-10FFFF)
55+
unquoted-key = 1*( ALPHA / DIGIT / %x2D / %x5F ) ; A-Z / a-z / 0-9 / - / _
56+
57+
; These cannot be easily expressed in ABNF.
58+
; unquoted-key =/ unicode-letter
59+
; unquoted-key =/ unicode-digit
60+
; unicode-letter = Lu / Ll / Lt / Lm / Lo ; Unicode categories
61+
; unicode-digit = Nd ; Unicode categories
6562

6663
;; Quoted and dotted key
6764

toml.md

Lines changed: 9 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -103,30 +103,28 @@ first = "Tom" last = "Preston-Werner" # INVALID
103103

104104
A key may be either bare, quoted, or dotted.
105105

106-
**Bare keys** may contain any letter-like or number-like Unicode character from
107-
any Unicode script, as well as ASCII digits, dashes and underscores.
108-
Punctuation, spaces, arrows, box drawing and private use characters are not
109-
allowed. Note that bare keys are allowed to be composed of only ASCII digits,
110-
e.g. 1234, but are always interpreted as strings.
106+
**Bare keys** may only contain letters, digits, underscores, and dashes. Bare
107+
keys are allowed to be composed of only digits, e.g. `1234`, but are always
108+
interpreted as strings.
111109

112-
ℹ️ The exact ranges of allowed code points can be found in the
113-
[ABNF grammar file][abnf].
110+
A "letter" is any character in the Unicode category `Lu`, `Ll`, `Lt`, `Lm`, or
111+
`Lo`. A "digit" is any character in the category `Nd`. Implementations must
112+
support at least Unicode 9.0, or optionally any later version.
114113

115114
```toml
116115
key = "value"
117116
bare_key = "value"
118117
bare-key = "value"
119118
1234 = "value"
120119
Fuß = "value"
121-
😂 = "value"
122120
汉语大字典 = "value"
123121
辭源 = "value"
124122
பெண்டிரேம் = "value"
125123
```
126124

127-
**Quoted keys** follow the exact same rules as either basic strings or literal
128-
strings and allow you to use any Unicode character in a key name, including
129-
spaces. Best practice is to use bare keys except when absolutely necessary.
125+
**Quoted keys** follow the same rules as either basic strings or literal
126+
strings, and allow you to use any character in a key name including spaces. Best
127+
practice is to use bare keys except when absolutely necessary.
130128

131129
```toml
132130
"127.0.0.1" = "value"

0 commit comments

Comments
 (0)