I believe this would greatly improve things and solve most of the
issues. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.
I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.
Advantages:
- This is what I would consider the "minimal set" of characters we need
to add for reasonable international support, meaning we can't really
make a mistake with this by accidentally allowing too much.
We can add new ranges in TOML 1.2 (or even change the entire approach,
although I'd be very surprised if we need to), based on actual
real-world feedback, but any approach we take will need to
include letters and digits from all scripts.
This is a strong argument in favour of this and a huge improvement: we
can't really do anything wrong here in a way that we can't correct
later. Being conservative with these kinds of things is good!
- This solves the normalisation issues, since combining characters are
no longer allowed in bare keys, so it becomes a moot point.
For quoted keys normalisation is mostly a non-issue because few people
use them and the specification even strongly discourages people from
using them, which is why this has gone largely unnoticed and undiscussed
before the "Unicode in bare keys" PR was merged.[1]
- It's consistent in what we allow: no "this character is allowed, but
this very similar other thing isn't, what gives?!"
Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
"this character works fine, but this very similar one doesn't". This
shows up in a number of things aside from emojis:
a.toml:
Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation)
Error: line 1: expected '.' or '=', but got ';' instead
b.toml:
Input: · = 42 # U+0387 GREEK ANO TELEIA (Other_Punctuation)
Error: (none)
c.toml:
Input: – = 42 # U+2013 EN DASH (Dash_Punctuation)
Error: line 1: expected '.' or '=', but got '–' instead
d.toml:
Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol)
Error: (none)
e.toml:
Input: ＃x = "commented ... or is it?" # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
Error: (none)
"Some punctuation is allowed but some isn't" is hard to explain, and
also not what the specification says: "Punctuation, spaces, arrows,
box drawing and private use characters are not allowed." In reality, a
lot of punctuation IS allowed, but not all.
People don't read specifications, nor should they. People try
something and see if it works. Right now it seems to work at first,
and then (possibly months later) it seems to "break".
It should either allow everything or nothing. This in-between is just
horrible. From the user's perspective this seems like a bug in the
TOML parser, but it's not: it's a bug in the specification.
There is no good way to communicate this other than "these codepoints,
which cover most of what you'd write in a sentence, except when it
doesn't".
In contrast, "we allow letters and digits" is simple to spec, simple
to communicate, and should have minimal potential for confusion. The
current spec disallows some things seemingly almost arbitrarily while
allowing other very similar characters.
- This avoids a long list of confusable special TOML characters; some
were mentioned above but there are many more:
'＃' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
'＂' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
'﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
'﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
'﹐' U+FE50 SMALL COMMA (Other_Punctuation)
'︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
'˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
'՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
'܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
'₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
'⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
'࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)
Is this a big problem? I guess it depends; I can certainly imagine an
Armenian speaker accidentally leaving an Armenian apostrophe.
- Maps to identifiers in more (though not all) languages. We discussed
whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
while views differ (mostly because they're both) it seems to me that
making it map *closer* is better. This is a minor issue, but it's
nice.
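
To illustrate the consistency point, here is a minimal Python sketch
(not part of the proposal) that looks up the general category of the
characters from the a.toml to e.toml examples. The is_letter_or_digit
helper is hypothetical and assumes "letters and digits" means the L* and
Nd categories; the exact set is whatever the specification lists, and
Python's unicodedata follows the Unicode version the interpreter ships
with, not necessarily Unicode 9.

```python
import unicodedata

# Code points from the a.toml .. e.toml examples above.
EXAMPLES = [0x037E, 0x0387, 0x2013, 0x207B, 0xFF03]


def is_letter_or_digit(ch: str) -> bool:
    """Assumed reading of "letters and digits": categories L* and Nd."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


def report(codepoints):
    """Return (label, general category, allowed?) for each code point."""
    return [
        (f"U+{cp:04X}", unicodedata.category(chr(cp)), is_letter_or_digit(chr(cp)))
        for cp in codepoints
    ]


for label, category, allowed in report(EXAMPLES):
    print(label, category, allowed)

# A combining mark is neither a letter nor a digit under this assumption,
# so it would not be allowed in a bare key either.
print(unicodedata.category("\u0301"), is_letter_or_digit("\u0301"))
```

Under that assumption all five examples are rejected for the same
reason, rather than some of them passing and others failing.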
That does not mean it's perfect; as I mentioned, all solutions come with
a trade-off. The ones made here are:
- The biggest issue by far is that the check to see if a character is
valid may become more complex for some languages and environments that
can't rely on a Unicode database being present.
However, implementing this check is trivial logic-wise: it just needs
to loop over every character and check if it's in a range table.
The downside is it needs a fairly large "allowed characters"
table with 716 start/stop ranges, which is not ideal, but entirely
doable and easily auto-generated. It's ~164 lines hard-wrapped at
column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
lines, so that seems within the limits of reason (in fact, from reading
through the code, adding multibyte support in the first place will
probably be harder, with this range table being a minor part).
- There's a new Unicode version roughly every year or so, and the way
it's written now means it's "locked" to Unicode 9 or, optionally, a
later version. This is probably fine: Apple's APFS filesystem (which
does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
Go is Unicode 8.0, etc. I don't think this is really much of an issue
in practice.
I chose Unicode 9 as everyone supports it; I went back and forth on
this for a long time, and we could also use a more recent version. I
feel this gives us a nice balance between reasonable interoperability
and future-proofing.
- ABNF doesn't support Unicode. This is a tooling issue, and in my
opinion the tooling should adjust to how we want TOML to look,
rather than adjusting TOML to what tooling supports. AFAIK no one uses
the ABNF directly in code, and it's merely "informational".
I'm not happy with this, but personally I think this should be a
non-issue when considering what to do here. We're not the only people
running into this limitation, and it is really something that the IETF
should address in a new RFC or something ("Extra Augmented BNF"?).
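
The range-table check from the first trade-off can be sketched roughly
as follows. This is only an illustration in Python, not the proposed
implementation: build_ranges, in_table and is_bare_key are hypothetical
names, the table is generated from whatever Unicode version Python's
unicodedata ships with (so the number of ranges will not necessarily be
716), and it assumes "letters and digits" means categories L* and Nd
plus the existing '-' and '_'.

```python
import bisect
import sys
import unicodedata


def build_ranges():
    """Collapse all assumed-valid code points into sorted (start, stop) pairs."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        ok = cat.startswith("L") or cat == "Nd"
        if ok and start is None:
            start = cp
        elif not ok and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


# A parser without a Unicode database would embed this table as a literal.
RANGES = build_ranges()
STARTS = [start for start, _ in RANGES]


def in_table(ch: str) -> bool:
    """Binary search for the range that could contain ch."""
    i = bisect.bisect_right(STARTS, ord(ch)) - 1
    return i >= 0 and ord(ch) <= RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """Loop over every character and check it against the range table."""
    return key != "" and all(c in "-_" or in_table(c) for c in key)


print("Unicode version:", unicodedata.unidata_version)
print("ranges:", len(RANGES))
print(is_bare_key("Sinn_F\u00e9in"))    # precomposed e with acute
print(is_bare_key("Sinn_Fe\u0301in"))   # 'e' followed by a combining acute
```

Building the table scans every code point once, which is fine for a
generator script; the lookup itself is just the binary search.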
Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.
Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941
---
[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:
[1950]
Labour = [13_266_176, 315, 617]
Conservative = [12_492_404, 298, 619]
Liberal = [ 2_621_487, 9, 475]
Sinn_Fein = [ 23_362, 0, 2]
That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
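
For what it's worth, the normalisation concern with a quoted key like
this one can be shown with a small Python sketch. It only uses the
standard unicodedata module; the table dict is a stand-in for a parsed
TOML table, not the output of any particular parser.

```python
import unicodedata

# The same visible key in composed (NFC) and decomposed (NFD) form.
key_nfc = unicodedata.normalize("NFC", "Sinn_F\u00e9in")
key_nfd = unicodedata.normalize("NFD", key_nfc)

# Stand-in for a parsed table whose key was written in NFC form.
table = {key_nfc: [23_362, 0, 2]}

print(len(key_nfc), len(key_nfd))  # 9 10
print(key_nfc == key_nfd)          # False
print(key_nfd in table)            # False: lookup with the NFD form misses
```

Without combining characters in bare keys this situation can only come
up with quoted keys.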