text/0000-non-ascii-idents.md (17 additions & 1 deletion)
@@ -142,13 +142,28 @@ These are the three different naming conventions and how their corresponding lin
Note: Scripts with upper- and lowercase variants ("bicameral scripts") behave similarly to ASCII. Scripts without this distinction ("unicameral scripts") are also usable, but all identifiers look the same regardless of whether they refer to a type, variable, or constant. Underscores can be used to separate words in unicameral scripts, even in UpperCamelCase contexts.
+
## Mixed script confusables lint
+
+
We keep track of the script groups in use in a document using the comparison heuristics in [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 5.2 Restriction-Level Detection][TR39RestrictionLevel].
+
+
We identify lists of code points which are `Allowed` by [UTS 39 section 3.1][TR39Allowed] (i.e., code points not already linted by `less_used_codepoints`) and are "exact" confusables of code points from other `Allowed` scripts. This includes pairs like Cyrillic `о` (confusable with Latin `o`), but not ones like Hebrew `ס`, which is somewhat distinguishable from Latin `o`. This list of exact confusables can be modified in the future.
+
+
We expect most of these to be between Cyrillic-Latin-Greek and some in Ethiopic-Armenian, but a proper review can be done before stabilization. There are also confusable modifiers between many scripts.
+
+
In a code base, if the _only_ code points from a given script group (aside from `Latin`, `Common`, and `Inherited`) are such exact confusables, report it with the `mixed_script_confusables` lint (the lint name can be finalized later).
+
+
As an implementation note, it may be worth dealing with confusable modifiers via a separate lint check that fires if a modifier is from a different (non-`Common`/`Inherited`) script group than the character preceding it. This has some behavioral differences but should not increase the chance of false positives.
+
+
The exception for `Latin` is made because the standard library is Latin-script. It could potentially be removed, since a code base using the standard library (or any Latin-using library) is likely to use enough of it that there will be non-confusable characters in use. (This is listed under unresolved questions.)
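
As a rough illustration of the check described in this section, the following minimal sketch flags script groups whose only code points in a code base are exact confusables. It is not the proposed implementation: `Script`, `script_of`, and `is_exact_confusable` are hypothetical names, the script ranges are coarse block approximations rather than real Unicode script data, and the confusables list is a tiny hand-picked excerpt.

```rust
use std::collections::BTreeMap;

/// Coarse script groups for illustration only (assumption: a real
/// implementation would use proper Unicode script property data).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Script {
    Latin,
    Cyrillic,
    Greek,
    Common, // stands in for both `Common` and `Inherited`
    Other,
}

/// Hypothetical, deliberately incomplete lookup based on block ranges.
fn script_of(c: char) -> Script {
    match c {
        'a'..='z' | 'A'..='Z' => Script::Latin,
        '\u{0370}'..='\u{03FF}' => Script::Greek,
        '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
        '0'..='9' | '_' => Script::Common,
        _ => Script::Other,
    }
}

/// Hypothetical excerpt of the "exact confusables" list.
fn is_exact_confusable(c: char) -> bool {
    matches!(
        c,
        '\u{0430}'   // Cyrillic a
        | '\u{0435}' // Cyrillic e
        | '\u{043E}' // Cyrillic o
        | '\u{0440}' // Cyrillic p
        | '\u{0441}' // Cyrillic c
        | '\u{0445}' // Cyrillic x
        | '\u{03BF}' // Greek omicron
    )
}

/// Returns the script groups whose only used code points are exact
/// confusables. `Latin` and `Common` are exempt, as in the text above.
fn mixed_script_confusables<'a, I>(identifiers: I) -> Vec<Script>
where
    I: IntoIterator<Item = &'a str>,
{
    // Maps each script group to "has at least one non-confusable code point".
    let mut has_non_confusable: BTreeMap<Script, bool> = BTreeMap::new();
    for ident in identifiers {
        for c in ident.chars() {
            let script = script_of(c);
            if matches!(script, Script::Latin | Script::Common) {
                continue;
            }
            let seen = has_non_confusable.entry(script).or_insert(false);
            *seen |= !is_exact_confusable(c);
        }
    }
    has_non_confusable
        .into_iter()
        .filter(|&(_, ok)| !ok)
        .map(|(script, _)| script)
        .collect()
}
```

Under these assumptions, a code base whose only Cyrillic identifier is `арр` would be flagged, while one that also contains `привет` would not, since that identifier uses Cyrillic code points outside the confusables list.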
+
+
## Reusability
The code used for implementing the various lints and checks will be released to crates.io. This includes:
- Testing validity of an identifier
- Testing for `less_used_codepoints` ([UTS #39 Section 3.1][TR39Allowed])
-
- Script identification and comparison for `mixed_script_detection` ([UTS #39 Section 5.2][TR39RestrictionLevel])
+
- Script identification and comparison for `mixed_script_confusables` ([UTS #39 Section 5.2][TR39RestrictionLevel])
- `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable])
Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code, and it's compared with user-supplied strings. For example, we have crates that use proc macros to expose command line options or REST endpoints. Crates that do things like these can use such algorithms to ensure better error handling; for example, if we accidentally end up having an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass through, based on requirements).
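
To make the endpoint example concrete, here is a minimal sketch of how a routing crate might look for a registered route that is confusable with the requested one. It is only an approximation of `skeleton(X)`: the real algorithm in UTS #39 also applies NFD normalization and uses the full confusables data, whereas `prototype`, `toy_skeleton`, and `confusable_route` are hypothetical helpers with a small hand-picked mapping table.

```rust
/// Hypothetical excerpt of confusable-to-prototype mappings. A real
/// implementation would take these from the UTS #39 confusables data.
fn prototype(c: char) -> char {
    match c {
        '\u{0430}' => 'a',              // Cyrillic a
        '\u{0435}' => 'e',              // Cyrillic e
        '\u{043E}' | '\u{03BF}' => 'o', // Cyrillic o, Greek omicron
        '\u{0440}' => 'p',              // Cyrillic p
        '\u{0441}' => 'c',              // Cyrillic c
        '\u{0445}' => 'x',              // Cyrillic x
        other => other,
    }
}

/// Simplified stand-in for `skeleton(X)`: maps each code point to its
/// prototype. Assumption: normalization steps are omitted for brevity.
fn toy_skeleton(s: &str) -> String {
    s.chars().map(prototype).collect()
}

/// Finds a registered route that differs from the requested path but has
/// the same (toy) skeleton, so the caller can report a helpful error.
fn confusable_route<'a>(requested: &str, routes: &[&'a str]) -> Option<&'a str> {
    let wanted = toy_skeleton(requested);
    routes
        .iter()
        .copied()
        .find(|route| *route != requested && toy_skeleton(route) == wanted)
}
```

With this sketch, a request for `/app` (Latin) against a route table containing only `/арр` (Cyrillic) would return the Cyrillic route as a likely intended match, which the crate could then mention in its error message or use for pass-through.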
@@ -262,6 +277,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\
* How badly do non-ASCII idents exacerbate const pattern confusion
(rust-lang/rust#7526, rust-lang/rust#49680)?
Can we improve precision of linting here?
+
* In `mixed_script_confusables`, do we actually need to make an exception for `Latin` identifiers?