Commit 9bf90df

Add new mixed_script_confusables lint
1 parent 70297a9 commit 9bf90df

1 file changed

+17
-1
lines changed

text/0000-non-ascii-idents.md

Lines changed: 17 additions & 1 deletion
@@ -142,13 +142,28 @@ These are the three different naming conventions and how their corresponding lin
142142

143143
Note: Scripts with upper- and lowercase variants ("bicameral scripts") behave similarly to ASCII. Scripts without this distinction ("unicameral scripts") are also usable, but all identifiers look the same regardless of whether they refer to a type, variable or constant. Underscores can be used to separate words in unicameral scripts even in UpperCamelCase contexts.
144144

145+
## Mixed script confusables lint
146+
147+
We keep track of the script groups in use in a document using the comparison heuristics in [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 5.2 Restriction-Level Detection][TR39RestrictionLevel].
148+
149+
We identify lists of code points which are `Allowed` by [UTS 39 section 3.1][TR39Allowed] (i.e., code points not already linted by `less_used_codepoints`) and are "exact" confusables of code points from other `Allowed` scripts. This includes characters such as Cyrillic `о` (confusable with Latin `o`), but not characters such as Hebrew `ס`, which is somewhat distinguishable from Latin `o`. This list of exact confusables can be modified in the future.
150+
151+
We expect most of these to be between Cyrillic-Latin-Greek and some in Ethiopic-Armenian, but a proper review can be done before stabilization. There are also confusable modifiers between many scripts.
152+
153+
In a code base, if the _only_ code points from a given script group (aside from `Latin`, `Common`, and `Inherited`) are such exact confusables, lint about it with `mixed_script_confusables` (lint name can be finalized later).
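
The rule above can be sketched roughly as follows. This is a minimal illustration, not the RFC's implementation: the `Script` enum, `script_of`, `is_exact_confusable`, and the function name `mixed_script_confusables` are all hypothetical, the script lookup uses hard-coded ranges instead of real Unicode script data, the confusable table holds only a handful of entries, and `Inherited` is not modeled at all.

```rust
use std::collections::BTreeMap;

/// Toy script groups (assumption: a real lint would use Unicode script data).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Script {
    Latin,
    Cyrillic,
    Greek,
    Common,
    Other,
}

/// Hypothetical, heavily simplified script lookup by code point range.
fn script_of(c: char) -> Script {
    match c {
        'a'..='z' | 'A'..='Z' => Script::Latin,
        '\u{0370}'..='\u{03FF}' => Script::Greek,
        '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
        '0'..='9' | '_' => Script::Common,
        _ => Script::Other,
    }
}

/// Tiny illustrative table of "exact" confusables with Latin letters.
fn is_exact_confusable(c: char) -> bool {
    matches!(
        c,
        '\u{0430}'   // Cyrillic а, looks like Latin a
        | '\u{0435}' // Cyrillic е, looks like Latin e
        | '\u{043E}' // Cyrillic о, looks like Latin o
        | '\u{0440}' // Cyrillic р, looks like Latin p
        | '\u{0441}' // Cyrillic с, looks like Latin c
        | '\u{0445}' // Cyrillic х, looks like Latin x
        | '\u{03BF}' // Greek ο, looks like Latin o
    )
}

/// Returns the script groups whose only code points in the whole code base
/// are exact confusables. `Latin` and `Common` are exempt.
fn mixed_script_confusables(identifiers: &[&str]) -> Vec<Script> {
    // Per script group: has a non-confusable code point been seen yet?
    let mut has_non_confusable: BTreeMap<Script, bool> = BTreeMap::new();
    for c in identifiers.iter().flat_map(|id| id.chars()) {
        let script = script_of(c);
        if matches!(script, Script::Latin | Script::Common) {
            continue;
        }
        let seen = has_non_confusable.entry(script).or_insert(false);
        if !is_exact_confusable(c) {
            *seen = true;
        }
    }
    has_non_confusable
        .into_iter()
        .filter(|&(_, seen)| !seen)
        .map(|(script, _)| script)
        .collect()
}
```

With these toy tables, a code base containing only `арр` (Cyrillic) and `main` would report `Cyrillic`, while one that also contains `привет` would not, because `п` has no exact Latin look-alike in the table.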
154+
155+
As an implementation note, it may be worth dealing with confusable modifiers via a separate lint check -- if a modifier is from a different (non-`Common`/`Inherited`) script group from the thing preceding it. This has some behavioral differences but should not increase the chance of false positives.
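
One possible shape for such a separate modifier check is sketched below. Everything here is an assumption made for illustration: the `Script` enum, the range-based `script_of`, the `is_modifier` predicate (which only knows a few combining marks), and the function name `mismatched_modifiers`. "The thing preceding it" is interpreted as the most recent non-modifier character.

```rust
/// Toy script groups (assumption: real code would use Unicode script data).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Cyrillic,
    Common,
    Inherited,
    Other,
}

/// Hypothetical, heavily simplified script lookup by code point range.
fn script_of(c: char) -> Script {
    match c {
        'a'..='z' | 'A'..='Z' => Script::Latin,
        '\u{0300}'..='\u{036F}' => Script::Inherited, // combining diacritical marks
        '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
        '0'..='9' | '_' => Script::Common,
        _ => Script::Other,
    }
}

/// Hypothetical modifier test covering only a few combining marks.
fn is_modifier(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}' // generic combining diacritics
        | '\u{0483}' | '\u{0484}' | '\u{0487}' // combining marks in the Cyrillic block
    )
}

/// Returns (byte offset, char) for every modifier whose script group is
/// neither `Common` nor `Inherited` and differs from the script group of
/// the most recent non-modifier character.
fn mismatched_modifiers(ident: &str) -> Vec<(usize, char)> {
    let mut found = Vec::new();
    let mut base_script: Option<Script> = None;
    for (offset, c) in ident.char_indices() {
        let script = script_of(c);
        if is_modifier(c) {
            let neutral = matches!(script, Script::Common | Script::Inherited);
            if let Some(base) = base_script {
                if !neutral && script != base {
                    found.push((offset, c));
                }
            }
        } else {
            // Any non-modifier character becomes the new base.
            base_script = Some(script);
        }
    }
    found
}
```

In this sketch a Latin `a` followed by U+0483 is reported, while `a` followed by U+0301 (an `Inherited` mark) or Cyrillic `а` followed by U+0483 is not.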
156+
157+
The exception for `Latin` is made because the standard library is Latin-script. It could potentially be removed, since a code base using the standard library (or any Latin-using library) is likely to be using enough of it that there will be non-confusable characters in use. (This is listed under unresolved questions.)
158+
159+
145160
## Reusability
146161

147162
The code used for implementing the various lints and checks will be released to crates.io. This includes:
148163

149164
- Testing validity of an identifier
150165
- Testing for `less_used_codepoints` ([UTS #39 Section 3.1][TR39Allowed])
151-
- Script identification and comparison for `mixed_script_detection` ([UTS #39 Section 5.2][TR39RestrictionLevel])
166+
- Script identification and comparison for `mixed_script_confusables` ([UTS #39 Section 5.2][TR39RestrictionLevel])
152167
- `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable])
153168

154169
Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code, and it's compared with user-supplied strings. For example, some crates use proc macros to expose command line options or REST endpoints. Crates that do things like these can use such algorithms to ensure better error handling; for example, if we accidentally end up having an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass through, based on requirements).
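
The endpoint scenario could look roughly like the sketch below. It is only an illustration under stated assumptions: `toy_skeleton` is a hypothetical stand-in for the UTS #39 `skeleton(X)` transform (the real algorithm also applies NFD normalization and uses the full confusables data), and `confusable_endpoint` is an invented helper name, not an API of any existing crate.

```rust
/// Toy stand-in for `skeleton(X)`: maps a few Cyrillic letters to their
/// Latin look-alikes and leaves every other character unchanged.
fn toy_skeleton(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{0430}' => 'a', // Cyrillic а
            '\u{0435}' => 'e', // Cyrillic е
            '\u{043E}' => 'o', // Cyrillic о
            '\u{0440}' => 'p', // Cyrillic р
            '\u{0441}' => 'c', // Cyrillic с
            '\u{0445}' => 'x', // Cyrillic х
            other => other,
        })
        .collect()
}

/// Finds a registered endpoint that is not identical to the requested path
/// but has the same skeleton, i.e. a likely look-alike.
fn confusable_endpoint<'a>(requested: &str, registered: &[&'a str]) -> Option<&'a str> {
    let wanted = toy_skeleton(requested);
    registered
        .iter()
        .copied()
        .find(|&endpoint| endpoint != requested && toy_skeleton(endpoint) == wanted)
}
```

A router could call `confusable_endpoint("/app", &endpoints)` after a failed exact lookup and, on `Some("/арр")`, either report the near miss in the error message or dispatch to the look-alike, depending on requirements.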
@@ -262,6 +277,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\
262277
* How badly do non-ASCII idents exacerbate const pattern confusion
263278
(rust-lang/rust#7526, rust-lang/rust#49680)?
264279
Can we improve precision of linting here?
280+
* In `mixed_script_confusables`, do we actually need to make an exception for `Latin` identifiers?
265281

266282

267283
[PEP 3131]: https://www.python.org/dev/peps/pep-3131/
