Commit 40d53f5

Global mixed script confusables lint
1 parent 9356fc1 commit 40d53f5

1 file changed: +25 -1 lines changed

text/0000-non-ascii-idents.md

Lines changed: 25 additions & 1 deletion
@@ -117,6 +117,8 @@ The lint is triggered by identifiers that contain a codepoint that is not part o
Note: New Unicode versions update the set of allowed codepoints. Additionally, the compiler authors may decide to allow more codepoints or warn about those that have been found to cause confusion.

For reference, a list of all the code points allowed by this lint can be found [here][unicode-set-allowed], with the script group mentioned on the right.

## Mixed script detection

A new `mixed_script_idents` lint is added to the compiler. The default setting is to `warn`.
@@ -125,6 +127,23 @@ The lint is triggered by identifiers that do not qualify for the "Moderately Res
Note: The definition of "Moderately Restrictive" can be changed by future versions of the Unicode standard to reflect changes in the natural languages used or for other reasons.

## Global mixed script detection with confusables
As an additional measure, we try to detect cases where a codebase that primarily uses one script contains identifiers from a different script that are confusable with it.

During `mixed_script_idents` computation, keep track of how often identifiers from the various script groups crop up. An identifier is flagged if it is from a less-common script group (say, <1% of identifiers) _and_ it is entirely confusable with the majority script in use (e.g. the string `"арр"` or `"роре"` in Cyrillic).

A flagged identifier can trigger `confusable_idents`, `mixed_script_idents`, or a new lint.
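
The frequency bookkeeping described above could be sketched roughly as follows. This is an illustration only, not the compiler's implementation: the `Script` enum and the functions `script_of`, `ident_script`, and `rare_script_idents` are hypothetical names, and the code point ranges are a crude stand-in for Unicode's Script_Extensions data.

```rust
use std::collections::HashMap;

/// Coarse script groups (assumption: a real lint would use Unicode's
/// Script_Extensions property instead of hand-picked ranges).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Common,
    Latin,
    Greek,
    Cyrillic,
    Other,
}

/// Hypothetical helper: approximate script lookup by code point range.
fn script_of(c: char) -> Script {
    match c {
        '_' | '0'..='9' => Script::Common,
        'a'..='z' | 'A'..='Z' | '\u{00C0}'..='\u{024F}' => Script::Latin,
        '\u{0370}'..='\u{03FF}' => Script::Greek,
        '\u{0400}'..='\u{052F}' => Script::Cyrillic,
        _ => Script::Other,
    }
}

/// Returns the single script an identifier is written in, or `None` if it
/// mixes scripts or contains only script-neutral characters.
fn ident_script(ident: &str) -> Option<Script> {
    let mut found = None;
    for c in ident.chars() {
        let s = script_of(c);
        if s == Script::Common {
            continue;
        }
        match found {
            None => found = Some(s),
            Some(prev) if prev != s => return None,
            _ => {}
        }
    }
    found
}

/// Returns the identifiers whose script accounts for less than `threshold`
/// (e.g. 0.01 for 1%) of all single-script identifiers seen.
fn rare_script_idents<'a>(idents: &[&'a str], threshold: f64) -> Vec<&'a str> {
    let mut counts: HashMap<Script, usize> = HashMap::new();
    for ident in idents {
        if let Some(s) = ident_script(ident) {
            *counts.entry(s).or_insert(0) += 1;
        }
    }
    let total: usize = counts.values().sum();
    idents
        .iter()
        .copied()
        .filter(|ident| match ident_script(ident) {
            Some(s) => (counts[&s] as f64) < threshold * total as f64,
            None => false,
        })
        .collect()
}
```

Given 199 Latin identifiers and the single Cyrillic identifier `арр`, `rare_script_idents(&idents, 0.01)` returns only the Cyrillic one. Whether it is actually reported would still depend on the "entirely confusable" check.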

We identify sets of characters which are entirely confusable. For example, for Cyrillic-Latin, we have `а, е, о, р, с, у, х, ѕ, і, ј, ԛ, ԝ, ѐ, ё, ї, ӱ, ӧ, ӓ, ӕ, ӑ` amongst the lowercase letters (and more amongst the capitals). This list can likely be derived programmatically from the confusables data that Unicode already has. It may be worth filtering for exact confusables: Cyrillic, Greek, and Latin have a lot of confusables that are almost indistinguishable in most fonts, whereas `ھ` and `ס` are noticeably different-looking from `o` even though they're marked as confusables.

The main confusable script pairs we have to worry about are Cyrillic/Latin/Greek, Armenian/Ethiopic, and a couple of Armenian characters mapping to Greek/Latin. We can implement this lint conservatively at first by dealing with a blacklist of known confusables for these script pairs, and expand it if there is a need.

There are many confusables _within_ scripts. Arabic has a bunch of these, as does Han (both with other Han characters and with kana), but since these are within the same language group, they are outside the scope of this RFC. Such confusables are equivalent to `l` vs `I` being confusable in some fonts.

For reference, a list of all possible Rust identifier characters that do not trip `less_used_codepoints` but have confusables can be found [here][unicode-set-confusables], with their confusable skeleton and script group mentioned on the right. Note that in many cases the confusables are visually distinguishable, or are diacritic marks.

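
The conservative blacklist approach for the Cyrillic-Latin pair might look something like the sketch below. It is not derived from the Unicode confusables data: `CYRILLIC_TO_LATIN` is a hand-picked subset of the lowercase letters listed above, and `latin_lookalike` is a hypothetical helper name.

```rust
/// Hand-picked subset of lowercase Cyrillic letters and the Latin letters
/// they resemble (assumption: a real table would be generated from
/// Unicode's confusables data and filtered for exact lookalikes).
const CYRILLIC_TO_LATIN: &[(char, char)] = &[
    ('\u{0430}', 'a'), // а
    ('\u{0435}', 'e'), // е
    ('\u{043E}', 'o'), // о
    ('\u{0440}', 'p'), // р
    ('\u{0441}', 'c'), // с
    ('\u{0443}', 'y'), // у
    ('\u{0445}', 'x'), // х
    ('\u{0455}', 's'), // ѕ
    ('\u{0456}', 'i'), // і
    ('\u{0458}', 'j'), // ј
];

/// If every character of `ident` is a blacklisted Cyrillic lookalike,
/// returns the Latin string it could be mistaken for; otherwise `None`.
fn latin_lookalike(ident: &str) -> Option<String> {
    if ident.is_empty() {
        return None;
    }
    ident
        .chars()
        .map(|c| {
            CYRILLIC_TO_LATIN
                .iter()
                .find(|(cyr, _)| *cyr == c)
                .map(|(_, lat)| *lat)
        })
        .collect()
}
```

Under these assumptions, the Cyrillic strings `арр` and `роре` map to `app` and `pope`, while an identifier containing any Cyrillic letter outside the table (such as `п`) is not considered entirely confusable and yields `None`.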
## Adjustments to the "bad style" lints

Rust [RFC 0430] establishes naming conventions for Rust ASCII identifiers. The *rustc* compiler includes lints to promote these recommendations.
@@ -151,7 +170,7 @@ The code used for implementing the various lints and checks will be released to
- Script identification and comparison for `mixed_script_idents` ([UTS #39 Section 5.2][TR39RestrictionLevel])
- `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable])

Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code. For example, we have crates that use proc macros to expose command line options or REST endpoints. Crates that do things like these can use such algorithms to ensure better error handling; for example, if we accidentally end up having an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass through, based on requirements).
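
As a rough illustration of the endpoint scenario, a router could compare skeletons when an exact lookup fails. The sketch below is not the UTS #39 `skeleton(X)` algorithm: `toy_skeleton` and `confusable_route` are hypothetical helpers, and the mapping covers only a handful of Cyrillic letters.

```rust
/// Toy stand-in for `skeleton(X)`: maps a few Cyrillic lookalikes to Latin
/// and leaves every other character untouched (assumption: a real
/// implementation would follow UTS #39 and use the full confusables data).
fn toy_skeleton(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{0430}' => 'a', // а
            '\u{0435}' => 'e', // е
            '\u{043E}' => 'o', // о
            '\u{0440}' => 'p', // р
            '\u{0441}' => 'c', // с
            other => other,
        })
        .collect()
}

/// Returns a registered route that differs from `requested` but has the
/// same toy skeleton, i.e. a likely confusable.
fn confusable_route<'a>(requested: &str, routes: &[&'a str]) -> Option<&'a str> {
    let wanted = toy_skeleton(requested);
    routes
        .iter()
        .copied()
        .find(|route| *route != requested && toy_skeleton(route) == wanted)
}
```

With a Cyrillic `/арр` route registered, `confusable_route("/app", &routes)` returns that route, so the crate can report a helpful error or pass the request through, depending on its requirements.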

## Conformance Statement

@@ -165,6 +184,8 @@ The code used for implementing the various lints and checks will be released to
* UAX31-R3. Pattern_White_Space and Pattern_Syntax Characters: Rust only uses characters from these categories for whitespace and syntax. Other characters may or may not be allowed in identifiers.
* UAX31-R4. Equivalent Normalized Identifiers: All identifiers are normalized according to normalization form C before comparison.

# Drawbacks
[drawbacks]: #drawbacks

@@ -226,6 +247,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\
* How are non-ASCII idents best supported in debuggers?
* Which name mangling scheme is used by the compiler?
* Is there a better name for the `less_used_codepoints` lint?
* Which lint should the global mixed scripts confusables detection trigger?

[PEP 3131]: https://www.python.org/dev/peps/pep-3131/
[UAX31]: http://www.unicode.org/reports/tr31/
@@ -243,3 +265,5 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\
[RFC 0430]: http://rust-lang.github.io/rfcs/0430-finalizing-naming-conventions.html
[TR39Allowed]: https://www.unicode.org/reports/tr39/#General_Security_Profile
[TR39RestrictionLevel]: https://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[unicode-set-confusables]: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AIdentifier_Status%CE%B2%3DAllowed%3A%5D%26%5B%3AXID_Continue%3DYes%3A%5D%26%5B%3AConfMA%CE%B2%3A%5D%5D&g=&i=ConfMA%CE%B2%2CScript_Extensions
[unicode-set-allowed]: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AIdentifier_Status%CE%B2%3DAllowed%3A%5D%26%5B%3AXID_Continue%3DYes%3A%5D%5D&g=&i=Script_Extensions
