text/0000-non-ascii-idents.md
Lines changed: 25 additions & 1 deletion
@@ -117,6 +117,8 @@ The lint is triggered by identifiers that contain a codepoint that is not part o
117
117
118
118
Note: New Unicode versions update the set of allowed codepoints. Additionally, the compiler authors may decide to allow more codepoints or warn about those that have been found to cause confusion.
119
119
120
+
For reference, a list of all the code points allowed by this lint can be found [here][unicode-set-allowed], with the script group mentioned on the right.
121
+
120
122
## Mixed script detection
121
123
122
124
A new `mixed_script_idents` lint is added to the compiler. The default setting is to `warn`.
@@ -125,6 +127,23 @@ The lint is triggered by identifiers that do not qualify for the "Moderately Res
125
127
126
128
Note: The definition of "Moderately Restrictive" can be changed by future versions of the Unicode standard to reflect changes in the natural languages used or for other reasons.
127
129
130
+
## Global mixed script detection with confusables
131
+
132
+
As an additional measure, we try to detect cases where a codebase that primarily uses one script contains identifiers from a different script that are confusable with it.
133
+
134
+
During `mixed_script_idents` computation, keep track of how often identifiers from each script group occur. An identifier is flagged if it is from a less-common script group (say, <1% of identifiers) _and_ it is entirely confusable with the majority script in use (e.g. the Cyrillic strings `"арр"` or `"роре"` in a predominantly Latin codebase).
135
+
136
+
This can trigger `confusable_idents`, `mixed_script_idents`, or a new lint.
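The counting pass described above could be sketched roughly as follows. This is a minimal illustration, not the compiler's implementation: the `Script` enum, the block-range classifier `script_of_char`, the look-alike table `is_latin_lookalike`, and the function `suspicious_idents` are all hypothetical names, the classifier uses coarse code point ranges instead of real Unicode script data, and only the Cyrillic-in-a-Latin-codebase direction is handled.

```rust
use std::collections::HashMap;

/// Coarse script groups; a stand-in for real Unicode script data.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Script {
    Latin,
    Greek,
    Cyrillic,
    Other,
}

/// ASSUMPTION: classifies by code point range, not by the Unicode Script property.
/// Digits and `_` are treated as common to all scripts (`None`).
fn script_of_char(c: char) -> Option<Script> {
    match c {
        '0'..='9' | '_' => None,
        'a'..='z' | 'A'..='Z' => Some(Script::Latin),
        '\u{0370}'..='\u{03FF}' => Some(Script::Greek),
        '\u{0400}'..='\u{052F}' => Some(Script::Cyrillic),
        _ => Some(Script::Other),
    }
}

/// Returns the single script an identifier uses, or `None` if it mixes
/// scripts or contains only common characters.
pub fn script_of_ident(ident: &str) -> Option<Script> {
    let mut found = None;
    for s in ident.chars().filter_map(script_of_char) {
        match found {
            None => found = Some(s),
            Some(prev) if prev != s => return None,
            _ => {}
        }
    }
    found
}

/// Hypothetical subset of the Cyrillic lowercase letters listed in this RFC
/// as confusable with Latin: а е о р с у х ѕ і ј ԛ ԝ.
fn is_latin_lookalike(c: char) -> bool {
    matches!(
        c,
        '\u{0430}' | '\u{0435}' | '\u{043E}' | '\u{0440}' | '\u{0441}' | '\u{0443}'
            | '\u{0445}' | '\u{0455}' | '\u{0456}' | '\u{0458}' | '\u{051B}' | '\u{051D}'
    )
}

/// Flags Cyrillic identifiers that are entirely Latin look-alikes, but only
/// when Latin is the majority script and Cyrillic identifiers make up less
/// than `threshold` (e.g. 0.01) of all single-script identifiers.
pub fn suspicious_idents<'a>(idents: &[&'a str], threshold: f64) -> Vec<&'a str> {
    let mut counts: HashMap<Script, usize> = HashMap::new();
    for id in idents {
        if let Some(s) = script_of_ident(id) {
            *counts.entry(s).or_insert(0) += 1;
        }
    }
    let total: usize = counts.values().sum();
    let majority = counts.iter().max_by_key(|(_, n)| **n).map(|(s, _)| *s);
    let cyrillic = counts.get(&Script::Cyrillic).copied().unwrap_or(0);
    let cyrillic_is_rare = (cyrillic as f64) < threshold * (total as f64);
    if majority != Some(Script::Latin) || !cyrillic_is_rare {
        return Vec::new();
    }
    idents
        .iter()
        .copied()
        .filter(|id| {
            script_of_ident(id) == Some(Script::Cyrillic)
                && id
                    .chars()
                    .all(|c| is_latin_lookalike(c) || script_of_char(c).is_none())
        })
        .collect()
}
```

With 200 Latin identifiers plus the Cyrillic identifiers `арр` and `жук`, a threshold of `0.01` flags only `арр`, because `ж` and `к` are not in the look-alike table. A real lint would need proper script data and a table per script pair.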
137
+
138
+
We identify sets of characters which are entirely confusable: For example, for Cyrillic-Latin, we have `а, е, о, р, с, у, х, ѕ, і, ј, ԛ, ԝ, ѐ, ё, ї, ӱ, ӧ, ӓ, ӕ, ӑ` amongst the lowercase letters (and more amongst the capitals). This list can likely be derived programmatically from the confusables data that Unicode already has. It may be worth filtering for exact confusables. For example, Cyrillic, Greek, and Latin have a lot of confusables that are almost indistinguishable in most fonts, whereas `ھ` and `ס` look noticeably different from `o` even though they're marked as confusables.
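As a rough sketch of how such a list might be derived, the snippet below parses lines in the `source ; target ; type # comment` layout used by the Unicode confusables data and checks whether every character of an identifier maps to an ASCII target. The names `parse_confusables`, `ascii_lookalike`, and `SAMPLE` are hypothetical, and the sample lines are hand-written for illustration rather than copied from the Unicode file. The filtering for "exact" confusables mentioned above is not attempted here.

```rust
use std::collections::HashMap;

/// Hand-written sample lines in the `source ; target ; type # comment` layout.
pub const SAMPLE: &str = "\
# illustrative data only
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0435 ; 0065 ; MA # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
0440 ; 0070 ; MA # CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P
";

/// Parses one hexadecimal code point such as `0430`.
fn parse_hex_char(hex: &str) -> Option<char> {
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32)
}

/// Builds a map from each source character to its target string.
/// Comments, blank lines, and malformed lines are skipped.
pub fn parse_confusables(data: &str) -> HashMap<char, String> {
    let mut map = HashMap::new();
    for line in data.lines() {
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        let src = fields.next().and_then(parse_hex_char);
        // The target may consist of several space-separated code points.
        let dst: Option<String> = fields
            .next()
            .and_then(|f| f.split_whitespace().map(parse_hex_char).collect());
        if let (Some(src), Some(dst)) = (src, dst) {
            if !dst.is_empty() {
                map.insert(src, dst);
            }
        }
    }
    map
}

/// Returns the ASCII string an identifier could be mistaken for, provided
/// every one of its characters has an ASCII target in `map`.
pub fn ascii_lookalike(ident: &str, map: &HashMap<char, String>) -> Option<String> {
    let mut out = String::new();
    for c in ident.chars() {
        let target = map.get(&c)?;
        if !target.is_ascii() {
            return None;
        }
        out.push_str(target);
    }
    if out.is_empty() {
        None
    } else {
        Some(out)
    }
}
```

With the sample data, the Cyrillic identifier `арр` maps to `app` and `роре` maps to `pope`, while an identifier containing any unmapped character yields `None`.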
139
+
140
+
The main confusable script pairs we have to worry about are Cyrillic/Latin/Greek, Armenian/Ethiopic, and a couple of Armenian characters mapping to Greek/Latin. We can implement this lint conservatively at first by dealing with a blacklist of known confusables for these script pairs, and expand it if there is a need.
141
+
142
+
There are many confusables _within_ scripts -- Arabic has a bunch of these, as does Han (both with other Han characters and with kana), but since these are within the same language group this is outside the scope of this RFC. Such confusables are equivalent to `l` vs `I` being confusable in some fonts.
143
+
144
+
For reference, a list of all possible Rust identifier characters that do not trip `less_used_codepoints` but have confusables can be found [here][unicode-set-confusables], with their confusable skeleton and script group mentioned on the right. Note that in many cases the confusables are visually distinguishable, or are diacritic marks.
145
+
146
+
128
147
## Adjustments to the "bad style" lints
129
148
130
149
Rust [RFC 0430] establishes naming conventions for Rust ASCII identifiers. The *rustc* compiler includes lints to promote these recommendations.
@@ -151,7 +170,7 @@ The code used for implementing the various lints and checks will be released to
151
170
- Script identification and comparison for `mixed_script_detection` ([UTS #39 Section 5.2][TR39RestrictionLevel])
152
171
- `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable])
153
172
154
-
173
+
Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code. For example, some crates use proc macros to expose command line options or REST endpoints. Such crates can use these algorithms to provide better error handling: if we accidentally end up with an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass through, depending on requirements).
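A crate exposing endpoints might use a skeleton comparison along these lines. This is only a sketch under stated assumptions: `toy_skeleton` is a hypothetical, deliberately tiny stand-in for the UTS #39 `skeleton(X)` algorithm that maps a handful of Cyrillic letters to their Latin look-alikes, and `Lookup` and `lookup` are illustrative names, not part of any released library.

```rust
/// ASSUMPTION: toy replacement for UTS #39 `skeleton(X)`. It only maps a few
/// Cyrillic letters to Latin look-alikes and leaves everything else unchanged.
fn toy_skeleton(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{0430}' => 'a', // а
            '\u{0435}' => 'e', // е
            '\u{043E}' => 'o', // о
            '\u{0440}' => 'p', // р
            '\u{0441}' => 'c', // с
            other => other,
        })
        .collect()
}

/// Outcome of resolving a requested path against the registered endpoints.
#[derive(Debug, PartialEq)]
pub enum Lookup<'a> {
    /// The path is registered exactly as requested.
    Found(&'a str),
    /// No exact match, but a registered path has the same skeleton.
    ConfusableWith(&'a str),
    /// Nothing matches, even after comparing skeletons.
    NotFound,
}

/// Tries an exact match first, then falls back to comparing skeletons so the
/// caller can report a helpful error (or pass the request through).
pub fn lookup<'a>(routes: &'a [String], requested: &str) -> Lookup<'a> {
    if let Some(r) = routes.iter().find(|r| r.as_str() == requested) {
        return Lookup::Found(r.as_str());
    }
    let wanted = toy_skeleton(requested);
    match routes.iter().find(|r| toy_skeleton(r.as_str()) == wanted) {
        Some(r) => Lookup::ConfusableWith(r.as_str()),
        None => Lookup::NotFound,
    }
}
```

With a registered Cyrillic `/арр` endpoint, a request for the Latin `/app` yields `Lookup::ConfusableWith("/арр")`, which the crate could turn into a descriptive error message or a pass-through.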
155
174
156
175
## Conformance Statement
157
176
@@ -165,6 +184,8 @@ The code used for implementing the various lints and checks will be released to
165
184
* UAX31-R3. Pattern_White_Space and Pattern_Syntax Characters: Rust only uses characters from these categories for whitespace and syntax. Other characters may or may not be allowed in identifiers.
166
185
* UAX31-R4. Equivalent Normalized Identifiers: All identifiers are normalized according to normalization form C before comparison.
167
186
187
+
188
+
168
189
# Drawbacks
169
190
[drawbacks]: #drawbacks
170
191
@@ -226,6 +247,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\
226
247
* How are non-ASCII idents best supported in debuggers?
227
248
* Which name mangling scheme is used by the compiler?
228
249
* Is there a better name for the `less_used_codepoints` lint?
250
+
* Which lint should the global mixed scripts confusables detection trigger?