Fix wrong constant/identifier detection in lexer for non-ascii encodings

ivoanjo · ivoanjo · commit 3ec4747b862c · 2020-09-18T19:45:52.000+01:00
During lexing/parsing, a token is determined to be an identifier or a constant by looking at its first byte and determining if it is an uppercase character: Constants start with an uppercase character, identifiers start with a lowercase one. But as previously implemented, only the first byte of a token was used when checking if a character is an upper or lowercase one, which breaks under multibyte encodings, such as UTF-8. For instance, in #2079 a user reported that ```ruby def variance μ = mean reduce(0) { |m, i| m + (i - μ).square } / (length - 1) end ``` caused a `SyntaxError: dynamic constant assignment` because `μ` ([Greek Small Letter Mu](https://www.compart.com/en/unicode/U+03BC)) is encoded with UTF-8 as (206, 188), and `Character.isUpperCase(206) == true`. Thus, the code was considering it a constant, even though the character is considered a lowercase letter by the unicode standard (and thus by MRI). Interestingly enough, the reverse can happen. ```ruby def foo Ⴀ = 1 end ``` The character `Ⴀ` ([Georgian Capital Letter An](https://www.compart.com/en/unicode/U+10A0)) is encoded with UTF-8 as (225, 130, 160) and `isUpperCase(225) == false` making TruffleRuby accept it as an identifier in the example, where MRI rejects it with `SyntaxError: dynamic constant assignment` because it considers it to be a constant. The fix is thus to extract the entire character on multi-byte encodings, and thus the `isUpperCase` check becomes correct. To discuss before merging: * This extracts the first character in a rather heavy-handed way -- it converts the entire `Rope` representation of the token into a `String` instance, and then extracts the `String`'s first character. This is definitely correct but it's a rather expensive conversion just to grab the first character -- especially given that most of the time Ruby sources are only the ASCII subset of UTF-8 and thus are not using multibyte characters. I was digging through the `Rope` implementation and could not find a nice way of getting only a single character that can be used by the lexer/parser. One low-hanging improvement is to add a fast path for ASCII/UTF-8 single-byte characters, and in other cases fall back to the heaver approach. Thoughts? * What's the best way to add a regression test for this? I was looking through the codebase and there doesn't seem to be a a lot of TruffleRuby-specific tests, but I'm not sure if this would fit within ruby/spec. fixes #2079
diff --git a/src/main/java/org/truffleruby/parser/lexer/RubyLexer.java b/src/main/java/org/truffleruby/parser/lexer/RubyLexer.java
@@ -1798,7 +1798,7 @@ private int identifier(int c, boolean commandState) {
             }
             tempVal = createTokenRope();
 
-            if (result == 0 && Character.isUpperCase(tempVal.get(0) & 0xFF)) {
+            if (result == 0 && Character.isUpperCase(tempVal.getString().charAt(0))) {
                 result = RubyParser.tCONSTANT;
             } else {
                 result = RubyParser.tIDENTIFIER;
@@ -3493,7 +3493,7 @@ public boolean update_heredoc_indent(int c) {
     }
 
     public void validateFormalIdentifier(Rope identifier) {
-        int first = identifier.get(0) & 0xFF;
+        int first = identifier.getString().charAt(0);
 
         if (Character.isUpperCase(first)) {
             compile_error("formal argument cannot be a constant");
diff --git a/src/main/java/org/truffleruby/parser/parser/ParserSupport.java b/src/main/java/org/truffleruby/parser/parser/ParserSupport.java
@@ -1459,7 +1459,7 @@ public void warning(SourceIndexLength position, String message) {
 
     // ENEBO: Totally weird naming (in MRI is not allocated and is a local var name) [1.9]
     public boolean is_local_id(Rope name) {
-        return lexer.isIdentifierChar(name.get(0) & 0xFF);
+        return lexer.isIdentifierChar(name.getString().charAt(0));
     }
 
     // 1.9

Original file line number	Diff line number	Diff line change
`@@ -1798,7 +1798,7 @@ private int identifier(int c, boolean commandState) {`
`1798`	`1798`	`}`
`1799`	`1799`	`tempVal = createTokenRope();`
`1800`	`1800`
`1801`		`- if (result == 0 && Character.isUpperCase(tempVal.get(0) & 0xFF)) {`
	`1801`	`+ if (result == 0 && Character.isUpperCase(tempVal.getString().charAt(0))) {`
`1802`	`1802`	`result = RubyParser.tCONSTANT;`
`1803`	`1803`	`} else {`
`1804`	`1804`	`result = RubyParser.tIDENTIFIER;`
`@@ -3493,7 +3493,7 @@ public boolean update_heredoc_indent(int c) {`
`3493`	`3493`	`}`
`3494`	`3494`
`3495`	`3495`	`public void validateFormalIdentifier(Rope identifier) {`
`3496`		`- int first = identifier.get(0) & 0xFF;`
	`3496`	`+ int first = identifier.getString().charAt(0);`
`3497`	`3497`
`3498`	`3498`	`if (Character.isUpperCase(first)) {`
`3499`	`3499`	`compile_error("formal argument cannot be a constant");`
Original file line number	Diff line number	Diff line change
`@@ -1459,7 +1459,7 @@ public void warning(SourceIndexLength position, String message) {`
`1459`	`1459`
`1460`	`1460`	`// ENEBO: Totally weird naming (in MRI is not allocated and is a local var name) [1.9]`
`1461`	`1461`	`public boolean is_local_id(Rope name) {`
`1462`		`- return lexer.isIdentifierChar(name.get(0) & 0xFF);`
	`1462`	`+ return lexer.isIdentifierChar(name.getString().charAt(0));`
`1463`	`1463`	`}`
`1464`	`1464`
`1465`	`1465`	`// 1.9`