Skip to content

Commit 3ec4747

Browse files
committed
Fix wrong constant/identifier detection in lexer for non-ascii encodings
During lexing/parsing, a token is determined to be an identifier or a constant by looking at its first byte and determining if it is an uppercase character: Constants start with an uppercase character, identifiers start with a lowercase one. But as previously implemented, only the first byte of a token was used when checking if a character is an upper or lowercase one, which breaks under multibyte encodings, such as UTF-8. For instance, in #2079 a user reported that ```ruby def variance μ = mean reduce(0) { |m, i| m + (i - μ).square } / (length - 1) end ``` caused a `SyntaxError: dynamic constant assignment` because `μ` ([Greek Small Letter Mu](https://www.compart.com/en/unicode/U+03BC)) is encoded with UTF-8 as (206, 188), and `Character.isUpperCase(206) == true`. Thus, the code was considering it a constant, even though the character is considered a lowercase letter by the unicode standard (and thus by MRI). Interestingly enough, the reverse can happen. ```ruby def foo Ⴀ = 1 end ``` The character `Ⴀ` ([Georgian Capital Letter An](https://www.compart.com/en/unicode/U+10A0)) is encoded with UTF-8 as (225, 130, 160) and `isUpperCase(225) == false` making TruffleRuby accept it as an identifier in the example, where MRI rejects it with `SyntaxError: dynamic constant assignment` because it considers it to be a constant. The fix is thus to extract the entire character on multi-byte encodings, and thus the `isUpperCase` check becomes correct. To discuss before merging: * This extracts the first character in a rather heavy-handed way -- it converts the entire `Rope` representation of the token into a `String` instance, and then extracts the `String`'s first character. This is definitely correct but it's a rather expensive conversion just to grab the first character -- especially given that most of the time Ruby sources are only the ASCII subset of UTF-8 and thus are not using multibyte characters. I was digging through the `Rope` implementation and could not find a nice way of getting only a single character that can be used by the lexer/parser. One low-hanging improvement is to add a fast path for ASCII/UTF-8 single-byte characters, and in other cases fall back to the heaver approach. Thoughts? * What's the best way to add a regression test for this? I was looking through the codebase and there doesn't seem to be a a lot of TruffleRuby-specific tests, but I'm not sure if this would fit within ruby/spec. fixes #2079
1 parent 2cb0520 commit 3ec4747

File tree

2 files changed

+3
-3
lines changed

2 files changed

+3
-3
lines changed

src/main/java/org/truffleruby/parser/lexer/RubyLexer.java

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1798,7 +1798,7 @@ private int identifier(int c, boolean commandState) {
17981798
}
17991799
tempVal = createTokenRope();
18001800

1801-
if (result == 0 && Character.isUpperCase(tempVal.get(0) & 0xFF)) {
1801+
if (result == 0 && Character.isUpperCase(tempVal.getString().charAt(0))) {
18021802
result = RubyParser.tCONSTANT;
18031803
} else {
18041804
result = RubyParser.tIDENTIFIER;
@@ -3493,7 +3493,7 @@ public boolean update_heredoc_indent(int c) {
34933493
}
34943494

34953495
public void validateFormalIdentifier(Rope identifier) {
3496-
int first = identifier.get(0) & 0xFF;
3496+
int first = identifier.getString().charAt(0);
34973497

34983498
if (Character.isUpperCase(first)) {
34993499
compile_error("formal argument cannot be a constant");

src/main/java/org/truffleruby/parser/parser/ParserSupport.java

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1459,7 +1459,7 @@ public void warning(SourceIndexLength position, String message) {
14591459

14601460
// ENEBO: Totally weird naming (in MRI is not allocated and is a local var name) [1.9]
14611461
public boolean is_local_id(Rope name) {
1462-
return lexer.isIdentifierChar(name.get(0) & 0xFF);
1462+
return lexer.isIdentifierChar(name.getString().charAt(0));
14631463
}
14641464

14651465
// 1.9

0 commit comments

Comments
 (0)