Support for string escapes #459

jonshea · 2025-03-17T18:28:53Z

I also slightly changed the definition of interpolated_string by
qualifying the open quotes with token.immediate(…). Prior to this
change the rule would incorrectly match foo "" as an interpolated_string_expression

I ran these changes against all of the .scala files in
https://github.com/scala/scala3 and https://github.com/scala/scala.

The files in scala/scala3 that newly have errors are:

tests/neg/fEscapes.scala
tests/neg/unicodeEscapes-interpolations.scala
tests/pos/multiLineOps.scala
tests/run/i14164.scala

The first two are tests containing examples of invalid escape
sequences that are expected to fail. multiLineOps.scala contains a
line send_! "!" that now parses with an error. Previously this
parsed as an interpolated_string_expression, which is also entirely
incorrect, So this error is a result of the adding
token.immediate('"') to the definition of interpolated_string, and
I do not think the change is a regression. Similarly, i14164.scala
contains a multi-line expression that previously incorrectly parsed to
interpolated_string_expression, and now parses more correctly,
though with an error.

The files in scala/scala that newly have errors are similar. Two
test files with intentionally broken escape sequences, and the same
multiLineOps.scala.

jonshea · 2025-03-17T18:29:19Z

This is my first time working with tree-sitter, so I would appreciate any feedback. I am honestly not certain this feature warrants the complexity of these changes, so it will not hurt my feelings if we ultimately decide not to merge this PR. But if people do want it, I am happy to iterate on any necessary changes.

I split the external tokens for SIMPLE_STRING to have a start / middle / end, and SIMPLE_MULTILINE_STRING to have a start / end. As far as I can tell, there are no escape or other tokens that can be found in a simple multiline string, so it could have stayed as a single token, but I think the code is cleaner if we treat it like the other string types. I also added a raw string and a raw multiline string type. I believe the rules for raw multiline strings are identical to the rules for interpolated multiline strings, so we could merge those two nodes if desired.

This implementation uses the same node name (interpolated_string_expression) for raw strings and interpolated strings, and the same name (escape_sequence) for both escape sequences and dollar sign escapes. Those were arbitrary choices on my part.

jonshea · 2025-03-17T18:32:56Z

grammar.js

    "else",
    "catch",
    "finally",
    "extends",
    "derives",
    "with",
+    $.error_sentinel,


The docs suggest having an error_sentinel token type. We do not use it, but I found it helpful in debugging.

jonshea · 2025-03-17T18:35:18Z

grammar.js

@@ -1616,7 +1622,7 @@ module.exports = grammar({
            choice(
              seq(
                "\\",
-                choice(/[^xu]/, /uu?[0-9a-fA-F]{4}/, /x[0-9a-fA-F]{2}/),
+                choice(/[^xu]/, /[uU]+[0-9a-fA-F]{4}/, /x[0-9a-fA-F]{2}/),


I have a comment noting this elsewhere, but Java weirdly allows any number of us and Us at the start of an escape sequence. https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html

jonshea · 2025-03-17T18:35:39Z

src/scanner.c

@@ -185,6 +228,24 @@ static inline void debug_indents(Scanner *scanner) {

 bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer,
                                             const bool *valid_symbols) {
+  #ifdef DEBUG


I found this very helpful for debugging.

jonshea · 2025-03-17T18:36:53Z

src/scanner.c

@@ -388,30 +449,53 @@ bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer,
    skip(lexer);
  }

-  if (valid_symbols[SIMPLE_STRING] && lexer->lookahead == '"') {
+  if (valid_symbols[SIMPLE_STRING_START] && lexer->lookahead == '"') {


Is there a reason we have to scan simple string starts externally instead of just having a normal token, like we do for interpolated strings?

It looks like this has been around for a long time, and was originally added for interpolated string too 2dcb6b0. It's possible that it was implemented that way back then, but we could move " and """ into the parser?

eed3si9n · 2025-03-18T05:53:07Z

This is my first time working with tree-sitter, so I would appreciate any feedback.

I haven't looked into the details, but I'm +1 on adding this feature. Thanks for the contribution!

eed3si9n · 2025-03-18T06:07:50Z

src/scanner.c

+typedef enum {
+  STRING_MODE_SIMPLE,
+  STRING_MODE_INTERPOLATED,
+  STRING_MODE_RAW,
+} StringMode;


Could you add comment to document what this mode represents plz?

eed3si9n · 2025-03-18T06:31:19Z

src/scanner.c

@@ -388,30 +449,53 @@ bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer,
    skip(lexer);
  }

-  if (valid_symbols[SIMPLE_STRING] && lexer->lookahead == '"') {
+  if (valid_symbols[SIMPLE_STRING_START] && lexer->lookahead == '"') {


It looks like this has been around for a long time, and was originally added for interpolated string too 2dcb6b0. It's possible that it was implemented that way back then, but we could move " and """ into the parser?

grammar.js

jonshea · 2025-03-20T19:17:11Z

I spot checked some 5 of the 21 files in scala2-compiler that cause the CI failure, and they were all errors caused by a raw string that had a backslash that was not part of a valid escape sequence. For example, val parseVerbose = raw"\* \S+ \S+ \[([^:]+): .*\] .*".r. So, I guess we are back at the raw-string discussion from yesterday. Here are my ideas.

We could decide not to parse escape sequences in any interpolated strings.
We could parse only valid escape sequences in interpolated strings, and treat invalid escape sequences as part of the string instead of treating them like a parse error.
We could go back to treating raw-strings as a special case.

I am in mildly inclined towards 2 or 3, but happy to do whatever the maintainers prefer.

susliko · 2025-03-21T06:30:01Z

a raw string that had a backslash that was not part of a valid escape sequence

I would suggest making escape_sequence rule more relaxed by allowing any character after backslash.

We've encountered a similar situation a few times before. Imo it's okay to have tree-sitter grammar more lenient than actual Scala parser and compiler. I can't see how this approach would hurt main tree-sitter use cases like syntax highlighting, code navigation or repomap building.

We could parse only valid escape sequences in interpolated strings, and treat invalid escape sequences as part of the string instead of treating them like a parse error.

This option is also viable, I think. In case it's not too hard to implement

@eed3si9n wdyt?

eed3si9n · 2025-03-21T11:57:19Z

The primary use case for the tree-sitter grammar is for syntax hilighting, so while I'm all for tree-sitter being more relaxed etc to accommodate Scala 2 and 3 etc, in this example of writing a regex inside the raw-interpolater, it might be misleading to highlight anything.

My suggestion would be to restore the raw special case, but add a comment explaining why it needs special treatment.

susliko · 2025-03-21T17:38:35Z

My suggestion would be to restore the raw special case, but add a comment explaining why it needs special treatment.

Okay, agreed 👍

@jonshea Sorry for the extra work you had to do because of my comment. Thank you for the contribution

jonshea · 2025-03-25T14:28:27Z

@susliko no need to apologize. I had some things I wanted to rework anyway. I reimplemented the raw string parsing. The previous version I had created a parse tree that was pretty different from the parse tree for interpolated_string_expression, and I don’t like that. This new version requires an external scanner for the raw token, but I still think it is overall cleaner, and it allows the raw parse tree to look exactly like the interpolated string parse tree, so highlights should work without change. I also fixed a couple edge cases with backslashes in raw strings that I think I missed the first time.

Do we currently have any way of tracking known syntax that fails to parse? I found a few examples, and I was thinking about making a test/corpus/failing.txt to hold them, with the test cases marked with :skip. Would that be ok? Or is there something else I should do?

eed3si9n · 2025-03-25T16:50:00Z

Do we currently have any way of tracking known syntax that fails to parse? I found a few examples, and I was thinking about making a test/corpus/failing.txt to hold them, with the test cases marked with :skip. Would that be ok? Or is there something else I should do?

That's a great idea. Failing could mean two things:

Pending example (:skip), which should parse but currently doesn't
Negative example (:error), which should not parse and currently doesn't

There are some of them tracked as GitHub issues, but I think it would be cool to have some naming conventions like test/corpus/literals-pending.txt and organize them by areas.

This PR adds support for string escapes, as proposed in tree-sitter#207 I also slightly changed the definition of `interpolated_string` by qualifying the open quotes with `token.immediate(…)`. Prior to this change the rule would incorrectly match `foo ""` as an `interpolated_string_expression` I ran these changes against all of the `.scala` files in https://github.com/scala/scala3 and https://github.com/scala/scala. The files in `scala/scala3` that newly have errors are: * `tests/neg/fEscapes.scala` * `tests/neg/unicodeEscapes-interpolations.scala` * `tests/pos/multiLineOps.scala` * `tests/run/i14164.scala` The first two are tests containing examples of invalid escape sequences that are expected to fail. `multiLineOps.scala` contains a line `send_! "!"` that now parses with an error. Previously this parsed as an `interpolated_string_expression`, which is also entirely incorrect, So this error is a result of the adding `token.immediate('"')` to the definition of `interpolated_string`, and I do not think the change is a regression. Similarly, `i14164.scala` contains a multi-line expression that previously incorrectly parsed to `interpolated_string_expression`, and now parses more correctly, though with an error. The files in `scala/scala` that newly have errors are similar. Two test files with intentionally broken escape sequences, and the same `multiLineOps.scala`.

jonshea force-pushed the jonshea/string-escapes branch from 36a7a57 to e7513d0 Compare March 17, 2025 18:31

jonshea commented Mar 17, 2025

View reviewed changes

eed3si9n approved these changes Mar 18, 2025

View reviewed changes

susliko reviewed Mar 18, 2025

View reviewed changes

grammar.js Outdated Show resolved Hide resolved

susliko approved these changes Mar 20, 2025

View reviewed changes

jonshea force-pushed the jonshea/string-escapes branch from 82d2766 to 2efde4c Compare March 25, 2025 14:17

jonshea marked this pull request as draft March 25, 2025 18:46

jonshea force-pushed the jonshea/string-escapes branch from 2efde4c to ddd2dc0 Compare March 25, 2025 19:17

jonshea marked this pull request as ready for review March 25, 2025 19:17

eed3si9n merged commit 774d1cb into tree-sitter:master Mar 26, 2025
5 checks passed

Support for string escapes #459

Support for string escapes #459

Conversation

jonshea commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonshea commented Mar 17, 2025

Uh oh!

jonshea Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

jonshea Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

jonshea Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

jonshea Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

eed3si9n Mar 18, 2025

Choose a reason for hiding this comment

Uh oh!

eed3si9n commented Mar 18, 2025

Uh oh!

eed3si9n Mar 18, 2025

Choose a reason for hiding this comment

Uh oh!

eed3si9n Mar 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonshea commented Mar 20, 2025

Uh oh!

susliko commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eed3si9n commented Mar 21, 2025

Uh oh!

susliko commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonshea commented Mar 25, 2025

Uh oh!

eed3si9n commented Mar 25, 2025

Uh oh!

Uh oh!

Uh oh!

jonshea commented Mar 17, 2025 •

edited

Loading

susliko commented Mar 21, 2025 •

edited

Loading

susliko commented Mar 21, 2025 •

edited

Loading