Skip to content

Support for string escapes #459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 26, 2025

Conversation

jonshea
Copy link
Contributor

@jonshea jonshea commented Mar 17, 2025

I also slightly changed the definition of interpolated_string by
qualifying the open quotes with token.immediate(…). Prior to this
change the rule would incorrectly match foo "" as an interpolated_string_expression

I ran these changes against all of the .scala files in
https://github.com/scala/scala3 and https://github.com/scala/scala.

The files in scala/scala3 that newly have errors are:

  • tests/neg/fEscapes.scala
  • tests/neg/unicodeEscapes-interpolations.scala
  • tests/pos/multiLineOps.scala
  • tests/run/i14164.scala

The first two are tests containing examples of invalid escape
sequences that are expected to fail. multiLineOps.scala contains a
line send_! "!" that now parses with an error. Previously this
parsed as an interpolated_string_expression, which is also entirely
incorrect, So this error is a result of the adding
token.immediate('"') to the definition of interpolated_string, and
I do not think the change is a regression. Similarly, i14164.scala
contains a multi-line expression that previously incorrectly parsed to
interpolated_string_expression, and now parses more correctly,
though with an error.

The files in scala/scala that newly have errors are similar. Two
test files with intentionally broken escape sequences, and the same
multiLineOps.scala.

@jonshea
Copy link
Contributor Author

jonshea commented Mar 17, 2025

This is my first time working with tree-sitter, so I would appreciate any feedback. I am honestly not certain this feature warrants the complexity of these changes, so it will not hurt my feelings if we ultimately decide not to merge this PR. But if people do want it, I am happy to iterate on any necessary changes.

I split the external tokens for SIMPLE_STRING to have a start / middle / end, and SIMPLE_MULTILINE_STRING to have a start / end. As far as I can tell, there are no escape or other tokens that can be found in a simple multiline string, so it could have stayed as a single token, but I think the code is cleaner if we treat it like the other string types. I also added a raw string and a raw multiline string type. I believe the rules for raw multiline strings are identical to the rules for interpolated multiline strings, so we could merge those two nodes if desired.

This implementation uses the same node name (interpolated_string_expression) for raw strings and interpolated strings, and the same name (escape_sequence) for both escape sequences and dollar sign escapes. Those were arbitrary choices on my part.

@jonshea jonshea force-pushed the jonshea/string-escapes branch from 36a7a57 to e7513d0 Compare March 17, 2025 18:31
"else",
"catch",
"finally",
"extends",
"derives",
"with",
$.error_sentinel,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs suggest having an error_sentinel token type. We do not use it, but I found it helpful in debugging.

@@ -1616,7 +1622,7 @@ module.exports = grammar({
choice(
seq(
"\\",
choice(/[^xu]/, /uu?[0-9a-fA-F]{4}/, /x[0-9a-fA-F]{2}/),
choice(/[^xu]/, /[uU]+[0-9a-fA-F]{4}/, /x[0-9a-fA-F]{2}/),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a comment noting this elsewhere, but Java weirdly allows any number of us and Us at the start of an escape sequence. https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html

@@ -185,6 +228,24 @@ static inline void debug_indents(Scanner *scanner) {

bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer,
const bool *valid_symbols) {
#ifdef DEBUG
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this very helpful for debugging.

@@ -388,30 +449,53 @@ bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer,
skip(lexer);
}

if (valid_symbols[SIMPLE_STRING] && lexer->lookahead == '"') {
if (valid_symbols[SIMPLE_STRING_START] && lexer->lookahead == '"') {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we have to scan simple string starts externally instead of just having a normal token, like we do for interpolated strings?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this has been around for a long time, and was originally added for interpolated string too 2dcb6b0. It's possible that it was implemented that way back then, but we could move " and """ into the parser?

@eed3si9n
Copy link
Collaborator

This is my first time working with tree-sitter, so I would appreciate any feedback.

I haven't looked into the details, but I'm +1 on adding this feature. Thanks for the contribution!

src/scanner.c Outdated
Comment on lines 59 to 63
typedef enum {
STRING_MODE_SIMPLE,
STRING_MODE_INTERPOLATED,
STRING_MODE_RAW,
} StringMode;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add comment to document what this mode represents plz?

@@ -388,30 +449,53 @@ bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer,
skip(lexer);
}

if (valid_symbols[SIMPLE_STRING] && lexer->lookahead == '"') {
if (valid_symbols[SIMPLE_STRING_START] && lexer->lookahead == '"') {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this has been around for a long time, and was originally added for interpolated string too 2dcb6b0. It's possible that it was implemented that way back then, but we could move " and """ into the parser?

@jonshea
Copy link
Contributor Author

jonshea commented Mar 20, 2025

I spot checked some 5 of the 21 files in scala2-compiler that cause the CI failure, and they were all errors caused by a raw string that had a backslash that was not part of a valid escape sequence. For example, val parseVerbose = raw"\* \S+ \S+ \[([^:]+): .*\] .*".r. So, I guess we are back at the raw-string discussion from yesterday. Here are my ideas.

  1. We could decide not to parse escape sequences in any interpolated strings.
  2. We could parse only valid escape sequences in interpolated strings, and treat invalid escape sequences as part of the string instead of treating them like a parse error.
  3. We could go back to treating raw-strings as a special case.

I am in mildly inclined towards 2 or 3, but happy to do whatever the maintainers prefer.

@susliko
Copy link
Collaborator

susliko commented Mar 21, 2025

a raw string that had a backslash that was not part of a valid escape sequence

I would suggest making escape_sequence rule more relaxed by allowing any character after backslash.

We've encountered a similar situation a few times before. Imo it's okay to have tree-sitter grammar more lenient than actual Scala parser and compiler. I can't see how this approach would hurt main tree-sitter use cases like syntax highlighting, code navigation or repomap building.

We could parse only valid escape sequences in interpolated strings, and treat invalid escape sequences as part of the string instead of treating them like a parse error.

This option is also viable, I think. In case it's not too hard to implement

@eed3si9n wdyt?

@eed3si9n
Copy link
Collaborator

The primary use case for the tree-sitter grammar is for syntax hilighting, so while I'm all for tree-sitter being more relaxed etc to accommodate Scala 2 and 3 etc, in this example of writing a regex inside the raw-interpolater, it might be misleading to highlight anything.

My suggestion would be to restore the raw special case, but add a comment explaining why it needs special treatment.

@susliko
Copy link
Collaborator

susliko commented Mar 21, 2025

My suggestion would be to restore the raw special case, but add a comment explaining why it needs special treatment.

Okay, agreed 👍

@jonshea Sorry for the extra work you had to do because of my comment. Thank you for the contribution

@jonshea jonshea force-pushed the jonshea/string-escapes branch from 82d2766 to 2efde4c Compare March 25, 2025 14:17
@jonshea
Copy link
Contributor Author

jonshea commented Mar 25, 2025

@susliko no need to apologize. I had some things I wanted to rework anyway. I reimplemented the raw string parsing. The previous version I had created a parse tree that was pretty different from the parse tree for interpolated_string_expression, and I don’t like that. This new version requires an external scanner for the raw token, but I still think it is overall cleaner, and it allows the raw parse tree to look exactly like the interpolated string parse tree, so highlights should work without change. I also fixed a couple edge cases with backslashes in raw strings that I think I missed the first time.

Do we currently have any way of tracking known syntax that fails to parse? I found a few examples, and I was thinking about making a test/corpus/failing.txt to hold them, with the test cases marked with :skip. Would that be ok? Or is there something else I should do?

@eed3si9n
Copy link
Collaborator

Do we currently have any way of tracking known syntax that fails to parse? I found a few examples, and I was thinking about making a test/corpus/failing.txt to hold them, with the test cases marked with :skip. Would that be ok? Or is there something else I should do?

That's a great idea. Failing could mean two things:

  1. Pending example (:skip), which should parse but currently doesn't
  2. Negative example (:error), which should not parse and currently doesn't

There are some of them tracked as GitHub issues, but I think it would be cool to have some naming conventions like test/corpus/literals-pending.txt and organize them by areas.

@jonshea jonshea marked this pull request as draft March 25, 2025 18:46
This PR adds support for string escapes, as proposed in
tree-sitter#207

I also slightly changed the definition of `interpolated_string` by
qualifying the open quotes with `token.immediate(…)`. Prior to this
change the rule would incorrectly match `foo ""` as an `interpolated_string_expression`

I ran these changes against all of the `.scala` files in
https://github.com/scala/scala3 and https://github.com/scala/scala.

The files in `scala/scala3` that newly have errors are:
* `tests/neg/fEscapes.scala`
* `tests/neg/unicodeEscapes-interpolations.scala`
* `tests/pos/multiLineOps.scala`
* `tests/run/i14164.scala`

The first two are tests containing examples of invalid escape
sequences that are expected to fail. `multiLineOps.scala` contains a
line `send_! "!"` that now parses with an error. Previously this
parsed as an `interpolated_string_expression`, which is also entirely
incorrect, So this error is a result of the adding
`token.immediate('"')` to the definition of `interpolated_string`, and
I do not think the change is a regression. Similarly, `i14164.scala`
contains a multi-line expression that previously incorrectly parsed to
`interpolated_string_expression`, and now parses more correctly,
though with an error.

The files in `scala/scala` that newly have errors are similar. Two
test files with intentionally broken escape sequences, and the same
`multiLineOps.scala`.
@jonshea jonshea force-pushed the jonshea/string-escapes branch from 2efde4c to ddd2dc0 Compare March 25, 2025 19:17
@jonshea jonshea marked this pull request as ready for review March 25, 2025 19:17
@eed3si9n eed3si9n merged commit 774d1cb into tree-sitter:master Mar 26, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants