-
Notifications
You must be signed in to change notification settings - Fork 59
Support for string escapes #459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is my first time working with tree-sitter, so I would appreciate any feedback. I am honestly not certain this feature warrants the complexity of these changes, so it will not hurt my feelings if we ultimately decide not to merge this PR. But if people do want it, I am happy to iterate on any necessary changes. I split the external tokens for This implementation uses the same node name ( |
36a7a57
to
e7513d0
Compare
"else", | ||
"catch", | ||
"finally", | ||
"extends", | ||
"derives", | ||
"with", | ||
$.error_sentinel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs suggest having an error_sentinel
token type. We do not use it, but I found it helpful in debugging.
@@ -1616,7 +1622,7 @@ module.exports = grammar({ | |||
choice( | |||
seq( | |||
"\\", | |||
choice(/[^xu]/, /uu?[0-9a-fA-F]{4}/, /x[0-9a-fA-F]{2}/), | |||
choice(/[^xu]/, /[uU]+[0-9a-fA-F]{4}/, /x[0-9a-fA-F]{2}/), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a comment noting this elsewhere, but Java weirdly allows any number of u
s and U
s at the start of an escape sequence. https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html
@@ -185,6 +228,24 @@ static inline void debug_indents(Scanner *scanner) { | |||
|
|||
bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer, | |||
const bool *valid_symbols) { | |||
#ifdef DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this very helpful for debugging.
@@ -388,30 +449,53 @@ bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer, | |||
skip(lexer); | |||
} | |||
|
|||
if (valid_symbols[SIMPLE_STRING] && lexer->lookahead == '"') { | |||
if (valid_symbols[SIMPLE_STRING_START] && lexer->lookahead == '"') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we have to scan simple string starts externally instead of just having a normal token, like we do for interpolated strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this has been around for a long time, and was originally added for interpolated string too 2dcb6b0. It's possible that it was implemented that way back then, but we could move "
and """
into the parser?
I haven't looked into the details, but I'm +1 on adding this feature. Thanks for the contribution! |
src/scanner.c
Outdated
typedef enum { | ||
STRING_MODE_SIMPLE, | ||
STRING_MODE_INTERPOLATED, | ||
STRING_MODE_RAW, | ||
} StringMode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add comment to document what this mode represents plz?
@@ -388,30 +449,53 @@ bool tree_sitter_scala_external_scanner_scan(void *payload, TSLexer *lexer, | |||
skip(lexer); | |||
} | |||
|
|||
if (valid_symbols[SIMPLE_STRING] && lexer->lookahead == '"') { | |||
if (valid_symbols[SIMPLE_STRING_START] && lexer->lookahead == '"') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this has been around for a long time, and was originally added for interpolated string too 2dcb6b0. It's possible that it was implemented that way back then, but we could move "
and """
into the parser?
I spot checked some 5 of the 21 files in scala2-compiler that cause the CI failure, and they were all errors caused by a raw string that had a backslash that was not part of a valid escape sequence. For example,
I am in mildly inclined towards 2 or 3, but happy to do whatever the maintainers prefer. |
I would suggest making We've encountered a similar situation a few times before. Imo it's okay to have tree-sitter grammar more lenient than actual Scala parser and compiler. I can't see how this approach would hurt main tree-sitter use cases like syntax highlighting, code navigation or repomap building.
This option is also viable, I think. In case it's not too hard to implement @eed3si9n wdyt? |
The primary use case for the tree-sitter grammar is for syntax hilighting, so while I'm all for tree-sitter being more relaxed etc to accommodate Scala 2 and 3 etc, in this example of writing a regex inside the raw-interpolater, it might be misleading to highlight anything. My suggestion would be to restore the raw special case, but add a comment explaining why it needs special treatment. |
Okay, agreed 👍 @jonshea Sorry for the extra work you had to do because of my comment. Thank you for the contribution |
82d2766
to
2efde4c
Compare
@susliko no need to apologize. I had some things I wanted to rework anyway. I reimplemented the raw string parsing. The previous version I had created a parse tree that was pretty different from the parse tree for Do we currently have any way of tracking known syntax that fails to parse? I found a few examples, and I was thinking about making a |
That's a great idea. Failing could mean two things:
There are some of them tracked as GitHub issues, but I think it would be cool to have some naming conventions like |
This PR adds support for string escapes, as proposed in tree-sitter#207 I also slightly changed the definition of `interpolated_string` by qualifying the open quotes with `token.immediate(…)`. Prior to this change the rule would incorrectly match `foo ""` as an `interpolated_string_expression` I ran these changes against all of the `.scala` files in https://github.com/scala/scala3 and https://github.com/scala/scala. The files in `scala/scala3` that newly have errors are: * `tests/neg/fEscapes.scala` * `tests/neg/unicodeEscapes-interpolations.scala` * `tests/pos/multiLineOps.scala` * `tests/run/i14164.scala` The first two are tests containing examples of invalid escape sequences that are expected to fail. `multiLineOps.scala` contains a line `send_! "!"` that now parses with an error. Previously this parsed as an `interpolated_string_expression`, which is also entirely incorrect, So this error is a result of the adding `token.immediate('"')` to the definition of `interpolated_string`, and I do not think the change is a regression. Similarly, `i14164.scala` contains a multi-line expression that previously incorrectly parsed to `interpolated_string_expression`, and now parses more correctly, though with an error. The files in `scala/scala` that newly have errors are similar. Two test files with intentionally broken escape sequences, and the same `multiLineOps.scala`.
2efde4c
to
ddd2dc0
Compare
I also slightly changed the definition of
interpolated_string
byqualifying the open quotes with
token.immediate(…)
. Prior to thischange the rule would incorrectly match
foo ""
as aninterpolated_string_expression
I ran these changes against all of the
.scala
files inhttps://github.com/scala/scala3 and https://github.com/scala/scala.
The files in
scala/scala3
that newly have errors are:tests/neg/fEscapes.scala
tests/neg/unicodeEscapes-interpolations.scala
tests/pos/multiLineOps.scala
tests/run/i14164.scala
The first two are tests containing examples of invalid escape
sequences that are expected to fail.
multiLineOps.scala
contains aline
send_! "!"
that now parses with an error. Previously thisparsed as an
interpolated_string_expression
, which is also entirelyincorrect, So this error is a result of the adding
token.immediate('"')
to the definition ofinterpolated_string
, andI do not think the change is a regression. Similarly,
i14164.scala
contains a multi-line expression that previously incorrectly parsed to
interpolated_string_expression
, and now parses more correctly,though with an error.
The files in
scala/scala
that newly have errors are similar. Twotest files with intentionally broken escape sequences, and the same
multiLineOps.scala
.