Stop emitting K".." and K"..." in lexer #573
base: main
Conversation
Assuming you mean terminals of the parse tree here... So - I do agree that, IMO, this is actually the main invariant we should aim for when designing token kinds: the parser should never need to look at the text inside a token in order to produce the parse tree. And this is the reason we have things like
So my question is - did you consider just
So - what made
Yes
So I do agree that bump_glue is much less bad than bump_split, but on the other hand, I also don't see a particularly strong motivation for it - you can always create a non-terminal that glues these together. I do think there is value in having the token invariant I mentioned as an internal self-consistency check for the parser. My intuition is also that token gluing causes extra work for incremental re-parse to unglue the tokens, but I don't have strong evidence for this. Of course you still get a relaxed version of the invariant if you allow token gluing, but if we don't have to have it, I just fail to see a strong motivation to require it.
Yes, I agree that that's an important constraint on the implementation, but it's not really an invariant of the output data structure. One of the big things I'm trying to nail down is what the data structure invariants of the output are to be able to test structural correctness of the parser (which is a distinct concept from correctness of the language implemented by the parser).
Yes, but I wanted to avoid using bump_glue more. I also don't like using the same syntax head with two different meanings depending on whether it's a terminal or non-terminal (I know we do that all over the place, but still).
Well, I don't really like the remapping either, but that's for a future PR.
Unfortunately, the sequences `..` and `...` do not always refer to the `..` operator or the `...` syntax. There are two and a half cases where they don't:

1. After `@` in a macro call, where they are both regular identifiers
2. In `import ...A`, where the dots specify the level
3. `:(...)`, which treats `...` as a quoted identifier

Case 1 was handled in a previous commit by lexing these as identifiers after `@`. However, as a result of case 2, it is problematic to tokenize these dots together; we essentially have to untokenize them in the import parser. It is also infeasible to give the lexer special context-sensitive lexing in `import`, because there could be arbitrary interpolations, e.g. `@eval import A, $(f(x..y)), ..b`, so deciding whether a particular `..` after `import` refers to the operator or a level specifier requires the parser.

Currently the parser handles this by splitting the obtained tokens again in the import parser, but this is undesirable because it invalidates the invariant that the tokens produced by the lexer correspond to the terminals of the final parse tree.

This PR addresses this by only ever having the lexer emit `K"."` and having the parser decide which case it refers to. The new non-terminal `K"dots"` handles the identifier cases (ordinary `..` and quoted `:(...)`). `K"..."` is now exclusively used for splat/slurp, and is no longer used in its non-terminal form for case 3.
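For concreteness, here is a hedged illustration of the surface forms at issue, using `Meta.parse` on reference Julia rather than any JuliaSyntax API (the `@..` macro is only a syntactic example and is not assumed to be defined anywhere):

```julia
# Each of these strings parses, but the dots play different roles:
Meta.parse("@.. a + b")     # 1. after `@`, `..` names a macro, i.e. an ordinary identifier
Meta.parse("import ..A")    # 2. the leading dots give the relative import level
Meta.parse(":(...)")        # 3. `...` quoted as an identifier

# For contrast, the ordinary uses:
Meta.parse("x..y")          # the `..` operator (a binary call)
Meta.parse("f(xs...)")      # splat/slurp
```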
Haha, I actually find this the most compelling argument for removing it. I do "like" token gluing in the sense that it allows
So removing gluing breaks this invariant of the output, which we currently have. Maybe that's not something to worry about, though. There might be other cases where it would be useful to create more rather than fewer tokens? For example, operator suffixes;
Seems reasonable - what kind of tests are you imagining?
Yes, I agree this feels weird. An alternative I considered in the past is to have different kinds, e.g. for functions -
A reasonable middle ground could be to have a new string macro, so we'd have
Julia has many cases where a token might be a keyword or identifier depending on parser context. I don't love this, but I don't know what to do about it.
If it makes you feel better, you can think of
I actually want to go the other way and remove the operator-specific heads for these - WIP PR coming shortly.
Run the lexer, then run the parser, make sure the token ranges match. Validate that the total number of children matches the total number of nodes, validate the node summation invariants. This should catch the most obvious ways in which you could mess up the parser state - bumping incorrect things into the output or
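A toy sketch of those invariants, using placeholder structs rather than the real lexer/parser types (the `range` fields, `Leaf`/`Node`, and the helper functions are assumptions for illustration only):

```julia
# Sketch only: structural self-consistency checks between lexer output and parse tree.
struct Leaf; range::UnitRange{Int}; end
struct Node; range::UnitRange{Int}; children::Vector{Any}; end

leaves(l::Leaf) = [l]
leaves(n::Node) = reduce(vcat, leaves.(n.children))

# 1. Every lexer token corresponds to exactly one terminal with the same range.
check_tokens(tree, token_ranges) = [l.range for l in leaves(tree)] == token_ranges

# 2. A parent's range is exactly covered, in order, by its children's ranges.
function check_ranges(n::Node)
    rs = [c.range for c in n.children]
    first(first(rs)) == first(n.range) &&
        last(last(rs)) == last(n.range) &&
        all(last(rs[i]) + 1 == first(rs[i+1]) for i in 1:length(rs)-1) &&
        all(check_ranges(c) for c in n.children if c isa Node)
end

# Toy example: source "a+b" -> tokens `a`, `+`, `b` under one interior node.
tree = Node(1:3, Any[Leaf(1:1), Leaf(2:2), Leaf(3:3)])
@assert check_tokens(tree, [1:1, 2:2, 3:3])
@assert check_ranges(tree)
```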
Keywords used as non-identifiers should always have the trivia flag set, no?
I guess for this PR the other option is that we can always emit
Haha! No, I'm afraid it doesn't 😆 I feel like it's strictly worse to have
I agree. There's only one other case I think: negative integer literals. Idk if those are essential either, maybe we could emit them as an interior node as well.
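For reference, the distinction in question as it appears in reference Julia's AST (a fact about `Base.Meta`, not a claim about this parser's output):

```julia
# In reference Julia, `-1` is folded into a single negative literal leaf,
# while `-x` remains an interior call node:
Meta.parse("-1")    # -1  (an Int literal, not a call)
Meta.parse("-x")    # :(-x) == Expr(:call, :-, :x)
```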
You mean this issue #334? I do think we should eliminate all the kinds which are inessential to parsing. The only things we'll have left are precedence classes of operators, plus a few specific operator kinds which are treated specially during parsing, such as
So I'm not suggesting adding more kinds for operators. The thought bubble here was that, perhaps, operators such as
Not to get too distracted by operators - I added a couple more thoughts about reducing the number of kinds in #334 (comment)
I'll have a PR for this in about 20 mins, so maybe comment there then ;)
Yes. It's the opposite which is the problem. Consider
The tokenizer needs to emit
But when it's not used as a keyword, we remap it to
I don't know how to avoid this; we can shift the problem around, or use different terminology for it, but it seems unavoidable. Maybe, maybe, we could insert extra tokenizer state or checks to try to get this correct for
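The specific keyword discussed above is elided in this thread, but `outer` is one concrete Julia example of the general problem (the same word is a keyword in one position and an ordinary identifier in another):

```julia
function f()
    i = 0
    for outer i in 1:3   # `outer` is a keyword here: rebind the enclosing local `i`
    end
    return i             # 3
end

outer = 1                # here `outer` is just an ordinary variable name
```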
What I'm saying is that whenever a keyword is encountered as non-trivia, it should be treated as an identifier. We already have nodes other than K"Identifier" that are treated as identifiers (K"macro_name", K"var", and this would add K"dots"), so whatever higher-level API looks at these already needs to be aware of the possibility that there could be non-K"Identifier" identifiers, so why do we need to remap?
Ok, I understand. But this further complicates the public API of the output data structure in order to satisfy an internal invariant - an invariant that I'm not yet convinced is particularly helpful (maybe you can convince me). My point of view is that, during parsing, the parser has figured out that
What I'm saying is that any downstream tooling that explicitly looks at K"Identifier" is wrong, because there are non-K"Identifier" identifiers already (that we cannot remap). Now, conceivably we could restore the property that all identifiers are K"Identifier" and distinguish the various forms with flags, but that's not where we are right now.
This replaces all the specialized operator heads by a single K"Operator" head that encodes the precedence level in its flags (except for operators that are also used for non-operator purposes). The operators are already K"Identifier" in the final parse tree. There is very little reason to spend all of the extra effort separating them into separate heads only to undo this later. Moreover, I think it's actively misleading, because it makes people think that they can query things about an operator by looking at the head, which doesn't work for suffixed operators. Additionally, this removes the `op=` token, replacing it by two tokens, one K"Operator" with a special precedence level and one `=`. This then removes the last use of `bump_split` (since this PR is on top of #573). As a free bonus, this prepares us for having compound assignment syntax for suffixed operators, which was infeasible in the flisp parser. That syntax change is not part of this PR but would be trivial (this PR makes it an explicit error). Fixes #334
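A rough sketch of the encoding described above; the type, field layout, and precedence constants are assumptions for illustration, not the PR's actual implementation:

```julia
# Hypothetical head representation: a single Operator kind, with the
# precedence class carried in the low bits of the flags.
const PREC_ASSIGNMENT = UInt16(1)   # arbitrary example values
const PREC_PLUS       = UInt16(9)
const PREC_TIMES      = UInt16(10)

struct OpHead
    kind::Symbol     # always :Operator in this sketch
    flags::UInt16    # low bits carry the precedence class
end

prec_class(h::OpHead) = h.flags & 0x000f

# A compound assignment `x op= y` would then lex as two tokens: one Operator
# token (carrying the precedence class of `op`) followed by a separate `=`.
h = OpHead(:Operator, PREC_PLUS)
@assert prec_class(h) == PREC_PLUS
```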
#575 - not 20 minutes exactly, but I did some extra cleanup
Somewhat agree. Typically I write "big switch" style code to process this stuff
In which case it's not wrong to refer to
But I appreciate the naming can be a trap for new consumers of the API: they could easily assume
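A hedged sketch of that "big switch" style, using the kind names mentioned in this thread (the `kindname` accessor and the exact kind set are assumptions, not the real API):

```julia
# Sketch only: downstream code treating all identifier-like heads uniformly.
kindname(node) = node.kind           # placeholder accessor for this sketch

function classify(node)
    k = kindname(node)
    if k in ("Identifier", "var", "macro_name", "dots")
        return :identifier           # all identifier-like heads get one treatment
    elseif k == "Operator"
        return :operator
    else
        return :other
    end
end

classify((kind = "dots",))           # => :identifier
```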
My proposal would be to make
Alright @c42f - how do we move forward on this? I have a mild preference for the