Stop emitting K".." and K"..." in lexer #573

Open: wants to merge 1 commit into main
Conversation


@Keno Keno commented Jul 10, 2025

Unfortunately, the sequences .. and ... do not always refer to the .. operator or the ... syntax. There are two and a half cases where they don't:

  1. After @ in macrocall, where they are both regular identifiers
  2. In import ...A where the dots specify the level
  3. :(...) treats ... as quoted identifier

Case 1 was handled in a previous commit by lexing these as identifiers after @.

However, as a result of case 2, it is problematic to tokenize these dots together; we essentially have to untokenize them in the import parser. It is also infeasible to change the lexer to have special context-sensitive lexing in import, because there could be arbitrary interpolations, @eval import A, $(f(x..y)), ..b, so deciding whether a particular .. after import refers to the operator or a level specifier requires the parser.

Currently the parser handles this by splitting the obtained tokens again in the import parser, but this is undesirable, because it invalidates the invariant that the tokens produced by the lexer correspond to the terminals of the final parse tree.

This PR attempts to address this by only ever having the lexer emit K"." and having the parser decide which case it refers to. The new non-terminal K"dots" handles the identifier cases (ordinary .. and quoted :(...) ). K"..." is now exclusively used for splat/slurp, and is no longer used in its non-terminal form for case 3.
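To illustrate the approach, here is a toy Python sketch, not JuliaSyntax's actual API, in which the lexer only ever emits single-dot tokens and the parser groups a run of dots according to context (all kind names here are hypothetical stand-ins):

```python
# Toy model of the approach: the lexer never glues dots; the parser
# groups a run of '.' tokens depending on context.

def lex_dots(src):
    """Emit one token per character; '.' is never glued by the lexer."""
    return [("." if ch == "." else "chr", ch) for ch in src]

def group_dots(tokens, in_import):
    """Group consecutive '.' tokens: inside `import` they count the
    relative level; elsewhere a run of 2 or 3 dots becomes one node."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i][0] == ".":
            j = i
            while j < len(tokens) and tokens[j][0] == ".":
                j += 1
            n = j - i
            if in_import:
                out.append(("import-level", n))   # case 2: level specifier
            elif n == 3:
                out.append(("...", "..."))        # splat/slurp
            elif n == 2:
                out.append(("dots", ".."))        # identifier-like `..`
            else:
                out.append((".", "."))            # plain dot (field access etc.)
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out

assert group_dots(lex_dots("...A"), in_import=True)[0] == ("import-level", 3)
assert group_dots(lex_dots("x..y"), in_import=False)[1] == ("dots", "..")
```

The point of the sketch is only that the grouping decision lives in one place in the parser, so no token ever needs to be split again.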

@Keno Keno requested review from c42f and mlechu July 10, 2025 02:18

c42f commented Jul 10, 2025

Currently the parser handles this by splitting the obtained tokens again in the import parser, but this is undesirable, because it invalidates the invariant that the tokens produced by the lexer correspond to the non-terminals of the final parse tree.

Assuming you mean terminals of the parse tree here...

So - I do agree that bump_split is super ugly and it's great to delete it. On the other hand, I have much less of an issue with bump_glue(), because pasting tokens together doesn't feel like peeking at the text inside them (we used to do some of this ... unsure if there's any left, and admittedly it doesn't technically apply to K"...")

IMO, this is actually the main invariant we should aim for when designing token kinds: the parser should never need to look at the text inside a token in order to produce the parse tree. And this is the reason we have things like K"doc" as a pseudo-keyword even though it's not exactly a keyword: @doc does still have special whitespace rules (unfortunately?! heh)

So my question is - did you consider just bump_glue() to glue the K"." tokens together in the parser (into K"...", for example) rather than a K"dots" nonterminal? To my mind that is somehow more appealing and feels similar in spirit to all the remap_kind we do, where we second guess the tokenizer quite a lot.

So - what made K"dots" more appealing to you? Is it actually important that we try to delete bump_glue()?

Base automatically changed from kf/macroname to main July 10, 2025 18:08

Keno commented Jul 10, 2025

Assuming you mean terminals of the parse tree here...

Yes

So - I do agree that bump_split is super ugly and it's great to delete it. On the other hand, I have much less of an issue with bump_glue(), because pasting tokens together doesn't feel like peeking at the text inside them (we used to do some of this ... unsure if there's any left, and admittedly it doesn't technically apply to K"...")

So I do agree that bump_glue is much less bad than bump_split, but on the other hand, I also don't see a particularly strong motivation for it - you can always create a non-terminal that glues these together. I do think there is value in having the token invariant I mentioned for an internal self-consistency check of the parser. My intuition is also that token gluing causes extra work for incremental re-parse to unglue the tokens, but I don't have strong evidence for this. Of course you still get a relaxed version of it if you allow token gluing, but if we don't have to have it, I just fail to see a strong motivation to require it.

IMO, this is actually the main invariant we should aim for when designing token kinds: the parser should never need to look at the text inside a token in order to produce the parse tree.

Yes, I agree that that's an important constraint on the implementation, but it's not really an invariant of the output data structure. One of the big things I'm trying to nail down is what the data structure invariants of the output are to be able to test structural correctness of the parser (which is a distinct concept from correctness of the language implemented by the parser).

So my question is - did you consider just bump_glue() to glue the K"." tokens together in the parser (into K"...", for example) rather than a K"dots" nonterminal?

Yes, but I wanted to avoid using bump_glue more. I also don't like using the same syntax head with two different meanings depending on whether it's a terminal or non-terminal (I know we do that all over the place, but still).

To my mind that is somehow more appealing and feels similar in spirit to all the remap_kind we do, where we second guess the tokenizer quite a lot.

Well, I don't really like the remapping either, but that's for a future PR.


c42f commented Jul 10, 2025

but if we don't have to have it, I just fail to see a strong motivation to require it

Haha I actually find this the most compelling argument for removing it.

I do "like" token gluing in the sense that it allows .. to be a single token, and that no other operators are more than one token, currently (I think?)

So removing gluing breaks this invariant of the output which we currently have. Maybe that's not something to worry about though.

There might be other cases where it would be useful to create more rather than fewer tokens? For example, operator suffixes; +₁ for example: the tokenizer currently treats suffixes in one place independently of the operator start character, but then mashes them together with the operator start character into a single token. It's possible these would make sense as two separate tokens.

to be able to test structural correctness of the parser

Seems reasonable - what kind of tests are you imagining?

I also don't like using the same syntax head with two different meaning depending on whether it's a terminal or non-terminal

Yes I agree this feels weird.

An alternative I considered in the past is to have different kinds, e.g. for functions - K"function" and K"function_token" or something, but this breaks the property that the head names are "self-representing" in the way they are named in the source code, and I didn't want to lose that.

A reasonable middle ground could be to have a new string macro so we'd have K"function" for the internal node and KT"function" for the token.

Well, I don't really like the remapping either, but that's for a future PR.

Julia has many cases where a token might be a keyword or identifier depending on parser context. I don't love this, but I don't know what to do about it.


Keno commented Jul 10, 2025

I do "like" token gluing in the sense that it allows .. to be a single token, and that no other operators are more than one token, currently (I think?)

If it makes you feel better, you can think of .. as dotted . ;). But yes, I agree the .. thing is a little ugly, but it also seems like too much of a corner case to add a whole token gluing feature for.

There might be other cases where it would be useful to create more rather than fewer tokens? For example, operator suffixes; +₁ for example: the tokenizer currently treats suffixes in one place independently of the operator start character, but then mashes them together with the operator start character into a single token. It's possible these would make sense as two separate tokens.

I actually want to go the other way and remove the operator-specific heads for these - WIP PR coming shortly.

Seems reasonable - what kind of tests are you imagining?

Run the lexer, then run the parser, make sure the token ranges match. Validate that the total number of children matches the total number of nodes, and validate the node summation invariants. This should catch the most obvious ways in which you could mess up the parser state - bumping incorrect things into the output or emitting something with a mark that isn't actually valid.
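As an illustration of the range-based part of such checks, here is a toy Python sketch over a dict-based tree; the node layout and names are hypothetical, not JuliaSyntax's actual representation:

```python
# Toy structural checks in the spirit described above: the tree's leaves
# must be exactly the lexer's tokens, and every interior node's byte
# range must be the gapless union of its children's ranges.

def terminals(node):
    """Yield (kind, range) for the tree's leaves, left to right."""
    if not node.get("children"):
        yield (node["kind"], node["range"])
    else:
        for child in node["children"]:
            yield from terminals(child)

def check_tree(tree, lexer_tokens):
    # The terminals of the parse tree are exactly the tokens the lexer made.
    assert list(terminals(tree)) == lexer_tokens
    # Node ranges sum up: each parent spans its children with no gaps.
    def walk(node):
        kids = node.get("children")
        if kids:
            assert node["range"][0] == kids[0]["range"][0]
            assert node["range"][1] == kids[-1]["range"][1]
            for a, b in zip(kids, kids[1:]):
                assert a["range"][1] == b["range"][0]
            for child in kids:
                walk(child)
    walk(tree)
    return True

# A tree for `a+b`, with byte ranges, checked against its three tokens:
tokens = [("Identifier", (0, 1)), ("Operator", (1, 2)), ("Identifier", (2, 3))]
tree = {"kind": "call", "range": (0, 3), "children": [
    {"kind": "Identifier", "range": (0, 1)},
    {"kind": "Operator", "range": (1, 2)},
    {"kind": "Identifier", "range": (2, 3)},
]}
assert check_tree(tree, tokens)
```

A parser that splits or glues tokens after lexing would fail the first assertion, which is the self-consistency value of the invariant.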

Julia has many cases where a token might be a keyword or identifier depending on parser context. I don't love this, but I don't know what to do about it.

Keywords used as not-identifiers should always have the trivia flag set, no?


Keno commented Jul 10, 2025

I guess for this PR the other option is that we can always emit ... as K".", K"..", treat these as trivia and use the numeric flags for the ascension depth in importpath - that would let us retain K".." as a terminal while also getting rid of the splitting.


c42f commented Jul 11, 2025

If it makes you feel better, you can think of .. as dotted . ;).

Haha! No, I'm afraid it doesn't 😆

I feel like it's strictly worse to have K"." and K".." split in an ad hoc way which "just happens" to work. If we're doing individual K"." it's nice and systematic to always just emit a sequence of them and have the parser group them as necessary.

seems like too much of a corner case to add a whole token gluing feature for.

I agree. There's only one other case I think: negative integer literals. I don't know if those are essential either; maybe we could emit them as an interior node as well.

I actually want to go the other way and remove the operator-specific heads for these - WIP PR coming shortly.

You mean this issue #334 ?

I do think we should eliminate all the kinds which are inessential to parsing. The only things we'll have left are precedence classes of operators, plus a few specific operator kinds which are treated specially during parsing, such as K"+" for parse_with_chains().

So I'm not suggesting adding more kinds for operators. The thought bubble here was that, perhaps, operators such as ^₁₁ might make sense to be tokenized as "^"::PowerOperator "₁₁"::OpSuffix or some such. I don't know if this makes sense, though.
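That thought bubble can be sketched as follows; the kind names and the suffix set are hypothetical, and this is a Python toy rather than anything the tokenizer actually does:

```python
# Sketch: tokenize a suffixed operator as two tokens (operator start +
# suffix run) instead of mashing them into one.

SUFFIX = set("₀₁₂₃₄₅₆₇₈₉′")   # a small subset of Julia's allowed suffix chars

def split_suffixed(op):
    """Split an operator string into (text, kind) token pairs."""
    head = op.rstrip("".join(SUFFIX))
    toks = [(head, "PowerOperator" if head == "^" else "Operator")]
    if len(head) < len(op):
        toks.append((op[len(head):], "OpSuffix"))
    return toks

assert split_suffixed("^₁₁") == [("^", "PowerOperator"), ("₁₁", "OpSuffix")]
assert split_suffixed("+₁") == [("+", "Operator"), ("₁", "OpSuffix")]
assert split_suffixed("+") == [("+", "Operator")]
```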


c42f commented Jul 11, 2025

Not to get too distracted about operators - I added a couple more thoughts about reducing the number of kinds into #334 (comment)


Keno commented Jul 11, 2025

Not to get too distracted about operators - I added a couple more thoughts about reducing the number of kinds into #334 (comment)

I'll have a PR for this in about 20 mins, so maybe comment there then ;)


c42f commented Jul 11, 2025

Julia has many cases where a token might be a keyword or identifier depending on parser context. I don't love this, but I don't know what to do about it.

Keywords used as not-identifiers should always have the trivia flag set, no?

Yes.

It's the opposite which is the problem.

Consider outer = 100

The tokenizer needs to emit K"outer", or equivalent, so that this context-dependent keyword can be detected in cases when it's a keyword.

But when it's not used as a keyword, we remap it to K"Identifier". The parser needs to make that decision and I don't think we can or should try to avoid that (the way to avoid this cleanly, imo, is to just not allow such context-dependent keywords in the language, but I guess that ship sailed a long time ago).

I don't know how to avoid this; we can shift the problem around, or use different terminology for it, but it seems unavoidable. Maybe, maybe, we could insert extra tokenizer state or checks to try to get this correct for outer, but the parser already has all the context it needs, and complex tokenizer state hacks are very ugly and play badly with parser recovery. (The stuff around tokenizing string interpolations is extremely evil IMO)


Keno commented Jul 11, 2025

The tokenizer needs to emit K"outer", or equivalent, so that this context-dependent keyword can be detected in cases when it's a keyword.

What I'm saying is that whenever a keyword is encountered as non-trivia, it should be treated as an identifier. We already have nodes other than K"Identifier" that are treated as identifiers (K"macro_name", K"var", this would add K"dots"), so whatever higher level API looks at these already needs to be aware of the possibility that there could be non-K"Identifier" identifiers, so why do we need to remap?


c42f commented Jul 11, 2025

What I'm saying is that whenever a keyword is encountered as non-trivia, it should be treated as an identifier.

Ok, I understand.

But this further complicates the public API of the output data structure in order to satisfy an internal invariant - an invariant that I'm not yet convinced is particularly helpful (maybe you can convince me).

My point of view is that, during parsing, the parser has figured out that K"outer" is in Identifier-position and should only ever be treated as a normal K"Identifier" by downstream tooling. So that decision should be clearly and simply recorded. Otherwise we force the consumers of our API to figure this out again and cater to more special cases.


Keno commented Jul 11, 2025

K"Identifier" by downstream tooling

What I'm saying is that any downstream tooling that explicitly looks at K"Identifier" is wrong because there are non K"Identifier" identifiers already (that we cannot remap). Now, conceivably we could restore the property that all identifiers are K"Identifier" and distinguish the various forms with flags, but that's not where we are right now.

Keno added a commit that referenced this pull request Jul 11, 2025
This replaces all the specialized operator heads by a single K"Operator"
head that encodes the precedence level in its flags (except for operators
that are also used for non-operator purposes). The operators are
already K"Identifier" in the final parse tree. There is very little
reason to spend all of the extra effort separating them into separate
heads only to undo this later. Moreover, I think it's actively misleading,
because it makes people think that they can query things about an operator
by looking at the head, which doesn't work for suffixed operators.

Additionally, this removes the `op=` token, replacing it by two tokens,
one K"Operator" with a special precedence level and one `=`. This then
removes the last use of `bump_split` (since this PR is on top of #573).

As a free bonus this prepares us for having compound assignment syntax
for suffixed operators, which was infeasible in the flisp parser. That
syntax change is not part of this PR but would be trivial (this PR
makes it an explicit error).

Fixes #334
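The flag-packing idea in that commit message can be illustrated with a minimal bitfield sketch; the field widths and kind ids below are made up for illustration, not JuliaSyntax's actual layout:

```python
# Sketch: encode an operator's precedence level in the low bits of a
# head word, with the kind id in the higher bits.

PREC_BITS = 5                       # room for up to 32 precedence levels
PREC_MASK = (1 << PREC_BITS) - 1

def make_head(kind_id, prec):
    """Pack a kind id and a precedence level into one integer."""
    assert 0 <= prec <= PREC_MASK
    return (kind_id << PREC_BITS) | prec

def kind_of(head):
    return head >> PREC_BITS

def prec_of(head):
    return head & PREC_MASK

OPERATOR = 7                        # hypothetical id standing in for K"Operator"
h = make_head(OPERATOR, 11)         # e.g. some power-operator level
assert (kind_of(h), prec_of(h)) == (OPERATOR, 11)
```

The payoff is that the parser can compare precedence with one mask instead of branching on dozens of operator-specific heads.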

Keno commented Jul 11, 2025

I'll have a PR for this in about 20 mins, so maybe comment there then ;)

#575 - not 20 minutes exactly, but I did some extra cleanup


c42f commented Jul 11, 2025

What I'm saying is that any downstream tooling that explicitly looks at K"Identifier" is wrong because there are non K"Identifier" identifiers already

Somewhat agree. Typically I write "big switch" style code to process this stuff

if k == K"Identifier"
    ...
elseif k == K"var"
    ...
...
end

In which case it's not wrong to refer to K"Identifier" explicitly.

But I appreciate the naming can be a trap for new consumers of the API: they could easily assume K"Identifier" is the only kind of thing which can be an identifier. It would be nice to make this trap harder to fall into.


Keno commented Jul 11, 2025

But I appreciate the naming can be a trap for new consumers of the API: they could easily assume K"Identifier" is the only kind of thing which can be an identifier. It would be nice to make this trap harder to fall into.

My proposal would be to make KSet"Identifier" work and cover all identifier-like things and rename K"Identifier" to something like K"PlainIdentifier" or K"RawIdentifier" and tell people to use the former.
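That proposal amounts to giving consumers a membership test rather than a single-kind comparison; a minimal sketch, with all kind names hypothetical rather than JuliaSyntax's current API:

```python
# Sketch: one set covering every identifier-like kind, so downstream
# tooling asks "is this an identifier?" instead of kind == "Identifier".

KSET_IDENTIFIER = frozenset({"PlainIdentifier", "var", "macro_name", "dots"})

def is_identifier(kind):
    """Membership test covering all identifier-like kinds."""
    return kind in KSET_IDENTIFIER

assert is_identifier("dots")          # ordinary .. / quoted :(...)
assert is_identifier("macro_name")
assert not is_identifier("Operator")
```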


Keno commented Jul 15, 2025

Alright @c42f - how do we move forward on this? I have a mild preference for the dots head, but if you absolutely don't like it, I can replace it with bump_glue for the time being while we figure out the larger questions on the nature of identifiers.
