Remove DOTOP_FLAG from tokenizer and parser #568

Keno · 2025-07-01T01:04:04Z

This implements the first of a series of AST format changes described in #567. In particular, this removes the `DOTOP_FLAG`. Currently, we already do not emit the `DOTOP_FLAG` on terminals; they always get split into one dot token and one identifier token (although we did set it on the intermediate tokens that came out of the lexer). The only four kinds for which `DOTOP_FLAG` was ever set in the final AST were `=`, `op=`, `&&` and `||`. This introduces separate head kinds for each of these (similar to how there are already separate head kinds for `dotcall` and `dot`). Otherwise the AST structure should be unchanged.
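As a rough illustration of the change described above, here is a toy Python model (not JuliaSyntax code) of folding a dot flag into dedicated head kinds. The `Head` class, the flag value, and the dotted kind names are all assumptions made up for this sketch; only the list of four affected kinds comes from the description.

```python
from dataclasses import dataclass

# Hypothetical flag bit; the real value in the parser may differ.
DOTOP_FLAG = 1 << 0

@dataclass(frozen=True)
class Head:
    kind: str
    flags: int = 0

# The four kinds named in the description, mapped to made-up dotted kind names.
DOTTED_KIND = {"=": ".=", "op=": ".op=", "&&": ".&&", "||": ".||"}

def remove_dotop_flag(head: Head) -> Head:
    """Replace (kind, DOTOP_FLAG) with a dedicated dotted head kind."""
    if not head.flags & DOTOP_FLAG:
        return head  # undotted heads are left untouched
    if head.kind not in DOTTED_KIND:
        raise ValueError(f"dot flag is not expected on kind {head.kind!r}")
    # Clear the flag bit and switch to the dedicated kind.
    return Head(DOTTED_KIND[head.kind], head.flags & ~DOTOP_FLAG)
```

In this model, a consumer that used to test a flag bit would instead match on the head kind, which is the same pattern the description mentions for `dotcall` and `dot`.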

Keno · 2025-07-01T04:49:02Z

I was waffling a bit on this change. In the end I decided I still want to do this, but I figured it'd be good to write down my reasoning for future reference.

I think the primary thing I don't like about this is that it really does seem like something like `.==` should be a single token at lex time. It's indivisible by whitespace and (with the exception of import syntax) not context sensitive. On the other hand, I do think that `(. ==)` is the correct parsing for it, since (in Julia at least) we compose it out of the more primitive concepts of `==` and broadcasting. Concretely, I don't think anyone would argue that we should parse broadcasting `f.` as a single token. Yes, the operator character space is more limited, but it's still infinite and basically an arbitrary identifier (subject to the well-formedness constraints of the operator character set). That said, if we go down the road of merging all the operator heads into a single head, I think it might be reasonable to reserve a flag bit to flag valid operators after `.` to serve the needs of token-based highlighters.
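To make the two views concrete, here is a small Python sketch (a toy model, not the actual lexer) that splits a dotted operator into a dot token plus an operator token and renders the composed form. The function names and the tiny operator set are assumptions for illustration only.

```python
# Toy operator set; the real operator character set is far larger.
OPERATORS = {"==", "+", "*", "<"}

def lex_dotted(text: str) -> list[str]:
    """Split a dotted operator such as '.==' into a dot token and an operator token."""
    if len(text) > 1 and text[0] == "." and text[1:] in OPERATORS:
        return [".", text[1:]]
    return [text]  # anything else stays a single token

def to_sexpr(tokens: list[str]) -> str:
    """Render the composed form: broadcasting applied to a plain operator."""
    if len(tokens) == 2 and tokens[0] == ".":
        return f"(. {tokens[1]})"
    return tokens[0]
```

Under these assumptions, `to_sexpr(lex_dotted(".=="))` gives `(. ==)`, while an undotted `==` passes through unchanged.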

More generally, this has gotten me thinking about the whole concept of token splitting and remapping. I think it would be nice if we were able to have an invariant that erasing the tree structure of the parse tree gave us back the token stream. With this PR, there are only two uses of `bump_split` left: one in the import parsing and one in the handling of `op=`. However, I think the same reasoning that applies here to dot basically applies to `op=` as well, in that I don't think it should be an atomic token syntax head. Then we only have import left, which is basically its own thing anyway and could possibly be its own lex context.
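The invariant mentioned here can be sketched as follows in Python. This is a toy model under stated assumptions: the `Node` class, the head names, and the tree shape are hypothetical, and whitespace tokens are ignored. It only shows what "erasing the tree structure gives back the token stream" would mean, and how a token that is split during parsing breaks it.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    head: str                                      # hypothetical head name
    children: list = field(default_factory=list)   # Node, or str for a token

def leaves(node: Node) -> list[str]:
    """Erase the tree structure: collect the token texts left to right."""
    out = []
    for child in node.children:
        if isinstance(child, Node):
            out.extend(leaves(child))
        else:
            out.append(child)
    return out

# Tree for something like `a .== b`, with the dotted operator as two leaves.
tree = Node("call", ["a", Node("dot", [".", "=="]), "b"])

lexed_split = ["a", ".", "==", "b"]    # lexer already emits two tokens
lexed_unsplit = ["a", ".==", "b"]      # lexer emits one token, parser splits it

invariant_holds = leaves(tree) == lexed_split        # True
invariant_broken = leaves(tree) != lexed_unsplit     # True: splitting breaks it
```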

I need to mull this over some more, but that seems like a reasonable direction to me.

Keno · 2025-07-02T05:39:43Z

> Then we only have import left, which is basically its own thing anyway and could possibly be its own lex context.

This doesn't work because `import` can take interpolations, which can have arbitrary syntax in them, so you can't lex it differently.

Keno requested a review from mlechu July 1, 2025 01:04

Keno mentioned this pull request Jun 30, 2025 in [META] Planned changes to AST structure #567 (open, 4 tasks)

Keno merged commit 2deac15 into main Jul 2, 2025
36 checks passed

Keno deleted the kf/rmdotflag branch July 2, 2025 00:43
