Remove DOTOP_FLAG from tokenizer and parser #568


Merged
merged 1 commit into from
Jul 2, 2025

Conversation

Keno
Member

@Keno Keno commented Jul 1, 2025

This implements the first of a series of AST format changes described in #567: it removes the `DOTOP_FLAG`. We already do not emit `DOTOP_FLAG` on terminals; they are always split into one dot token and one identifier token (although the flag was set on the intermediate tokens that came out of the lexer). The only four kinds for which `DOTOP_FLAG` was ever set in the final AST were `=`, `op=`, `&&` and `||`. This PR introduces a separate head kind for each of these (similar to how there are already separate heads for `dotcall` and `dot`). Otherwise the AST structure should be unchanged.
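
For readers who want a concrete picture of the re-encoding, here is a minimal toy model in Python rather than the actual JuliaSyntax code. It assumes a head is a `(kind, flags)` pair; the bit value of `DOTOP_FLAG` and the dotted kind names (`.=`, `.op=`, `.&&`, `.||`) are hypothetical placeholders, since the description above does not spell out the new names.

```python
from dataclasses import dataclass

# Hypothetical bit position, chosen only for illustration.
DOTOP_FLAG = 1 << 1

# Hypothetical names for the dedicated dotted head kinds; the only
# four kinds that ever carried the flag in the final AST.
DOTTED_KIND = {"=": ".=", "op=": ".op=", "&&": ".&&", "||": ".||"}


@dataclass(frozen=True)
class Head:
    kind: str
    flags: int = 0


def remove_dotop_flag(head: Head) -> Head:
    """Re-encode a flag-based dotted head as a dedicated head kind."""
    if not head.flags & DOTOP_FLAG:
        return head  # undotted heads are left untouched
    if head.kind not in DOTTED_KIND:
        raise ValueError(f"DOTOP_FLAG is not expected on kind {head.kind!r}")
    # Clear the flag bit and switch to the dedicated kind.
    return Head(DOTTED_KIND[head.kind], head.flags & ~DOTOP_FLAG)
```

In this sketch the dottedness moves from a flag bit into the kind itself, so a consumer only has to match on the kind and never has to remember to check a flag.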

@Keno Keno requested a review from mlechu July 1, 2025 01:04
@Keno
Member Author

Keno commented Jul 1, 2025

I was waffling a bit on this change. In the end I decided I still want to do it, but I figured it'd be good to write down my reasoning for future reference.

I think the primary thing I don't like about this is that it really does seem like something like `.==` should be a single token at lex time. It's indivisible by whitespace and (with the exception of import syntax) not context sensitive. On the other hand, I do think that `(. ==)` is the correct parsing for it, since (in Julia at least) we compose it out of the more primitive concepts of `==` and broadcasting. Concretely, I don't think anyone would argue that we should parse broadcasting `f.` as a single token. Yes, the operator character space is more limited, but it's still infinite and basically an arbitrary identifier (subject to the well-formedness constraints of the operator character set). That said, if we go down the road of merging all the operator heads into a single head, I think it might be reasonable to reserve a flag bit to mark valid operators after `.` to serve the needs of token-based highlighters.

More generally, this has gotten me thinking about the whole concept of token splitting and remapping. I think it would be nice if we had an invariant that erasing the tree structure of the parse tree gives us back the token stream. With this PR, there are only two uses of `bump_split` left: one in the import parsing and one in the handling of `op=`. However, I think the same reasoning that applies here to dot basically applies to `op=` as well, in that I don't think it should be an atomic token syntax head. Then we only have import left, which is basically its own thing anyway and could possibly be its own lex context.
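
To make the proposed invariant concrete, here is a small Python sketch, again a toy model and not JuliaSyntax itself. `Node`, `leaves` and `preserves_token_stream` are hypothetical names, whitespace trivia is ignored, and the flat tree shape used for `a .== b` is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    head: str
    children: list = field(default_factory=list)
    text: str = ""  # source text; only meaningful on leaves


def leaves(node: Node) -> list:
    """Erase the tree structure, keeping leaf texts in source order."""
    if not node.children:
        return [node.text]
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def preserves_token_stream(tree: Node, tokens: list) -> bool:
    """Check the invariant: flattening the tree gives back the tokens."""
    return leaves(tree) == tokens


# Assumed tree for `a .== b`, with the dot and the operator as two leaves.
tree = Node("dotcall", [
    Node("Identifier", text="a"),
    Node(".", text="."),
    Node("==", text="=="),
    Node("Identifier", text="b"),
])
```

If the lexer hands the parser `.==` as one token and the parser splits it, then `preserves_token_stream(tree, ["a", ".==", "b"])` is false; it only holds when the lexer already emits `.` and `==` separately, which is the situation the invariant asks for.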

I need to mull this over some more, but that seems like a reasonable direction to me.

@Keno Keno merged commit 2deac15 into main Jul 2, 2025
36 checks passed
@Keno Keno deleted the kf/rmdotflag branch July 2, 2025 00:43
@Keno
Member Author

Keno commented Jul 2, 2025

> Then we only have import left, which is basically its own thing anyway and could possibly be its own lex context.

This doesn't work because import can take interpolations, which can have arbitrary syntax in them, so you can't lex it differently.
