Tree rewriting #905

Xophmeister · 2025-03-11T14:56:23Z

Xophmeister
Mar 11, 2025
Maintainer

This topic has been on my mind for some time and has recently come up from external collaborators. Rather than posting this as a feature request issue, I'll post it as a discussion...for reasons that will (I hope) become clear.

The problem

Sometimes it's useful to be able to modify a node. We currently do this (sparingly) with insertions -- using @{append,prepend}_delimiter -- and deletions. However, this suffers from at least the following problems:

It's a bit cumbersome to write these queries.
The order in which queries are evaluated matters and, anecdotally, there's a "complexity limit" where this technique breaks down.¹
It's impossible to change the order of nodes. (This discussion is partly prompted by an idea from @ctdunc of transpiling Markdown into HTML, in which the order of nodes in links effectively switches; see the example below.)

Thoughts

Tree-sitter queries are a bit like "higher-dimensional regular expressions" (i.e., acting on trees, rather than strings), so maybe we can push that analogy on the familiar s/<pattern>/<replacement>/ idiom. Since the queries give us pattern matching for free, maybe something like this would work:

; Rewrite Markdown links as HTML
; i.e. [text](link)  -->  <a href="link">text</a> 
(
  (link
    (link_text) @_1
    (link_href) @_2
  ) @rewrite

  (#rewrite! "<a href=\"\2\">\1</a>")
)
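To make the intended semantics of the proposed #rewrite! string concrete, here is a minimal Python sketch of the substitution step only. It assumes that each `\N` in the template is replaced by the source text of the capture `@_N`; the function name `expand_rewrite` and the `captures` mapping are hypothetical and nothing like this exists in Topiary today.

```python
import re


def expand_rewrite(template: str, captures: dict[int, str]) -> str:
    """Replace each \\N in `template` with the source text of capture @_N.

    Hypothetical helper: this only illustrates the proposed semantics.
    """

    def substitute(match: re.Match) -> str:
        index = int(match.group(1))
        if index not in captures:
            raise KeyError(f"no capture @_{index} in this match")
        # The captured text is inserted verbatim, including any newlines.
        return captures[index]

    return re.sub(r"\\(\d+)", substitute, template)


# Assumed captured text for the Markdown link [text](https://example.com)
captures = {1: "text", 2: "https://example.com"}
html = expand_rewrite(r'<a href="\2">\1</a>', captures)
print(html)
```

Note that the result is a plain string rather than a tree, which is what makes this a tree-to-string transformation.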

On the surface, this might seem acceptable, but there are deeper problems that prevent this from working as one might expect. We're in the realm of tree transducers (specifically, what's proposed above is a tree-to-string transducer) and these aren't as well-behaved as their lower-dimensional, regular expression counterparts.

For example:

Topiary's normal capture names (e.g., @append_space) could not appear in a @rewrite query as it wouldn't make sense (with the possible exception of @do_nothing). This has consequences on query parsing and matching, which now need to keep track of two states.
Rewriting a subtree is going to remove any multi-line-ness from the input (unless it's explicitly added into the rewrite string). This flouts one of Topiary's key design principles.

While the above are certainly problems, they're (probably) not insurmountable. However, there's also a genuine deal breaker: the evaluation order problem, mentioned above with inserts-and-deletes, becomes pathological. For example, consider a rewrite rule that targets tuples of the form (A,B), where A and B could themselves be tuples (i.e., nesting is allowed), or otherwise. Something like:

(
  (tuple
    "(" . (_) @_1 . "," . (_) @_2 . ")"
  ) @rewrite

  (#rewrite! REWRITE_RULE)
)

Imagine this rule appears three times in a query file -- as there's no reason they can't coexist -- with different values for REWRITE_RULE:

Swap: "(\2,\1)"
Replace: "(\1, \1)"
Surround: "(\1,(\2,\1))"

In this case, what should happen to the tuple (1,(2,3))?

((2,2),1)
(1,1)
(1,((2,2),1))
((2,(3,2)),1)
((3,2),1)
(1,((3,2),1))
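The ambiguity can be made concrete with a small Python sketch. It models tuples as nested Python 2-tuples and assumes one particular strategy, which is only one of several conceivable ones: tuples are rewritten innermost first, exactly one of the three rules is chosen per tuple node, and rewritten output is never matched again. All names here are illustrative.

```python
from itertools import product

# A tuple node is a Python 2-tuple; anything else is a leaf.
RULES = {
    "swap": lambda a, b: (b, a),
    "replace": lambda a, b: (a, a),
    "surround": lambda a, b: (a, (b, a)),
}


def count_tuples(tree) -> int:
    """Number of tuple nodes in the tree."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + sum(count_tuples(child) for child in tree)


def rewrite(tree, choices):
    """Rewrite innermost tuples first, consuming one rule name per tuple."""
    if not isinstance(tree, tuple):
        return tree
    a, b = (rewrite(child, choices) for child in tree)
    return RULES[next(choices)](a, b)


def show(tree) -> str:
    """Render a tree in the (A,B) notation used above."""
    if isinstance(tree, tuple):
        return "(" + ",".join(show(child) for child in tree) + ")"
    return str(tree)


source = (1, (2, 3))
# Try every way of assigning one rule to each tuple node.
outcomes = {
    show(rewrite(source, iter(assignment)))
    for assignment in product(RULES, repeat=count_tuples(source))
}
for outcome in sorted(outcomes):
    print(outcome)
```

Even under this restrictive strategy, the single input produces several distinct outputs; allowing rules to fire on already-rewritten output would add more.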

This is just a simple example! Broadly speaking, the output from a tree transducer should not exist in the same "space" as its input, to prevent this recursive/ambiguity problem.² However, that's not the case with the Topiary formatting engine; this class of transformation is very different from what Topiary currently does.

However, there is potentially scope for implementing a deterministic top-down tree transducer (DTOP) engine within Topiary to enable rewrites. This would have to be separate from formatting (e.g., exposed through topiary rewrite) and would necessitate a different set of queries, completely orthogonal to the formatting queries. This engine would have to look at the Tree-sitter tree holistically, as a tree-to-tree DTOP, rather than the piecemeal approach that the formatting engine takes. That could be to Tree-sitter what XSLT is to XML.
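As a rough illustration of what "deterministic top-down" means here, the following Python sketch implements a toy tree-to-tree transducer. Everything in it is an assumption made for illustration: the `Node` type, the state names `q` and `attr`, and the node kinds (borrowed from the Markdown link example) are not Topiary or Tree-sitter APIs. Rules are keyed by (state, input node kind), so at most one rule can apply to any node, and the output tree is built from a separate vocabulary that is never matched again.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    kind: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


# A rule receives the input node and a callback for transducing a child
# in a given state; it returns the corresponding output node.
Run = Callable[[str, Node], Node]
Rule = Callable[[Node, Run], Node]


def document_rule(node: Node, run: Run) -> Node:
    return Node("html", children=[run("q", child) for child in node.children])


def link_rule(node: Node, run: Run) -> Node:
    text, href = node.children
    # The children are emitted in the opposite order to the input.
    return Node("a", children=[run("attr", href), run("q", text)])


# Keying on (state, kind) is what makes the transducer deterministic.
RULES: dict[tuple[str, str], Rule] = {
    ("q", "document"): document_rule,
    ("q", "link"): link_rule,
    ("q", "link_text"): lambda node, run: Node("text", node.text),
    ("attr", "link_href"): lambda node, run: Node("href", node.text),
}


def transduce(state: str, node: Node) -> Node:
    """Apply the unique rule for (state, node.kind), working from the root down."""
    rule = RULES.get((state, node.kind))
    if rule is None:
        raise LookupError(f"no rule for state {state!r} and kind {node.kind!r}")
    return rule(node, transduce)


source = Node("document", children=[
    Node("link", children=[Node("link_text", "text"), Node("link_href", "link")]),
])
result = transduce("q", source)
print(result)
```

Because the input is consumed from the root down and the output is never revisited, the evaluation order question from the tuple example does not arise in this model.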

If this isn't jumping the gun, a combined pipeline -- which first rewrites (with DTOP + rewriting queries), then formats (i.e., as now, with formatting queries) and, when checking for idempotency, formats again -- may be a viable end goal...

(Special thanks to @nbacquey for his technical input on this 🙏)

¹ Speaking for myself, note that the "complexity limit" is more on the query author than the Topiary engine itself. It becomes intractable to keep in mind all the possible interactions as more simple rewrites are added.
² Another solution would be to only allow one rewrite rule, which could only be applied at most once...but that would be so limiting as to be useless!

ctdunc · 2025-03-11T22:34:53Z

ctdunc
Mar 11, 2025

One possible way out of this rewrite conundrum is to only allow one @rewrite directive per capture, but allow multiple #rewrite! predicates.

Assuming we apply rewrites from the inside out, and top-down (i.e. in the order swap, replace, surround), e.g.:

(
  (tuple
    "(" . (_) @_1 . "," . (_) @_2 . ")"
  ) @rewrite

  (#rewrite! "(\2, \1)") ;swap
  (#rewrite! "(\1, \1)") ;replace
  (#rewrite! "(\1, (\2, \1))") ;surround
)

It becomes clearer (to me at least) that the expected output of the preceding example applied to (a, (b,c)) would be ((c,c), ((c,c), (c,c))).

(a, (b,c)) -> ((c,b), a) -> ((c,c), (c,c)) -> ((c,c), ((c,c), (c,c)))

At the very least, the requirement that (#rewrite!) directives for the same capture should be grouped together seems like a sensible thing to do. This would require us to check alternations to make sure that there is no overlap, which I am not sure of the feasibility of.

1 reply

Xophmeister Mar 12, 2025
Maintainer Author

This would require us to check alternations to make sure that there is no overlap, which I am not sure of the feasibility of.

I believe this is a unification problem...which is also non-trivial 😅

yannham · 2025-03-12T09:10:02Z

yannham
Mar 12, 2025
Maintainer

Random thoughts:

One possibility to force determinism is to just forbid this situation. That is, Topiary would refuse to apply more than one rewrite rule per node, and fail otherwise. One disadvantage of this is that you might only encounter this issue at runtime with a very specific repro case, where the user will be penalized, which isn't great.
A second possibility is to perform some simple static analysis to forbid that at query parsing time. In fact I think when @ctdunc says "to only allow one @rewrite directive per capture", this implies the ability to compute this property. That could be more rigid than the previous dynamic check, but would have the advantage of being a property of the queries alone, and not of their interaction with a particular piece of code (very similar to static typing vs dynamic typing). However, although I'm not a tree-sitter expert, it's not obvious at all that this is easy to determine, as tree-sitter queries are pretty liberal. I haven't thought very hard about the problem though, maybe it's standard stuff (I imagine the decision problem is: given queries A and B, does there exist a subtree t such that both A and B match t? Equivalently, are A and B disjoint, meaning that no such t exists?)
A last point, which is more a question for the @tweag/topiary-core-team : is the multiple rewrite rules case something that you think could happen in practice (that is, something that we want to support at least like @ctdunc proposes), or more like a pathological case that we want to exclude?
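The first option (a dynamic check) could look something like the Python sketch below. It assumes, hypothetically, that the engine can produce a list of (node id, rule name) pairs for all #rewrite! matches before applying any of them; the names `check_single_rewrite` and `OverlappingRewriteError` are made up for illustration.

```python
from collections import defaultdict


class OverlappingRewriteError(Exception):
    """Raised when more than one rewrite rule matches the same node."""


def check_single_rewrite(matches):
    """Return {node_id: rule_name}, or fail if any node has several rules.

    `matches` is an iterable of (node_id, rule_name) pairs; this input
    shape is an assumption, not something Topiary provides today.
    """
    rules_by_node = defaultdict(list)
    for node_id, rule_name in matches:
        rules_by_node[node_id].append(rule_name)

    conflicts = {n: rules for n, rules in rules_by_node.items() if len(rules) > 1}
    if conflicts:
        raise OverlappingRewriteError(conflicts)
    return {n: rules[0] for n, rules in rules_by_node.items()}


# Two different nodes, one rule each: accepted.
print(check_single_rewrite([(1, "swap"), (2, "replace")]))

# The same node matched by two rules: rejected at runtime.
try:
    check_single_rewrite([(1, "swap"), (1, "surround")])
except OverlappingRewriteError as error:
    print("conflict:", error)
```

As noted above, the weakness of this approach is that the conflict only surfaces for inputs that happen to trigger it; the static variant would instead have to decide overlap from the queries alone.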

2 replies

Xophmeister Mar 12, 2025
Maintainer Author

A last point, which is more a question for the @tweag/topiary-core-team : is the multiple rewrite rules case something that you think could happen in practice (that is, something that we want to support at least like @ctdunc proposes), or more like a pathological case that we want to exclude?

I think the example, of using the same rewrite rule three times, is meant to illustrate the pathological case. However, one could imagine a situation where rewrite queries overlap, which I believe brings us back to the same problem.

yannham Mar 12, 2025
Maintainer

I guess my question is: do you expect people to reasonably rely on queries with multiple rewrite rules for the same node? Is that reasonable, or rather an abomination and most likely an error? Depending on the answer, I guess Topiary could make different choices (such as banning it entirely, or accommodating it and giving it a proper semantics). If it's supposed to be rare and mostly a logic error, another possible solution could be to say: rewrites are not commutative and are applied in query definition order (or you could even say the order is unspecified and arbitrarily chosen by Topiary, a bit like undefined behavior or implementation-defined behavior), so please don't do that and please make sure rewrites don't overlap. As a variant of the first naive proposal, we could emit a warning at runtime if someone tries to do that, instead of a hard failure.

However, if it can make sense to have overlapping rewrite rules (or, beyond making sense, if there are cases where making them non-overlapping is technically possible but much harder and more laborious), then Topiary would have to accommodate them and give them a proper semantics.

Xophmeister · 2025-03-12T10:08:54Z

Xophmeister
Mar 12, 2025
Maintainer Author

Another random thought (this is definitely the discussion for them 😅)

XSLT is pretty mature. For tree-to-tree transduction, rather than writing a rewrite engine for Tree-sitter CSTs, we could just implement a Tree-sitter CST to XML convertor -- which should be pretty easy¹ -- then perform the rewrites with XSLT, then serialise the resulting XML back into code, ready for Topiary to format.
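A minimal Python sketch of the CST-to-XML step, using only the standard library. The `CstNode` type is a hypothetical stand-in for a Tree-sitter node, and the sketch assumes that only named nodes are converted, since anonymous node kinds such as "(" are not valid XML element names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class CstNode:
    """Hypothetical stand-in for a named Tree-sitter CST node."""
    kind: str
    text: str = ""
    children: list["CstNode"] = field(default_factory=list)


def to_xml(node: CstNode) -> ET.Element:
    """Map each node to an element named after its kind; leaves keep their text."""
    element = ET.Element(node.kind)
    if node.children:
        for child in node.children:
            element.append(to_xml(child))
    else:
        element.text = node.text
    return element


link = CstNode("link", children=[
    CstNode("link_text", "text"),
    CstNode("link_href", "link"),
])
xml_text = ET.tostring(to_xml(link), encoding="unicode")
print(xml_text)
```

The reverse step, serialising the transformed XML back into source code, is not shown and would need to account for whatever the stylesheet emits.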

¹ The topiary visualise command does something similar, for example, to output JSON.

0 replies
