TRegex DFA DOT data has duplicated state transitions

This is mostly a TRegex issue, but I can induce it with a Ruby regexp and it ties in with the general work around TruffleRuby regexes I've been doing. Making use of TRegex's `DumpAutomata` feature, I have a generated DOT file that duplicates states and makes the graph overall more difficult to read. I do not know if the states are duplicated during execution as well or if this is just an aesthetic issue in the debug data.

Unfortunately, there isn't a way to enable the automata output through standard options passing. For the time being, I've modified 

https://github.com/oracle/truffleruby/blob/ab28c3e48ad478d97e025ee3ad30389f6f695224/src/main/java/org/truffleruby/core/regexp/TRegexCache.java#L175

to be:

```java
String regex = "Flavor=Ruby,DumpAutomata=true,Encoding=" + tRegexEncoding + "/" + processedRegexpSource + "/" + flags;
```

With that change applied and a new TruffleRuby built, I run the following:

```bash
jt ruby -e '10_000.times { "abc" =~ /[[:alnum:]]/ }'
```

Looking at the generated output, we have:

_nfa\_reverse.gv:_

```dot
digraph finite_state_machine {

    node [shape = circle]; "I";
    node [shape = doublecircle]; "F";
    node [shape = circle]; "S(7)[4]";

    "I" -> "S(7)[4]" [ label = "[\\x00-\\xff], p0, 0)(" ];
    "S(7)[4]" -> "F" [ label = ", p0, )(0" ];
}
```

_dfa\_forward:_

```dot
digraph finite_state_machine {
    node [shape = doublecircle]; "S{(7)[4]}";
    node [shape = circle];
    "I^0" -> "S{(1)[3]}" [ label = "" ];
    "S{(1)[3]}" -> "S{(1)[3]}" [ label = "[\\x00-\\x2f\\x3a-\\x40\\x5b-\\x60\\x7b-\\xa9\\xab-\\xb4\\xb6-\\xb9\\xbb-\\xbf\\xd7\\xf7]" ];
    "S{(1)[3]}" -> "S{(7)[4]}" [ label = "[0-9A-Za-z\\xaa\\xb5\\xba\\xc0-\\xd6\\xd8-\\xf6\\xf8-\\xff]" ];
    "I0" -> "S{(1)[3]}" [ label = "" ];
    "S{(1)[3]}" -> "S{(1)[3]}" [ label = "[\\x00-\\x2f\\x3a-\\x40\\x5b-\\x60\\x7b-\\xa9\\xab-\\xb4\\xb6-\\xb9\\xbb-\\xbf\\xd7\\xf7]" ];
    "S{(1)[3]}" -> "S{(7)[4]}" [ label = "[0-9A-Za-z\\xaa\\xb5\\xba\\xc0-\\xd6\\xd8-\\xf6\\xf8-\\xff]" ];
}
```

The best I can surmise is the `DFAGenerator` has two state entries, each with its own transition set. Since the regex pattern doesn't anchor itself to the beginning of the input, two entry states are recorded: one that is anchored (`I^0` in the DFA graph) and one that is not (`I0` in the DFA graph). Aside from that entry state, the rest of the transition sets is identical. The resulting rendered graph is difficult to read as result:

![dfa_forward](https://user-images.githubusercontent.com/12584/132913439-4131a723-b89f-441d-a4bf-e7859e29bd3d.png)

I was initially looking at a more complicated graph and couldn't figure out how to read the edge labels. Looking at the generated _dfa\_forward.gv_ file, I finally realized there were duplicated edges and then came up with this simpler example. In the rendered image you can see the edge from `S{(1)[3]} -> S{(7)[4]}` looks like one really long label. But, it's actually the label for inner edge butting up against the label for the outer edge. Aside from being difficult to read, it can be misleading because I don't think anyone would expect to see two identical edges between two states.

I think the graph would be much easier to read if such duplicate edges were removed. There's a compounding effect here, as well. If the regexp pattern were `/[[:alnum:]]|[[:space:]]/`, you'd see two finish nodes, each with two identical edges coming into it.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TRegex DFA DOT data has duplicated state transitions #2442

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

TRegex DFA DOT data has duplicated state transitions #2442

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions