Skip to content

TRegex DFA DOT data has duplicated state transitions #2442

Open
@nirvdrum

Description

@nirvdrum

This is mostly a TRegex issue, but I can induce it with a Ruby regexp and it ties in with the general work around TruffleRuby regexes I've been doing. Making use of TRegex's DumpAutomata feature, I have a generated DOT file that duplicates states and makes the graph overall more difficult to read. I do not know if the states are duplicated during execution as well or if this is just an aesthetic issue in the debug data.

Unfortunately, there isn't a way to enable the automata output through standard options passing. For the time being, I've modified

String regex = "Flavor=Ruby,Encoding=" + tRegexEncoding + "/" + processedRegexpSource + "/" + flags;

to be:

String regex = "Flavor=Ruby,DumpAutomata=true,Encoding=" + tRegexEncoding + "/" + processedRegexpSource + "/" + flags;

With that change applied and a new TruffleRuby built, I run the following:

jt ruby -e '10_000.times { "abc" =~ /[[:alnum:]]/ }'

Looking at the generated output, we have:

nfa_reverse.gv:

digraph finite_state_machine {

    node [shape = circle]; "I";
    node [shape = doublecircle]; "F";
    node [shape = circle]; "S(7)[4]";

    "I" -> "S(7)[4]" [ label = "[\\x00-\\xff], p0, 0)(" ];
    "S(7)[4]" -> "F" [ label = ", p0, )(0" ];
}

dfa_forward:

digraph finite_state_machine {
    node [shape = doublecircle]; "S{(7)[4]}";
    node [shape = circle];
    "I^0" -> "S{(1)[3]}" [ label = "" ];
    "S{(1)[3]}" -> "S{(1)[3]}" [ label = "[\\x00-\\x2f\\x3a-\\x40\\x5b-\\x60\\x7b-\\xa9\\xab-\\xb4\\xb6-\\xb9\\xbb-\\xbf\\xd7\\xf7]" ];
    "S{(1)[3]}" -> "S{(7)[4]}" [ label = "[0-9A-Za-z\\xaa\\xb5\\xba\\xc0-\\xd6\\xd8-\\xf6\\xf8-\\xff]" ];
    "I0" -> "S{(1)[3]}" [ label = "" ];
    "S{(1)[3]}" -> "S{(1)[3]}" [ label = "[\\x00-\\x2f\\x3a-\\x40\\x5b-\\x60\\x7b-\\xa9\\xab-\\xb4\\xb6-\\xb9\\xbb-\\xbf\\xd7\\xf7]" ];
    "S{(1)[3]}" -> "S{(7)[4]}" [ label = "[0-9A-Za-z\\xaa\\xb5\\xba\\xc0-\\xd6\\xd8-\\xf6\\xf8-\\xff]" ];
}

The best I can surmise is the DFAGenerator has two state entries, each with its own transition set. Since the regex pattern doesn't anchor itself to the beginning of the input, two entry states are recorded: one that is anchored (I^0 in the DFA graph) and one that is not (I0 in the DFA graph). Aside from that entry state, the rest of the transition sets is identical. The resulting rendered graph is difficult to read as result:

dfa_forward

I was initially looking at a more complicated graph and couldn't figure out how to read the edge labels. Looking at the generated dfa_forward.gv file, I finally realized there were duplicated edges and then came up with this simpler example. In the rendered image you can see the edge from S{(1)[3]} -> S{(7)[4]} looks like one really long label. But, it's actually the label for inner edge butting up against the label for the outer edge. Aside from being difficult to read, it can be misleading because I don't think anyone would expect to see two identical edges between two states.

I think the graph would be much easier to read if such duplicate edges were removed. There's a compounding effect here, as well. If the regexp pattern were /[[:alnum:]]|[[:space:]]/, you'd see two finish nodes, each with two identical edges coming into it.

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions