Question about the Java grammars #4530

revusky · 2025-07-06T14:46:32Z

This repository contains various Java grammars: java/java8, java/java9, java/java20 and java/java. I was playing around with them and discovered that the only one that is really usable in terms of performance is java/java. That one seems to be at least 30 or 40 times faster than the other Java grammars. (I mean, obviously, the parsers generated from them!) Indeed, other README files refer to the java/java grammar as the optimized Java grammar.

What I would love for somebody to explain to me is what the specific diffs are that make the optimized grammar so much faster!!?? (Or conversely, why are the other grammars so much slower?) Surely, this is crying out for some explanation, no?

kaby76 · 2025-07-07T12:44:10Z

This is an excellent question!

The java/java/ grammar is the "optimized" grammar for Java; the other three are derived from the Java Language Specification (JLS). The java/java/ grammar was written long ago by Parr to demonstrate the then "new" Antlr4 capabilities. I don't know whether he started from the JLS, but it seems plausible. Unfortunately, the refactorings performed to optimize the grammar were never documented. I've been trying to reverse-engineer the refactorings, but it's not complete, and I haven't worked on that for a long time.

But, generally, the optimizations fall into several broad categories.

Ambiguity: Antlr, by design, supports parsing given an ambiguous grammar. This allows one to have a functioning parser for a problem that only requires a simple, single application. Disambiguating a grammar is a time-consuming process. The IntelliJ Antlr plugin shows ambiguities.
Left-factoring: Typically, when one writes an LL grammar, it will need to be refactored for common left-factors. The IntelliJ Antlr plugin shows max-k's for grammars that have common left factors.
Small tree rewrites for expressions: Antlr allows a concise parse tree representation for expressions for rules that fit a specific pattern. This reduces the tree size and decreases parse time. IntelliJ doesn't have a metric to identify these problems, but you can "see" the problem if you display the parse tree and see long chains of nodes containing a single child.
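The "long chains of nodes containing a single child" symptom can also be measured rather than eyeballed. The following is a minimal sketch under stated assumptions: `Node` and `ChainMetric` are hypothetical types invented for illustration (a real ANTLR parse tree would be walked through its own API), and the rule names in `main` are made up.

```java
import java.util.List;

// Hypothetical minimal tree node, not an ANTLR class.
record Node(String label, List<Node> children) {
    static Node of(String label, Node... children) {
        return new Node(label, List.of(children));
    }
}

final class ChainMetric {
    // Length of the longest run of consecutive nodes that each have exactly one child.
    static int longestSingleChildChain(Node root) {
        return walk(root, 0);
    }

    private static int walk(Node node, int runSoFar) {
        // A node with exactly one child extends the current run; anything else resets it.
        int run = node.children().size() == 1 ? runSoFar + 1 : 0;
        int best = run;
        for (Node child : node.children()) {
            best = Math.max(best, walk(child, run));
        }
        return best;
    }

    public static void main(String[] args) {
        // One rule per precedence level (made-up rule names): a deep chain for a single literal.
        Node deep = Node.of("expression", Node.of("additive", Node.of("multiplicative",
                Node.of("unary", Node.of("primary", Node.of("7"))))));
        // The same literal under a single flattened expression rule.
        Node flat = Node.of("expression", Node.of("7"));
        System.out.println(longestSingleChildChain(deep));
        System.out.println(longestSingleChildChain(flat));
    }
}
```

A large value on typical input suggests the expression rules are a candidate for the concise form described above.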

1 reply

revusky Jul 7, 2025
Author

Hi Ken. Thanks for the answer. It is an intriguing situation. But, I mean, surely you agree that it is not just that the optimized grammar is faster. It's like at least 30 or 40 times faster! A priori it seems to me like it would be very surprising that you could get even 30 or 40% improvement. One would tend to think that the authors of the "unoptimized" grammars must be making some incredible mistakes, no? Otherwise...

By the way, the Java grammar that is part of the parser generator project that is mostly my fault, CongoCC, can be eyeballed here. As far as I can tell, this is a very precise implementation of the JLS. I mention this in case anybody is interested. Oh, to be clear, I did not pose the above question with the ulterior motive of plugging my work. Really, I didn't. (Though anybody can believe me or not! LOL.) I really would just love to understand how the optimized grammar can be that much faster. That there is no explanation of this is in itself quite striking! I mean, surely, this case would be an object lesson in which usage patterns to avoid when using ANTLR!

The aforementioned CongoCC grammar generates a parser that is just about twice as fast as the "optimized" ANTLR grammar. Anybody who wants to play with it can try the following magical incantation:

  git clone https://github.com/congo-cc/congo-parser-generator congo
  cd congo
  ant
  cd examples/java
  ant
  java JParse -s

It can run faster if you do it multi-threaded. So you could also try java JParse -p -q -s where the -p runs the test harness multi-threaded (and the -q is for quieter output.)

kaby76 · 2025-07-07T21:12:55Z

Indeed, it is often the case that an unoptimized grammar can be 1 to 2 orders of magnitude slower than the optimized grammar. This is because the Antlr parser engine is a straightforward implementation of an NFA graph interpreter with some DFA graph caching. Many more DFA states are computed for ambiguous and non-left-factored grammars than for optimized grammars. But, as you say, this isn't the whole story.

Even for optimized grammars, parsing will still be slow at first because the interpreter starts with an empty DFA graph. AdaptivePredict() is the code that computes a choice at each point in the parse where a choice has to be made, that is, wherever there is a |-operator in the grammar. Given the upcoming tokens in the input and the NFA graph, it computes the sub-graph to add to the DFA. This computation is very expensive.
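The cold-start cost can be pictured with a toy cache. This is a minimal sketch, not Antlr's data structure or algorithm: `LazyPredictionCache`, its `simulate` stand-in, and the rule it applies are all hypothetical, and a real DFA is a graph of states rather than a map keyed on whole lookahead sequences.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: NOT how Antlr represents or builds its DFA.
final class LazyPredictionCache {
    private final Map<List<String>, Integer> cache = new HashMap<>();
    int slowPathCalls = 0;

    // Returns the index of the alternative to take for the given lookahead tokens.
    int predict(List<String> lookahead) {
        return cache.computeIfAbsent(lookahead, this::simulate);
    }

    // Stand-in for the expensive grammar simulation; the rule used here
    // (alternative 0 if the first token is "class") is made up.
    private int simulate(List<String> lookahead) {
        slowPathCalls++;
        return lookahead.get(0).equals("class") ? 0 : 1;
    }

    public static void main(String[] args) {
        LazyPredictionCache decision = new LazyPredictionCache();
        List<String> tokens = List.of("class", "Foo", "{");
        decision.predict(tokens); // cold: runs the simulation and stores the result
        decision.predict(tokens); // warm: served from the cache
        System.out.println("slow-path calls: " + decision.slowPathCalls);
    }
}
```

The point of the sketch is only that the first encounter with a given lookahead pays the full cost, while repeats are lookups.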

There has been some discussion of preloading the DFA cache before parsing: antlr/antlr4#3682. This would speed up parse times, but it is unclear by how much in a complete parse, as it would be offset by the time required to preload the cache. Also, in my opinion, the data structures are not the best representation of the DFA graph.

For the Java grammars, I think it is essential that grammars-v4/java/* be reorganized and documented more clearly. It's okay to provide unoptimized grammars, but people expect an optimized grammar and expect it to perform reasonably well.

I added code that tests for and reports ambiguity whenever a grammar changes, but most contributors don't look at the output.

6 replies

kaby76 Jul 15, 2025

> Well, I don't know how to put this diplomatically, Ken, but why on earth would anybody use this parser of yours?

First, the grammars here are not mine and Antlr is not my work.

The grammars here are the contributions of people who offer no promises of correctness or performance. I wouldn't have any expectations. And certainly, I would not recommend using Antlr4 or any of the grammars in a software product; there are many issues in the design of Antlr4 and in the ambiguity of the grammars.

What is Antlr4 good for? Antlr4 sprang out of a research project on ALL(*). The purpose of Antlr4 was to:

Accept grammars with ambiguity and direct left recursion;
Show that DFA memoization in ALL(*) would give reasonable parse times;
Explore ways to display problems (e.g., ambiguity) in a grammar.

ALL(*) is just another technique in the large number of parsing algorithms.

The purpose and requirements for CongoCC are not the same as for Antlr4. In particular:

It does not flag grammar ambiguity. E.g., Expr : <NUMBER> | B; B : <NUMBER>; => "take the first alt."
It does not accept direct or indirect left recursion. E.g., Expr : Expr <NUMBER> | <NUMBER>; results in a stack overflow in the parser generator and no explanation as to why.

KvanTTT Jul 17, 2025
Collaborator

> It is rather surprising that the bug was reported nearly 3 years ago and you guys never heard back about it.

Feel free to raise an issue and fix the bug.

> As for the note I am replying to, when you refer to "DFA graph caching", you are talking about what is typically called (by ANTLR people) the ATN ("Augmented Transition Network") apparently. No?

ATN and DFA are different things in ANTLR. The ATN is a static representation of a grammar; the DFA is a dynamic structure that is built during parsing. The longer the parser runs, the faster parsing becomes (in theory).

> I have noted that very long runs of any of the parsers require frightening amounts of memory. For example, the parser generated from this "optimized" grammar can run over and parse the entire src.zip from JDK 8. However, it needs about 400 megs of heap space to do it. (I verified this by experimenting with different values of -Xmx... when launching the test harness.) The parser that is mostly my work does the same task perfectly well with 20 megs of heap, which is a pretty dramatic difference.

ANTLR is a tool that is far from ideal for optimal parsing in terms of both performance and memory.

> Well, I don't know how to put this diplomatically, Ken, but why on earth would anybody use this parser of yours?

Nobody is obliged to use this grammar or any other grammar in the repository. And it's not Ken's grammar, as he already explained. ANTLR and its grammars are nonprofit projects and it's quite strange to see claims like this.

revusky Jul 17, 2025
Author

> > It is rather surprising that the bug was reported nearly 3 years ago and you guys never heard back about it.

> Feel free to raise an issue and fix the bug.

But I just did, didn't I? I told you that the grammar in question does not take into account that the last record component can be a varargs. So, rather than have:

   recordComponent
     : typeType identifier
     ;

You need something like:

   recordComponent
     : typeType ('...')? identifier
     ;

But that's not a complete solution. If a recordComponent has a varargs, it must be the last one. It seems the simplest solution would be to require that the token after identifier NOT be ',' if there was a '...'. But I don't know how to specify that in ANTLR. Anyway, I reported the issue and told you how you could fix it. (Shrug.)
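Independent of how the constraint is expressed in a grammar, the rule itself is small. A minimal sketch in Java, assuming the parser has already produced a list of components: `RecordComponent` and `RecordHeaderCheck` are hypothetical names for illustration and are not part of any grammar or tool discussed here.

```java
import java.util.List;

// Hypothetical model of one parsed record component.
record RecordComponent(String type, boolean varargs, String name) {}

final class RecordHeaderCheck {
    // True if no component other than the last one is variable-arity.
    static boolean varargsOnlyInLastPosition(List<RecordComponent> components) {
        for (int i = 0; i < components.size() - 1; i++) {
            if (components.get(i).varargs()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // record Ok(int a, String... rest) {}
        List<RecordComponent> ok = List.of(
                new RecordComponent("int", false, "a"),
                new RecordComponent("String", true, "rest"));
        // record Bad(String... rest, int a) {}
        List<RecordComponent> bad = List.of(
                new RecordComponent("String", true, "rest"),
                new RecordComponent("int", false, "a"));
        System.out.println(varargsOnlyInLastPosition(ok));
        System.out.println(varargsOnlyInLastPosition(bad));
    }
}
```

The same test could run as a check during parsing or in a pass over the finished tree; the sketch takes no position on which.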

> > As for the note I am replying to, when you refer to "DFA graph caching", you are talking about what is typically called (by ANTLR people) the ATN ("Augmented Transition Network") apparently. No?

> ATN and DFA are different things in ANTLR.

Okay, I stand corrected. I honestly just don't know. I never looked into this.

> The ATN is a static representation of a grammar; the DFA is a dynamic structure that is built during parsing. The longer the parser runs, the faster parsing becomes (in theory).

Ah, I see. So that is what the -2x option in the test harness is for: it runs over the files a second time. The second time should take less time. Or to express it differently, running twice should take less than twice as long as running once.

I was curious and ran the experiment, parsing over the JDK 8 src.zip files, as in:

   java Test ~/jdksrc/8

and it successfully parsed it all and took about 85 seconds.

  Total lexer+parser time 84924ms.

I then ran:

    java Test -2x ~/jdksrc/8

    Total lexer+parser time 74411ms.

The latter total is pretty obviously for the second pass. So, you're right. The second pass is faster, about 14% faster than the first one.
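For reference, the arithmetic behind that figure, using only the two totals reported above. The class and method names are made up for illustration. Measured as throughput the second pass is about 14.1% faster; measured as wall-clock time saved it is about 12.4%.

```java
final class WarmupSpeedup {
    // Fraction of wall-clock time saved by the second pass.
    static double timeSaved(long firstMs, long secondMs) {
        return (double) (firstMs - secondMs) / firstMs;
    }

    // Throughput gain of the second pass relative to the first.
    static double speedup(long firstMs, long secondMs) {
        return (double) firstMs / secondMs - 1.0;
    }

    public static void main(String[] args) {
        long first = 84924;  // first pass, from the run above
        long second = 74411; // second pass with -2x, from the run above
        System.out.printf("time saved: %.1f%%%n", 100 * timeSaved(first, second));
        System.out.printf("speedup:    %.1f%%%n", 100 * speedup(first, second));
    }
}
```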

Well, okay, that's just one datum, I grant. One's mileage can vary. Do you know offhand of a case where the speed improvement is dramatically better than that? Because 14% barely looks like it's worth concerning oneself with. Or maybe the great speed gains come after a much longer run than this. Maybe 17,000 source files is just not a sufficient warm-up. (Yeah, right...)

If I run my own Java parser (the one that is built into the CongoCC project), as in:

   java JParse ~/jdksrc/8

and I get:

   Parsed 17676 files successfully
   Failed on 0 files

   Duration: 51254 milliseconds

Of course, there would not be any speed improvement running it twice. So it would take something like 100 seconds. (I didn't even check, because there is no caching going on that would make it any faster!) On the other hand, the two runs of the "optimized" Java parser provided by the ANTLR community take about 160 seconds.

> > I have noted that very long runs of any of the parsers require frightening amounts of memory. For example, the parser generated from this "optimized" grammar can run over and parse the entire src.zip from JDK 8. However, it needs about 400 megs of heap space to do it. (I verified this by experimenting with different values of -Xmx... when launching the test harness.) The parser that is mostly my work does the same task perfectly well with 20 megs of heap, which is a pretty dramatic difference.

> ANTLR is a tool that is far from ideal for optimal parsing in terms of both performance and memory.

> > Well, I don't know how to put this diplomatically, Ken, but why on earth would anybody use this parser of yours?

> Nobody is obliged to use this grammar or any other grammar in the repository.

Well, okay. Thanks for telling me that. (What a relief!)

> And it's not Ken's grammar, as he already explained. ANTLR and its grammars are nonprofit projects and it's quite strange to see claims like this.

What claims? I never "claimed" that Ken was the author of the grammar. When I said "your" grammar, I meant "you" collectively. But regardless, what difference does it even make? If little green men from Mars wrote this stuff, whatever objective facts I outline above would be the same, no?

The other problem with this grammar (and the other ones, to only a somewhat lesser extent) is the extent to which they are incorrect! But I'll write a note about that separately.

kaby76 Jul 21, 2025

I added a PR to address a couple of issues mentioned here. #4556 changes the recordComponent rule to conform with JLS 24. The PR outlines the refactorings on the JLS EBNF to derive the version presented in the PR. Even with the change, the grammar is far behind the current JLS. It should be scraped from the Spec, not adjusted piecemeal.

The PR also adds a static semantic check for varargs in a recordComponentList. While there is a test input to exercise the record type, I didn't add a negative test. But, it should be included.

Some have stated that static semantic checks should be implemented in a phase after parsing. While the varargs/recordComponentList check could be performed after a parse, disambiguating predicates would be harder to implement given that Antlr4 outputs a single parse tree, too late to correct a bad choice in the parse.

The PR also adds a script for performance/validation testing of the grammar-generated parser against JDK version 8. The script downloads the test files, builds a parser app (testing templates for Java), and performs the testing three times, which is particularly important on Windows with anti-virus software. The script doesn't call Octave to compute statistics and output a graph on performance, but it should.

revusky Jul 25, 2025
Author

> #4556 changes the recordComponent rule to conform with JLS 24.

Hi Ken. It's good that you're working on improving things. I'm perfectly happy to help you a bit, even though this is ostensibly a competing product. Really, I am. In fact, as a result of this conversation, I made a couple of improvements to my own Java grammar/parser, which I'll explain below.

Now, what you say above is a slight inaccuracy. The fact that a record component (the last one only) can be varargs has been the case since JDK 16, when the feature was introduced. I actually just looked it up out of curiosity. If you look at the holy writ here on page 328 of the JLS for JDK 16, you can see the text:

> A record component may be a variable arity record component, indicated by an ellipsis following the type. At most one variable arity record component is permitted for a record class. It is a compile-time error if a variable arity record component appears anywhere in the list of record components except the last position.

That is from the JDK 16 spec, the first version in which records were a final (non-preview) feature, released in early 2021. But whatever. Better late than never. Well, okay, maybe you're aware of all that, but what you wrote could be interpreted as meaning that this issue relates specifically to JDK 24, which it does not.

This business of the varargs record component is... well, it's easy to make oversights like that. There is a more general problem with the Java grammar we're talking about, namely that it does not incorporate some of the most basic knowledge of what a valid statement in the Java language is. For example, this grammar seems to "think" that this is a valid statement in Java:

   x;

Or...

   2+2;

(shrug)

In fact, I recently put together a list of different things that the Java parser we can generate in Python will accept. (The one in Java does not accept these things, but I was narrowing in on issues with the one generated in Python.) I posted some sample code to illustrate what I was talking about:

public class Foobar {
    void foo() {
       foobar()++;
       ++this;
       x = 7++;
       this = 7;
       7++;
       -8;
       x + 3;
       (x()++);
       x?y:z = t;
    }
}

I suppose it should jump out at one that every last line in the above foo() method is invalid. Of course, x; is not a valid statement in Java, the bottom line being that for an expression to be a valid statement (when you tack a semicolon on the end, obviously!) that expression must be one of:

An assignment
A method (or explicit constructor) invocation
The instantiation of an object. (Specifically a reference object.)

Thus x=7; is a valid statement. So is x(); and so is new x();

And, of course x++ is a valid statement since it is just shorthand for x = x+1;.
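The rule can be written down as a simple membership test on the outermost form of the expression. This is a minimal sketch, assuming the parser can already classify that outermost form: `ExprKind` and `StatementExpressionCheck` are hypothetical names, not taken from any grammar in this thread. It checks only the outermost form, so whether an operand such as the 7 in 7++ is legal is a separate check.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical labels for the outermost form of a parsed expression.
enum ExprKind {
    ASSIGNMENT, PRE_INCREMENT, PRE_DECREMENT, POST_INCREMENT, POST_DECREMENT,
    METHOD_INVOCATION, CLASS_INSTANCE_CREATION,
    NAME, LITERAL, BINARY, UNARY_MINUS, CONDITIONAL, PARENTHESIZED, ARRAY_CREATION
}

final class StatementExpressionCheck {
    // Expression kinds that may stand alone as a statement when followed by ';'.
    private static final Set<ExprKind> ALLOWED = EnumSet.of(
            ExprKind.ASSIGNMENT,
            ExprKind.PRE_INCREMENT, ExprKind.PRE_DECREMENT,
            ExprKind.POST_INCREMENT, ExprKind.POST_DECREMENT,
            ExprKind.METHOD_INVOCATION,
            ExprKind.CLASS_INSTANCE_CREATION);

    static boolean isValidStatementExpression(ExprKind kind) {
        return ALLOWED.contains(kind);
    }

    public static void main(String[] args) {
        System.out.println(isValidStatementExpression(ExprKind.ASSIGNMENT));        // x = 7;
        System.out.println(isValidStatementExpression(ExprKind.METHOD_INVOCATION)); // x();
        System.out.println(isValidStatementExpression(ExprKind.NAME));              // x;
        System.out.println(isValidStatementExpression(ExprKind.BINARY));            // 2+2;
    }
}
```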

Anyway, the Java parser generated from this Java grammar parses the above code without complaint. The main problem can be traced to this line. Whoever wrote that seems to believe that any Java expression is a valid statement if you tack a semicolon onto it. (Or they just think that's "close enough for government work.")

Of course, a statement can be an assignment, but the LHS of the assignment still must be something you can assign a value to. Thus, x() = y; is not a valid statement. I just tried the following line:

   Foo::bar = 23;

and your (I know it's not "yours" individually) parser just "parses" it with no complaint. The CongoCC parser generates:

 Encountered an error at Foo.java:3:9
 Assertion at: Java.ccc:1083:7 failed. Expression Foo::bar cannot be assigned to.
    at Foo.java:3:22 in StatementExpression(Java.ccc:1083:7,JavaParser.java:8045)
    at Foo.java:3:9 in ExpressionStatement(Java.ccc:1124:23,JavaParser.java:8089)
    at Foo.java:3:9 in Statement(Java.ccc:1000:3,JavaParser.java:7585)
    at Foo.java:3:9 in BlockStatement(Java.ccc:1043:3,JavaParser.java:7779)
    at Foo.java:3:9 in Block(Java.ccc:1026:50,JavaParser.java:7687)
    at Foo.java:2:32 in MethodDeclaration(Java.ccc:500:5,JavaParser.java:3216)
    at Foo.java:2:5 in ClassOrInterfaceBodyDeclaration(Java.ccc:427:3,JavaParser.java:2767)
    at Foo.java:2:5 in ClassOrInterfaceBody(Java.ccc:418:54,JavaParser.java:2706)
    at Foo.java:1:12 in ClassDeclaration(Java.ccc:282:3,JavaParser.java:1586)
    at Foo.java:1:1 in TypeDeclaration(Java.ccc:218:5,JavaParser.java:1417)
    at Foo.java:1:1 in CompilationUnit(Java.ccc:107:5,JavaParser.java:1010)
    at Foo.java:1:1 in Root(Java.ccc:34:4,JavaParser.java:414)
Parse failed on: Foo.java
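The kind of check behind an error like the one above can be sketched in a few lines. This is not CongoCC's implementation, which is not shown in this thread; the node types below are hypothetical and heavily simplified. The assumption is only that the left-hand side of an assignment must denote a variable: a name, a field access, or an array access.

```java
// Hypothetical, heavily simplified expression nodes for illustration.
interface Expr {}
record Name(String id) implements Expr {}
record FieldAccess(Expr target, String field) implements Expr {}
record ArrayAccess(Expr array, Expr index) implements Expr {}
record MethodCall(String name) implements Expr {}
record MethodRef(String type, String method) implements Expr {}

final class AssignmentTargetCheck {
    // Only names, field accesses and array accesses denote variables.
    static boolean isAssignable(Expr lhs) {
        return lhs instanceof Name || lhs instanceof FieldAccess || lhs instanceof ArrayAccess;
    }

    public static void main(String[] args) {
        System.out.println(isAssignable(new Name("x")));               // x = ...
        System.out.println(isAssignable(new MethodCall("x")));         // x() = ...
        System.out.println(isAssignable(new MethodRef("Foo", "bar"))); // Foo::bar = ...
    }
}
```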

If you give it the line: x;, it generates:

Assertion at: Java.ccc:1092:7 failed. Expression at Foo.java:3:9 is not a valid statement.
Expecting a method call, an assignment or a expression that calls an object constructor (i.e. 
new Foobar(...) or an explicit constructor invocation)
    at Foo.java:3:10 in StatementExpression(Java.ccc:1092:7,JavaParser.java:8051)
    at Foo.java:3:9 in ExpressionStatement(Java.ccc:1124:23,JavaParser.java:8089)
    at Foo.java:3:9 in Statement(Java.ccc:1000:3,JavaParser.java:7585)
    at Foo.java:3:9 in BlockStatement(Java.ccc:1043:3,JavaParser.java:7779)
    at Foo.java:3:9 in Block(Java.ccc:1026:50,JavaParser.java:7687)
    at Foo.java:2:32 in MethodDeclaration(Java.ccc:500:5,JavaParser.java:3216)
    at Foo.java:2:5 in ClassOrInterfaceBodyDeclaration(Java.ccc:427:3,JavaParser.java:2767)
    at Foo.java:2:5 in ClassOrInterfaceBody(Java.ccc:418:54,JavaParser.java:2706)
    at Foo.java:1:12 in ClassDeclaration(Java.ccc:282:3,JavaParser.java:1586)
    at Foo.java:1:1 in TypeDeclaration(Java.ccc:218:5,JavaParser.java:1417)
    at Foo.java:1:1 in CompilationUnit(Java.ccc:107:5,JavaParser.java:1010)
    at Foo.java:1:1 in Root(Java.ccc:34:4,JavaParser.java:414)
Parse failed on: Foo.java

Well, I improved the error messages in the last few days as a result of this conversation. Also, I realized that, until earlier today, it was accepting:

     new int[7];

as a valid statement, which it isn't, of course. This was based on a slip on my part, where I took the notion that any object instantiation is a valid statement, but that does not include an array creation, as above. Again,

   new X();

is a valid statement because it is calling class X's constructor (which could be an empty do-nothing, but never mind, it's still invoking it...)

I suppose you could argue that one could accept all of this screwy input and then walk the tree afterwards and identify these things. Or maybe have a "listener" or whatever. That is true, I suppose, but it does kind of beg the question of what you think that a Java parser is supposed to do!!??

I mean, let's face it. For any developer out there, it takes a non-negligible amount of work to get this thing working and integrate it into their overall project or toolstack. Why would somebody make that effort to integrate a Java parser that does not incorporate some of the most basic knowledge about what is valid in Java, like what is a valid statement, for example...

But, anyway, as a result of this conversation, I actually improved our tool a bit. I would be happy to have a friendly competition as to who can develop the best Java parser. No need for it to be acrimonious. And, after all, the end user can only benefit from such a rivalry.

Are you guys up to it?
