Dmitrii Chechetkin edited this page Oct 7, 2019 · 4 revisions

Welcome to the linker-parser wiki!

Quickstart guide

In this quick start guide, we will develop a simple parser for key-value configuration files. To do so, we first analyze the grammar that our configuration files will support and then implement it as Java classes.

Global interface as grammar namespace

It is good practice to define a global interface for all tokens that will be part of our grammar. Let's use the empty ConfigurationToken interface as that namespace:

package com.example;
import com.onkiup.linker.parser.Rule;
public interface ConfigurationToken extends Rule {

}

Analyzing future grammar

To keep our example simple, we define a configuration file as a set of lines, where each line is one of the following: a configuration entry with a key and a value separated by the equals character, a comment that starts with the "#" character, or an empty line. Ambiguities like this, where the grammar allows several alternatives at the same position, are called grammar junctions, and Linker-Parser expects us to define them as interfaces. So let's define ConfigurationLine as the interface for lines in our configuration files:

package com.example;

public interface ConfigurationLine extends ConfigurationToken {

}

... after which we can define a Rule class that will represent a single configuration file:

package com.example;
public class ConfigurationFile implements ConfigurationToken {
    private ConfigurationLine[] lines;
}

Implementing special cases

Now we need to tell our parser what specific forms our configuration lines can take. We can handle empty lines simply by adding the @IgnoreCharacters annotation to our ConfigurationFile rule. This instructs the parser to ignore such lines altogether:

... 
@IgnoreCharacters(" \r\t\n")
public class ConfigurationFile implements ConfigurationToken {
...

The other two variations each need a rule class of their own. First, comment lines:

package com.example;

public class CommentLine implements ConfigurationLine {
    private static final String MARKER = "#";

    @CapturePattern(until="\n")
    private String comment;

}

In this class we used the private static final String MARKER field to instruct the parser to match the string constant "#" as the first element of the token. If the parser tests CommentLine against a key-value line before it tries the ValueLine rule, CommentLine will fail to match because the first non-whitespace character on such a line does not equal the value of this static field. We then used the CapturePattern annotation with the until parameter on the comment field, which instructs the parser to match any characters up to the newline character, so any text after the "#" literal and up to the end of the line is captured into the field.
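To make the matching order described above concrete, here is a hand-written approximation in plain Java. It is only a sketch of the idea (match the literal first, then capture up to the newline) and not Linker-Parser's actual implementation. The class and method names are hypothetical, and it assumes that the newline itself is not part of the captured text and that the end of input also ends the capture.

```java
// Hand-written approximation of how a comment line is matched.
// CommentLineSketch and parseComment are hypothetical names, not part of Linker-Parser.
class CommentLineSketch {
    static final String MARKER = "#";

    // Returns the comment text, or null if the input does not start with MARKER.
    static String parseComment(String input) {
        // Step 1: the literal must match first, otherwise this alternative fails.
        if (!input.startsWith(MARKER)) {
            return null;
        }
        // Step 2: capture everything up to (assumed: not including) the newline.
        int end = input.indexOf('\n', MARKER.length());
        if (end < 0) {
            end = input.length(); // assumption: end of input also terminates the capture
        }
        return input.substring(MARKER.length(), end);
    }

    public static void main(String[] args) {
        System.out.println(parseComment("# server settings\nport = 8080"));
        System.out.println(parseComment("port = 8080"));
    }
}
```

When the first alternative returns null, a junction would move on to the next implementation of ConfigurationLine, which is the role ValueLine plays below.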

package com.example;

@IgnoreCharacters(" \r\t\n")
public class ValueLine implements ConfigurationLine {
  @CapturePattern("[a-zA-Z_$][a-zA-Z0-9_$]*")
  private String key;
  private static final String EQUALS = "=";
  @CapturePattern(until="\n")
  private String value;
}

In this class we again used the @IgnoreCharacters annotation to ignore whitespace characters, this time inside the line itself (whitespace characters at the beginning of a line are already ignored because of the similar annotation on the ConfigurationFile rule). Then we used the CapturePattern annotation, but this time we provided a regular expression that specifies which characters can go into the key field. Please note that only character groups can be used here: terminal-based expressions are not supported by the matching algorithm, because a CapturePattern expression must make Matcher::lookingAt return true for the matcher to work. We used a static field to match another terminal token ("=") and then another CapturePattern annotation to match the configuration value up to the end of the line.
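The Matcher::lookingAt requirement is easier to see in isolation. The following sketch uses only java.util.regex and the key pattern from ValueLine. The class and method names are hypothetical and it does not call Linker-Parser. It shows that lookingAt succeeds only when the pattern matches at the very start of the input, without requiring the whole input to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates java.util.regex.Matcher::lookingAt with the key pattern from ValueLine.
// KeyPrefixDemo and matchedPrefix are hypothetical names, not part of Linker-Parser.
class KeyPrefixDemo {
    static final Pattern KEY = Pattern.compile("[a-zA-Z_$][a-zA-Z0-9_$]*");

    // Returns the prefix of input matched by KEY, or null if input does not start with a match.
    static String matchedPrefix(String input) {
        Matcher matcher = KEY.matcher(input);
        // lookingAt() anchors the match at the start of the input but, unlike matches(),
        // does not require the entire input to be consumed.
        return matcher.lookingAt() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchedPrefix("timeout = 30")); // prints "timeout"
        System.out.println(matchedPrefix("# a comment"));  // prints "null"
    }
}
```

The first call stops at the space because it is outside the character class; the second fails immediately because "#" cannot start a key.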

Using the parser

Once the grammar classes are defined, we can use them to create a new instance of the TokenGrammar class by invoking the TokenGrammar::forClass method:

TokenGrammar<ConfigurationFile> grammar = TokenGrammar.forClass(ConfigurationFile.class);

The resulting TokenGrammar instance can then be used to parse our configuration files by passing the source name and a Reader to its parse method:

FileReader reader = new FileReader("somefile");
ConfigurationFile config = grammar.parse("somefile", reader);

Handling trailing whitespace

The @IgnoreCharacters annotation has one important limitation: it only ignores leading characters, that is, characters located before a subtoken of the annotated token. As a result, trailing whitespace at the end of the parser input requires special handling. For example, the grammar we have created so far will fail to parse any configuration file that ends with empty lines or with lines consisting entirely of whitespace characters. To fix that, we need to pass the list of trailing characters to ignore to TokenGrammar::ignoreTrailingCharacters:

grammar.ignoreTrailingCharacters(" \t\r\n");
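Conceptually, ignoring trailing characters amounts to discarding a run of ignorable characters at the very end of the input, where no further token follows them. The plain-Java sketch below illustrates that idea only. How Linker-Parser implements it internally is not covered here, and the class and method names are hypothetical.

```java
// Conceptual illustration of ignoring trailing characters at the end of the input.
// TrailingCharactersDemo and dropTrailing are hypothetical names, not part of Linker-Parser.
class TrailingCharactersDemo {
    // Removes every character from the end of input that appears in ignored.
    static String dropTrailing(String input, String ignored) {
        int end = input.length();
        // Walk backwards while the last remaining character is in the ignore list.
        while (end > 0 && ignored.indexOf(input.charAt(end - 1)) >= 0) {
            end--;
        }
        return input.substring(0, end);
    }

    public static void main(String[] args) {
        String source = "key = value\n\n  \t\n";
        // Only the trailing run is dropped; whitespace inside the line is untouched.
        System.out.println("[" + dropTrailing(source, " \t\r\n") + "]"); // prints "[key = value]"
    }
}
```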