Home
Welcome to the linker-parser wiki!
In this quick start guide, we will develop a simple parser for key-value configuration files. In order to do so, we first need to analyze the grammar that will be supported in our configuration files and then implement it as Java classes.
It is good practice to define a common interface for all tokens that will be part of our grammar. Let's use an empty ConfigurationToken interface as that namespace:
package com.example;
import com.onkiup.linker.parser.Rule;
public interface ConfigurationToken extends Rule {
}
To keep our example simple, we define a configuration file as a set of lines, where each line is either a configuration entry with a key and a value separated by the equals character, a comment that starts with the "#" character, or an empty line. Points like this, where a token can take one of several forms, are called grammar junctions, and Linker-Parser expects us to define them as interfaces, so let's define ConfigurationLine as the interface for lines in our configuration files:
package com.example;
public interface ConfigurationLine extends ConfigurationToken {
}
... after which we can define a Rule class that will represent a single configuration file:
package com.example;
public class ConfigurationFile implements ConfigurationToken {
private ConfigurationLine[] lines;
}
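To make the format concrete, here is a hypothetical input of the shape described above. The keys and values are made up purely for illustration; it contains comment lines, key-value lines, and an empty line:

```
# connection settings
host = example.org
port = 8080

# retry policy
retry_count = 3
```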
Now we need to tell our parser what specific forms our configuration lines can take. We can handle empty lines simply by adding the @IgnoreCharacters annotation to our ConfigurationFile rule. This instructs the parser to skip such lines altogether:
...
@IgnoreCharacters(" \r\t\n")
public class ConfigurationFile implements ConfigurationToken {
...
Each of the other two variations needs a grammar rule of its own:
package com.example;
public class CommentLine implements ConfigurationLine {
private static final String MARKER = "#";
@CapturePattern(until="\n")
private String comment;
}
In this class we used a private static final String MARKER field to instruct the parser to match the string constant "#" as the first element of our token. If the parser tests the CommentLine rule against a key-value line before it tries the ValueLine rule, the CommentLine rule will fail to match, because the first non-whitespace character on such a line does not equal the value of this static field. We then used the CapturePattern annotation with the until parameter on the comment field, which instructs the parser to match any characters up to the newline character, so any text after the "#" literal and up to the end of the line is captured into the field.
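The matching behaviour described above can be approximated in plain Java. The sketch below does not use Linker-Parser at all. The class CommentLineSketch and its method matchComment are hypothetical names introduced only for illustration, and whether the library also consumes the terminating newline is not something this sketch tries to model:

```java
import java.util.Optional;

// Plain-Java illustration only; Linker-Parser performs this matching for us.
final class CommentLineSketch {
    private static final String MARKER = "#";

    /** Returns the comment text if the input starts with MARKER, otherwise empty. */
    static Optional<String> matchComment(String input) {
        // Skip leading whitespace, similar in spirit to @IgnoreCharacters(" \r\t\n")
        int start = 0;
        while (start < input.length() && " \r\t\n".indexOf(input.charAt(start)) >= 0) {
            start++;
        }
        // The static String field acts as a literal that must come first
        if (!input.startsWith(MARKER, start)) {
            return Optional.empty();
        }
        // Capture everything up to, but not including, the next newline
        int from = start + MARKER.length();
        int end = input.indexOf('\n', from);
        return Optional.of(input.substring(from, end < 0 ? input.length() : end));
    }

    public static void main(String[] args) {
        System.out.println(matchComment("# a comment\nkey = value\n")); // prints "Optional[ a comment]"
        System.out.println(matchComment("key = value\n"));              // prints "Optional.empty"
    }
}
```

The second call shows the failure case from the text: a key-value line does not start with "#", so the comment rule is rejected and another rule can be tried.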
package com.example;
@IgnoreCharacters(" \r\t\n")
public class ValueLine implements ConfigurationLine {
@CapturePattern("[a-zA-Z_$][a-zA-Z0-9_$]*")
private String key;
private static final String EQUALS = "=";
@CapturePattern(until="\n")
private String value;
}
In this class we again used the @IgnoreCharacters annotation to ignore whitespace characters, this time inside the line itself (whitespace characters at the beginning of a line are already ignored because of the similar annotation on the ConfigurationFile rule). We then used the CapturePattern annotation again, but this time we provided a regular expression that specifies which characters can go into the key field. Please note that only character groups can be used here: terminal-based expressions are not supported by the matching algorithm, because a CapturePattern expression must result in Matcher::lookingAt returning true for the matcher to work. We used a static field to match another terminal token ("=") and then another CapturePattern annotation to match the configuration value up to the end of the line.
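Since the note above hinges on Matcher::lookingAt, a short standalone demonstration of that standard java.util.regex method may help. It uses the same expression as the key field. The class LookingAtSketch and its method leadingKey are hypothetical names, and exactly how Linker-Parser feeds its input to the matcher is an assumption not covered here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookingAtSketch {
    // Same expression as the one on the key field of ValueLine
    private static final Pattern KEY = Pattern.compile("[a-zA-Z_$][a-zA-Z0-9_$]*");

    /** Returns the key at the start of the input, or null if the input does not start with one. */
    static String leadingKey(CharSequence input) {
        Matcher matcher = KEY.matcher(input);
        // lookingAt() is anchored at the start of the input but need not consume all of it
        return matcher.lookingAt() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingKey("timeout = 30")); // prints "timeout"
        System.out.println(leadingKey("9lives = 1"));   // prints "null": a key cannot start with a digit
        // matches() would require the whole input to be a key, so it is false here
        System.out.println(KEY.matcher("timeout = 30").matches()); // prints "false"
    }
}
```

Unlike Matcher::matches, lookingAt succeeds as soon as a prefix of the input satisfies the expression, which is what allows a pattern to capture one field and leave the rest of the line for the following fields.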
Once the grammar classes are defined, they can be used to create a new instance of the TokenGrammar class by invoking the TokenGrammar::forClass method:
TokenGrammar<ConfigurationFile> grammar = TokenGrammar.forClass(ConfigurationFile.class);
The resulting TokenGrammar instance can then be used to parse our configuration files by passing a name for the input and a Reader over its contents to the parse method:
FileReader reader = new FileReader("somefile");
ConfigurationFile config = grammar.parse("somefile", reader);
The @IgnoreCharacters annotation has one very important limitation: it only ignores leading characters located before the subtokens of the annotated token, so trailing whitespace characters in the parser input require special handling. For example, the grammar that we've created so far will fail to parse any configuration file that ends with empty lines or with lines consisting entirely of whitespace characters. To fix that, we need to pass the list of trailing characters to ignore to TokenGrammar::ignoreTrailingCharacters:
grammar.ignoreTrailingCharacters(" \t\r\n");
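Conceptually, ignoring trailing characters amounts to discarding the final run of input that consists only of characters from the given set. The plain-Java sketch below illustrates that idea. It is not the library's implementation, and TrailingCharactersSketch and stripTrailing are hypothetical names used only for this example:

```java
final class TrailingCharactersSketch {
    /** Returns the input without its trailing run of characters taken from the ignored set. */
    static String stripTrailing(String input, String ignored) {
        int end = input.length();
        // Walk backwards while the last remaining character is in the ignored set
        while (end > 0 && ignored.indexOf(input.charAt(end - 1)) >= 0) {
            end--;
        }
        return input.substring(0, end);
    }

    public static void main(String[] args) {
        String source = "key = value\n\n  \t\n";
        // Whitespace inside the line is kept, only the trailing run is dropped
        System.out.println(stripTrailing(source, " \t\r\n")); // prints "key = value"
    }
}
```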