Architecture overview

Main parsing loop

Linker-parser implements source parsing by continuously trying to match parser input against ConsumingTokens used to populate terminal tokens defined in grammar tree specified by result token type configured prior to the parsing. A single parser iteration corresponds to a testing cycle of a single ConsumingToken. Each iteration can result in one of several parsing states:

Parser advances to another consuming token in the grammar tree
Parser fails to advance to another token without having the AST populated, which indicates a syntax error in the parser input.
Parser populates the AST before hitting the end of parser input, which may indicate both a syntax error and ambiguous grammar (the parsing algorithm will try to recover from the latter situation by discarding prematurely populated grammar junction and proceeding to next available grammar junctions).
Parser hits the end of input before populating the AST, which may indicate either a syntax error or another situation when parser followed a wrong grammar junction and needs to traceback to another available junction).

Character consumption

Each parser loop iteration normally starts with a consumption cycle in which ConsumingToken::consume is continuously invoked until it returns false.

Internally, ConsumingToken instances usually handle consume calls by advancing their internal pointers to parser buffer associated with AST root token and then passing parts of the buffer that correspond to these pointers to pre-configured TokenMatcher instance that tests them against token's pattern (usually defined either as value of private static final field or with @CapturePattern field annotation) and returns TokenTestResult structure that can contain matched token, its length (which may be bigger than the length of returned matched token), and a TestResult value which can be one of:

TestResult.MATCH - indicates a full token match upon which ConsumingToken can consider itself populated, should stop consuming any more characters and return any characters that did not match the token.
TestResult.MATCH_CONTINUE - indicates a partial token match which usually causes ConsumingToken to continue consuming characters from parser. The token, however, may treat this result as TestResult.MATCH, especially when parser indicated that currently consumed character is the last character in parser input.
TestResult.CONTINUE - indicates a possible token match which usually causes ConsumingToken to continue consuming characters from parser. The token, however, may treat this result as TestResult.FAIL, especially when parser indicated that current character is the last one.
TestResult.FAIL - indicates that this token does not match consumed input and should be discarded. This usually causes ConsumingToken to trigger traceback operation by calling PartialToken::pushback on its parent.

Note: refusal to consume characters at this stage indicates only that current token is done consuming characters and may mean both token failure and success.

Token advancement

After current token refuses to consume next character, TokenGrammar checks if the token was successfully populated. If it wasn't and reported matching failure, then TokenGrammar backtraces from it to the first previously populated token that has untested alternatives and tries to advance from that token, otherwise token advancement is performed from the populated token.

Token advancement is performed by searching the next unpopulated CompoundToken in current AST path and then calling CompoundToken::nextChild on that token. If the call returns another CompoundToken, then parser ascends into that token and repeats token advancement procedure on that token until CompoundToken::nextChild returns a ConsumingToken that can be used to perform the next character consumption operation.

End of input handling

In an ideal parsing operation, the last character of parser input should be consumed by the root token of AST for that input. After that root token usually signals the parser to stop processing by advancing it to null token. However, in some cases, parser may either hit end of input before it receives the null token, or it may have some unparsed characters in the buffer when such token is received. The first case is always treated as error and will result in ParserError exception thrown. The second case is more complicated as it may happen when parser followed wrong alternative while populating the AST. To ensure that this is not the case, TokenGrammar tries to perform several operations on the current token:

Forced token rotation -- may be used in case of mathematical equations.
Forced token traceback -- some tokens in the AST may still has alternative routes available. The parser will trace the AST back using right-handed DFS algorithm until it finds the first token with available alternatives (PartialToken::alternativesLeft > 0) and restart parsing with its next alternative.

Only after failing to either rotate current token or trace AST back to an alternative a ParserError exception will be thrown.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Architecture overview

Main parsing loop

Character consumption

Token advancement

End of input handling

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally