-
Notifications
You must be signed in to change notification settings - Fork 0
Architecture overview
Linker-parser implements source parsing by continuously trying to match parser input against ConsumingTokens used to populate terminal tokens defined in grammar tree specified by result token type configured prior to the parsing. A single parser iteration corresponds to a testing cycle of a single ConsumingToken. Each iteration can result in one of several parsing states:
- Parser advances to another consuming token in the grammar tree
- Parser fails to advance to another token without having the AST populated, which indicates a syntax error in the parser input.
- Parser populates the AST before hitting the end of parser input, which may indicate both a syntax error and ambiguous grammar (the parsing algorithm will try to recover from the latter situation by discarding prematurely populated grammar junction and proceeding to next available grammar junctions).
- Parser hits the end of input before populating the AST, which may indicate either a syntax error or another situation when parser followed a wrong grammar junction and needs to traceback to another available junction).
Each parser loop iteration normally starts with a consumption cycle in which ConsumingToken::consume is continuously invoked until it returns false.
Internally, ConsumingToken
instances usually handle consume
calls by advancing their internal pointers to parser buffer associated with AST root token and then passing parts of the buffer that correspond to these pointers to pre-configured TokenMatcher
instance that tests them against token's pattern (usually defined either as value of private static final
field or with @CapturePattern
field annotation) and returns TokenTestResult
structure that can contain matched token, its length (which may be bigger than the length of returned matched token), and a TestResult
value which can be one of:
-
TestResult.MATCH
- indicates a full token match upon whichConsumingToken
can consider itself populated, should stop consuming any more characters and return any characters that did not match the token. -
TestResult.MATCH_CONTINUE
- indicates a partial token match which usually causesConsumingToken
to continue consuming characters from parser. The token, however, may treat this result asTestResult.MATCH
, especially when parser indicated that currently consumed character is the last character in parser input. -
TestResult.CONTINUE
- indicates a possible token match which usually causesConsumingToken
to continue consuming characters from parser. The token, however, may treat this result asTestResult.FAIL
, especially when parser indicated that current character is the last one. -
TestResult.FAIL
- indicates that this token does not match consumed input and should be discarded. This usually causesConsumingToken
to trigger traceback operation by callingPartialToken::pushback
on its parent.
Note: refusal to consume characters at this stage indicates only that current token is done consuming characters and may mean both token failure and success.
After current token refuses to consume next character, TokenGrammar
checks if the token was successfully populated. If it wasn't and reported matching failure, then TokenGrammar
backtraces from it to the first previously populated token that has untested alternatives and tries to advance from that token, otherwise token advancement is performed from the populated token.
Token advancement is performed by searching the next unpopulated CompoundToken
in current AST path and then calling CompoundToken::nextChild
on that token. If the call returns another CompoundToken
, then parser ascends into that token and repeats token advancement procedure on that token until CompoundToken::nextChild
returns a ConsumingToken
that can be used to perform the next character consumption operation.
In an ideal parsing operation, the last character of parser input should be consumed by the root token of AST for that input. After that root token usually signals the parser to stop processing by advancing it to null
token. However, in some cases, parser may either hit end of input before it receives the null
token, or it may have some unparsed characters in the buffer when such token is received. The first case is always treated as error and will result in ParserError
exception thrown. The second case is more complicated as it may happen when parser followed wrong alternative while populating the AST. To ensure that this is not the case, TokenGrammar
tries to perform several operations on the current token:
- Forced token rotation -- may be used in case of mathematical equations.
- Forced token traceback -- some tokens in the AST may still has alternative routes available. The parser will trace the AST back using right-handed DFS algorithm until it finds the first token with available alternatives (
PartialToken::alternativesLeft
> 0) and restart parsing with its next alternative.
Only after failing to either rotate current token or trace AST back to an alternative a ParserError
exception will be thrown.