JSON Lexer
The lexer converts raw characters (from a string, a file, etc.) into tokens, which are meaningful chunks of the input.
Although a parser can work directly on the characters, doing so is usually far more complicated and error-prone. A lexer can be seen as a preprocessing step that removes insignificant whitespace and attaches context information to groups of characters.
This resembles the way humans read: text is not normally read character by character; instead each word acts as a token, and the combination of words carries the meaning. For example, the word "quick" is not read as 'q' 'u' 'i' 'c' 'k', but as a single word.
JSON itself is quite minimal and so it does not need many tokens:
Token | Meaning | Example |
---|---|---|
NULL | The value null in JSON. It is a fixed sequence of characters. | `null` |
TRUE | The value true in JSON. It is a fixed sequence of characters. | `true` |
FALSE | The value false in JSON. It is a fixed sequence of characters. | `false` |
STRING | The string datatype delimited by `"` (can have escape sequences). | `"Some string"` |
NUMBER | A number that can either be an integer or a floating point (can also use scientific notation). | `42` |
CURLY_OPEN | Marks the beginning of an Object in JSON. | `{` |
CURLY_CLOSE | Marks the end of an Object in JSON. | `}` |
SQUARE_OPEN | Marks the beginning of an Array in JSON. | `[` |
SQUARE_CLOSE | Marks the end of an Array in JSON. | `]` |
COMMA | Used to separate different elements inside an Object or an Array. | `,` |
COLON | Used to separate the key from the value inside an element of an Object. | `:` |
EOF | Marks that there are no tokens left (this is the last token) | - |
Note
Any other sequence of characters results in an error.
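The token table above could be modelled roughly as follows. This is only a sketch: the names `TokenType`, `Token` and `LexerException` are assumptions made for illustration and are not necessarily the types used in KotlinJsonParser.

```kotlin
// Hypothetical token kinds, one per row of the table above.
enum class TokenType {
    NULL, TRUE, FALSE, STRING, NUMBER,
    CURLY_OPEN, CURLY_CLOSE, SQUARE_OPEN, SQUARE_CLOSE,
    COMMA, COLON, EOF
}

// A token pairs its kind with an optional value
// (only STRING and NUMBER tokens need to carry one).
data class Token(val type: TokenType, val value: String? = null)

// Hypothetical exception for any character sequence that forms no token.
class LexerException(message: String) : Exception(message)
```

Using a `data class` gives structural equality for free, which is convenient when comparing expected and actual tokens in tests.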
Each time the lexer is called, it lexes the token that starts at its current position. Before attempting to lex a token, it skips whitespace until it reaches a character that is not in the whitespace category.
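A minimal sketch of the whitespace skipping, assuming the input is held in a `String` and the cursor is a plain index. The function names are hypothetical; the four whitespace characters are the ones the JSON grammar allows between tokens.

```kotlin
// JSON whitespace: space, horizontal tab, line feed, carriage return.
fun isJsonWhitespace(c: Char): Boolean =
    c == ' ' || c == '\t' || c == '\n' || c == '\r'

// Returns the index of the first non-whitespace character at or after `start`,
// or input.length if only whitespace remains.
fun skipWhitespace(input: String, start: Int): Int {
    var pos = start
    while (pos < input.length && isJsonWhitespace(input[pos])) pos++
    return pos
}
```

The bounds check comes before the character access, so input that ends in whitespace does not cause an out-of-bounds read.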
The single-character tokens (`{`, `}`, `[`, `]`, `,`, `:`) can be matched by checking whether the current character is one of the characters in this list. If it is, the lexer emits the corresponding token and moves to the next character, as these tokens have a length of one character.
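One possible way to express this lookup is a map from character to token name. The names `singleCharTokens` and `matchSingleChar` are hypothetical, and token kinds are represented as plain strings here only to keep the sketch self-contained.

```kotlin
// Hypothetical lookup table from character to token name.
val singleCharTokens: Map<Char, String> = mapOf(
    '{' to "CURLY_OPEN",
    '}' to "CURLY_CLOSE",
    '[' to "SQUARE_OPEN",
    ']' to "SQUARE_CLOSE",
    ',' to "COMMA",
    ':' to "COLON"
)

// Returns the token name and the cursor position after the token,
// or null if the character at `pos` is not a single-character token.
fun matchSingleChar(input: String, pos: Int): Pair<String, Int>? {
    val c = input.getOrNull(pos) ?: return null
    val name = singleCharTokens[c] ?: return null
    return Pair(name, pos + 1) // these tokens are always one character long
}
```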
The literals (`null`, `true`, `false`) are fixed sequences of characters that must match exactly. As each of them starts with a different character, the lexer can check whether the current character is `n`, `t` or `f` and then check whether the following characters match as well. If they do, the corresponding token is emitted.
Note
If the first character matches but the rest does not, the lexer can throw an error, as no other token can start this way.
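The literal matching, including the error case from the note, might look like this. It is a sketch under assumptions: `matchLiteral` is a hypothetical name, token kinds are plain strings, and `IllegalArgumentException` stands in for whatever error type the real lexer uses.

```kotlin
// Returns the token name and the cursor position after the literal,
// or null if the character at `pos` cannot start a literal.
// Throws if the text starts like a literal but does not match it completely.
fun matchLiteral(input: String, pos: Int): Pair<String, Int>? {
    // The first character alone decides which literal is expected.
    val (text, name) = when (input.getOrNull(pos)) {
        'n' -> "null" to "NULL"
        't' -> "true" to "TRUE"
        'f' -> "false" to "FALSE"
        else -> return null
    }
    if (!input.startsWith(text, pos)) {
        throw IllegalArgumentException("Expected '$text' at position $pos")
    }
    return Pair(name, pos + text.length)
}
```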
If the input ends and there is nothing left to lex, the `EOF` token must be emitted.
This should be checked first, and the cursor should not move past the end of the input, as reading beyond it could lead to unwanted exceptions.
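The ordering can be illustrated with a deliberately simplified cursor. The class `EofAwareCursor` is hypothetical and, to stay short, returns each character as a stand-in for a real token; the only point is that the end-of-input check runs before any character is read.

```kotlin
// Hypothetical, simplified cursor: every character stands in for a token.
class EofAwareCursor(private val input: String) {
    private var pos = 0

    fun next(): String {
        // Checked first: at the end of the input the position is left
        // untouched, so repeated calls keep returning "EOF".
        if (pos >= input.length) return "EOF"
        return input[pos++].toString()
    }
}
```

Because the position is never advanced past `input.length`, calling `next()` after the end is harmless instead of throwing an index error.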
A string always starts with a `"` character, can contain any character except the quotation mark, the reverse solidus, and the control characters (U+0000 through U+001F), and ends with another `"`. Some characters can, however, be written as escape sequences introduced by a reverse solidus (`\`):
Escaped | Meaning | Unicode value |
---|---|---|
`"` | Quotation Mark | U+0022 |
`\` | Reverse Solidus | U+005C |
`/` | Solidus | U+002F |
`b` | Backspace | U+0008 |
`f` | Form Feed | U+000C |
`n` | Line Feed | U+000A |
`r` | Carriage Return | U+000D |
`t` | Tab | U+0009 |
`uXXXX` | Unicode Value | U+XXXX |
The lexer emits a `STRING` token whose value is the content of the string with the escape sequences already resolved.
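The string rules and the escape table can be sketched as one function. `lexString` is a hypothetical name and `IllegalArgumentException` is an assumed error type. Each `uXXXX` escape is decoded into a single UTF-16 code unit, so a surrogate pair written as two escapes ends up as two adjacent `Char`s.

```kotlin
// Reads a JSON string whose opening quote is at `start`.
// Returns the decoded value and the index just past the closing quote.
fun lexString(input: String, start: Int): Pair<String, Int> {
    require(input.getOrNull(start) == '"') { "Expected '\"' at position $start" }
    val sb = StringBuilder()
    var pos = start + 1
    while (pos < input.length) {
        val c = input[pos++]
        when {
            c == '"' -> return Pair(sb.toString(), pos)
            // U+0000 through U+001F must not appear unescaped.
            c < ' ' -> throw IllegalArgumentException("Control character at position ${pos - 1}")
            c != '\\' -> sb.append(c)
            else -> {
                val esc = input.getOrNull(pos++) ?: break // input ended after the backslash
                when (esc) {
                    '"', '\\', '/' -> sb.append(esc)
                    'b' -> sb.append('\b')
                    'f' -> sb.append('\u000C')
                    'n' -> sb.append('\n')
                    'r' -> sb.append('\r')
                    't' -> sb.append('\t')
                    'u' -> {
                        require(pos + 4 <= input.length) { "Incomplete unicode escape" }
                        val hex = input.substring(pos, pos + 4)
                        require(hex.all { it in '0'..'9' || it in 'a'..'f' || it in 'A'..'F' }) {
                            "Invalid unicode escape '$hex'"
                        }
                        sb.append(hex.toInt(16).toChar())
                        pos += 4
                    }
                    else -> throw IllegalArgumentException("Invalid escape character '$esc'")
                }
            }
        }
    }
    throw IllegalArgumentException("Unterminated string starting at position $start")
}
```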
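The number syntax is a little more complicated: a number consists of an optional minus sign, an integer part without leading zeros, an optional fraction and an optional exponent.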
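Following the number grammar of the JSON specification (optional minus sign, integer part, optional fraction, optional exponent), a scanner could be sketched like this. `lexNumber` is a hypothetical name, and the function returns the raw text of the number rather than a converted value, which is an assumption about how the token might be represented.

```kotlin
// Scans a JSON number starting at `start`.
// Returns its text and the index just past it.
// Grammar: -? (0 | [1-9][0-9]*) (. [0-9]+)? ((e|E) (+|-)? [0-9]+)?
fun lexNumber(input: String, start: Int): Pair<String, Int> {
    var pos = start

    // Consumes a run of digits and fails if there is not at least one.
    fun digits() {
        val begin = pos
        while (pos < input.length && input[pos] in '0'..'9') pos++
        if (pos == begin) throw IllegalArgumentException("Expected digit at position $pos")
    }

    if (input.getOrNull(pos) == '-') pos++
    // A leading zero must stand alone; otherwise read all integer digits.
    if (input.getOrNull(pos) == '0') pos++ else digits()
    if (input.getOrNull(pos) == '.') {
        pos++
        digits()
    }
    if (input.getOrNull(pos) == 'e' || input.getOrNull(pos) == 'E') {
        pos++
        if (input.getOrNull(pos) == '+' || input.getOrNull(pos) == '-') pos++
        digits()
    }
    return Pair(input.substring(start, pos), pos)
}
```

For an input such as `012` this sketch stops after the `0`, so the remaining `12` would surface as a separate token and be rejected later.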

KotlinJsonParser is a toy project created for educational purposes. It is not intended for production use, and while efforts have been made to ensure the code is functional, it may not cover all edge cases or be optimized for performance. Use at your own risk.
But most importantly: Have fun! :D
The content of this wiki is licensed under the CC-BY-NC-SA 4.0 License.
