
JSON Lexer

QuickWrite edited this page Sep 3, 2024 · 1 revision

The lexer converts the raw characters (string, file, etc.) into tokens that are meaningful chunks.

Why a lexer?

Even though it is possible to build a parser that consumes the raw string directly, doing so is in most cases more complicated and more error prone. A lexer can be seen as a preprocessing step that removes all the unnecessary whitespace and adds context information to groups of characters.

This is also similar to the way humans read text: text normally isn't read character by character; instead, each word acts like a token, and the combination of words carries the meaning. For example, the word "quick" isn't read as 'q' 'u' 'i' 'c' 'k', but as a single word.

Tokens

JSON itself is quite minimal and so it does not need many tokens:

| Token | Meaning | Example |
| --- | --- | --- |
| NULL | The value null in JSON. It is a fixed sequence of characters. | `null` |
| TRUE | The value true in JSON. It is a fixed sequence of characters. | `true` |
| FALSE | The value false in JSON. It is a fixed sequence of characters. | `false` |
| STRING | The string datatype delimited by `"` (can have escape sequences). | `"Some string"` |
| NUMBER | A number that can either be an integer or a floating point (can also use scientific notation). | `42` |
| CURLY_OPEN | Marks the beginning of an Object in JSON. | `{` |
| CURLY_CLOSE | Marks the end of an Object in JSON. | `}` |
| SQUARE_OPEN | Marks the beginning of an Array in JSON. | `[` |
| SQUARE_CLOSE | Marks the end of an Array in JSON. | `]` |
| COMMA | Used to separate different elements inside an Object or an Array. | `,` |
| COLON | Used to separate the key from the value inside an element of an Object. | `:` |
| EOF | Marks that there are no tokens left (this is the last token). | - |
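The token kinds from the table can be written down directly; as a minimal sketch, here is one possible representation using a Python enum (the class name `TokenType` is an assumption, not part of the wiki):

```python
from enum import Enum, auto

class TokenType(Enum):
    """One variant per token kind from the table above."""
    NULL = auto()
    TRUE = auto()
    FALSE = auto()
    STRING = auto()
    NUMBER = auto()
    CURLY_OPEN = auto()
    CURLY_CLOSE = auto()
    SQUARE_OPEN = auto()
    SQUARE_CLOSE = auto()
    COMMA = auto()
    COLON = auto()
    EOF = auto()
```

A lexed token would then typically pair one of these kinds with the matched text or value.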

Note

Any other sequence of characters results in an error.

Parsing strategy

When called, the lexer parses starting at the character it is currently at. Before attempting to parse a single token, it always skips the unnecessary whitespace until it views a character that isn't in the whitespace category.
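The whitespace-skipping step can be sketched as a small helper (a sketch, assuming the lexer tracks its position as an integer index into the input string):

```python
# JSON allows exactly these four whitespace characters between tokens.
WHITESPACE = {" ", "\t", "\n", "\r"}

def skip_whitespace(text: str, pos: int) -> int:
    """Advance past any whitespace and return the new position."""
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos
```

The lexer would call this once at the start of every attempt to read a token.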

Single character tokens

The single-character tokens ({, }, [, ], ,, :) can easily be matched by checking whether the character currently being viewed is one of the characters in this list. If so, the lexer can emit the corresponding token and advance to the next character, as these tokens have a length of $1$.
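This lookup can be expressed as a dictionary from character to token kind (a sketch; the token kinds are plain strings here, and the tuple return shape is an assumption):

```python
# Each single-character token maps directly to its token kind.
SINGLE_CHAR_TOKENS = {
    "{": "CURLY_OPEN",
    "}": "CURLY_CLOSE",
    "[": "SQUARE_OPEN",
    "]": "SQUARE_CLOSE",
    ",": "COMMA",
    ":": "COLON",
}

def lex_single_char(text: str, pos: int):
    """Return (token_kind, new_pos) for a single-character token, else None."""
    kind = SINGLE_CHAR_TOKENS.get(text[pos])
    if kind is None:
        return None
    return kind, pos + 1
```

Returning `None` lets the caller fall through to the other token rules (literals, strings, numbers).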

Literals

The literals are fixed sequences of characters that must always match exactly (null, true, false). As each of these tokens starts with a different character, the lexer can check whether the current character is n, t, or f, and then check whether the following characters also match. If they do, the corresponding token can be emitted.

Note

If the first character matches but the rest does not, the lexer can throw an error immediately, as no other token starts with that character.
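Both the first-character dispatch and the error case from the note can be sketched like this (names are illustrative, not from the wiki):

```python
# First character → (full literal, token kind). The first characters are distinct.
LITERALS = {
    "n": ("null", "NULL"),
    "t": ("true", "TRUE"),
    "f": ("false", "FALSE"),
}

def lex_literal(text: str, pos: int):
    """Return (token_kind, new_pos) for null/true/false, None if not a literal,
    and raise if the literal starts correctly but does not match fully."""
    entry = LITERALS.get(text[pos])
    if entry is None:
        return None
    word, kind = entry
    if text.startswith(word, pos):
        return kind, pos + len(word)
    # First character matched but the rest did not: no other token can start here.
    raise ValueError(f"invalid literal at position {pos}")
```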

EOF

If the input ends and there is nothing left to lex, the EOF token must be emitted. This should be checked first, and the cursor should not move past the end of the input, as this could lead to unwanted exceptions.
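As a sketch, this check is a simple bounds test that deliberately does not advance the cursor:

```python
def lex_eof(text: str, pos: int):
    """Emit EOF without moving the cursor when the input is exhausted."""
    if pos >= len(text):
        return "EOF", pos  # position stays put; repeated calls keep yielding EOF
    return None
```

Because the position is unchanged, indexing `text[pos]` in the later rules is only ever done after this check has failed, so it cannot go out of bounds.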

String

A string always starts with a " character, ends with another ", and in between can contain any character except the quotation mark, the reverse solidus, and the control characters (U+0000 through U+001F). Some characters can be escaped with a reverse solidus (\):

| Escaped | Meaning | Unicode value |
| --- | --- | --- |
| `\"` | Quotation Mark | U+0022 |
| `\\` | Reverse Solidus | U+005C |
| `\/` | Solidus | U+002F |
| `\b` | Backspace | U+0008 |
| `\f` | Form Feed | U+000C |
| `\n` | Line Feed | U+000A |
| `\r` | Carriage Return | U+000D |
| `\t` | Tab | U+0009 |
| `\uXXXX` | Unicode Value | U+XXXX |

The lexer will emit a string token with the value of the string where the escape sequences are already parsed.
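The rules above can be sketched as one loop that copies plain characters and resolves escape sequences as it goes (a sketch; surrogate pairs for `\uXXXX` above U+FFFF are ignored here for brevity):

```python
# The single-character escapes from the table above.
ESCAPES = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}

def lex_string(text: str, pos: int):
    """Lex a string starting at an opening quote; return (value, new_pos)."""
    assert text[pos] == '"'
    pos += 1
    chars = []
    while True:
        if pos >= len(text):
            raise ValueError("unterminated string")
        ch = text[pos]
        if ch == '"':
            return "".join(chars), pos + 1      # closing quote ends the token
        if ch == "\\":
            esc = text[pos + 1]
            if esc == "u":
                # \uXXXX: four hex digits naming a code point
                chars.append(chr(int(text[pos + 2:pos + 6], 16)))
                pos += 6
            elif esc in ESCAPES:
                chars.append(ESCAPES[esc])
                pos += 2
            else:
                raise ValueError(f"invalid escape \\{esc}")
        elif ord(ch) < 0x20:
            raise ValueError("unescaped control character in string")
        else:
            chars.append(ch)
            pos += 1
```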

Number

The number syntax is a little bit more complicated:

[Diagram: parsing of a JSON number]
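As a sketch, the JSON number grammar (an optional minus sign, an integer part without leading zeros, an optional fraction, and an optional exponent) can be matched with a regular expression; the regex and helper names here are illustrative, not from the wiki:

```python
import re

# -? int frac? exp?  — mirrors the JSON number grammar
NUMBER_RE = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")
INT_RE = re.compile(r"-?(?:0|[1-9][0-9]*)")

def lex_number(text: str, pos: int):
    """Match a number at pos; return (value, new_pos) or None."""
    m = NUMBER_RE.match(text, pos)
    if m is None:
        return None
    literal = m.group()
    # A number without fraction or exponent can be kept as an integer.
    value = int(literal) if INT_RE.fullmatch(literal) else float(literal)
    return value, m.end()
```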