
JSON Lexer

QuickWrite edited this page Sep 3, 2024 · 1 revision

The lexer converts the raw characters (string, file, etc.) into tokens that are meaningful chunks.

Why a lexer?

Even though it is possible to build a parser that consumes the raw string directly, doing so is in most cases more complicated and more error prone. A lexer can be seen as a preprocessing step that removes all the unnecessary whitespace and adds context information to groups of characters.

This is also similar to the way humans read text: text normally isn't read character by character; instead, each word acts like a token, and the combination of words carries the meaning. For example, the word "quick" isn't read as 'q' 'u' 'i' 'c' 'k', but as a single word.

Tokens

JSON itself is quite minimal and so it does not need many tokens:

| Token | Meaning | Example |
| --- | --- | --- |
| NULL | The value null in JSON. It is a fixed sequence of characters. | `null` |
| TRUE | The value true in JSON. It is a fixed sequence of characters. | `true` |
| FALSE | The value false in JSON. It is a fixed sequence of characters. | `false` |
| STRING | The string datatype delimited by `"` (can have escape sequences). | `"Some string"` |
| NUMBER | A number that can either be an integer or a floating point (can also use scientific notation). | `42` |
| CURLY_OPEN | Marks the beginning of an Object in JSON. | `{` |
| CURLY_CLOSE | Marks the end of an Object in JSON. | `}` |
| SQUARE_OPEN | Marks the beginning of an Array in JSON. | `[` |
| SQUARE_CLOSE | Marks the end of an Array in JSON. | `]` |
| COMMA | Used to separate different elements inside an Object or an Array. | `,` |
| COLON | Used to separate the key from the value inside an element of an Object. | `:` |
| EOF | Marks that there are no tokens left (this is the last token). | - |
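The token kinds from the table can be written down directly; as a minimal sketch, here is one possible representation using a Python enum (the class name `TokenType` is an assumption, not part of the wiki):

```python
from enum import Enum, auto

class TokenType(Enum):
    """One variant per token kind from the table above."""
    NULL = auto()
    TRUE = auto()
    FALSE = auto()
    STRING = auto()
    NUMBER = auto()
    CURLY_OPEN = auto()
    CURLY_CLOSE = auto()
    SQUARE_OPEN = auto()
    SQUARE_CLOSE = auto()
    COMMA = auto()
    COLON = auto()
    EOF = auto()
```

A lexed token would then typically pair one of these kinds with the matched text or value.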

Note

Any other sequence of characters results in an error.

Parsing strategy

When called, the lexer parses starting at the character it is currently at. Before attempting to parse a single token, it always skips the unnecessary whitespace until it views a character that isn't in the whitespace category.
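The whitespace-skipping step can be sketched as a small helper (a sketch, assuming the lexer tracks its position as an integer index into the input string):

```python
# JSON allows exactly these four whitespace characters between tokens.
WHITESPACE = {" ", "\t", "\n", "\r"}

def skip_whitespace(text: str, pos: int) -> int:
    """Advance past any whitespace and return the new position."""
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos
```

The lexer would call this once at the start of every attempt to read a token.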

Single character tokens

The single-character tokens ({, }, [, ], ,, :) can easily be matched by checking whether the character currently being viewed is one of the characters in this list. If so, the lexer can emit the corresponding token and advance to the next character, as these tokens have a length of $1$.
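This lookup can be expressed as a dictionary from character to token kind (a sketch; the token kinds are plain strings here, and the tuple return shape is an assumption):

```python
# Each single-character token maps directly to its token kind.
SINGLE_CHAR_TOKENS = {
    "{": "CURLY_OPEN",
    "}": "CURLY_CLOSE",
    "[": "SQUARE_OPEN",
    "]": "SQUARE_CLOSE",
    ",": "COMMA",
    ":": "COLON",
}

def lex_single_char(text: str, pos: int):
    """Return (token_kind, new_pos) for a single-character token, else None."""
    kind = SINGLE_CHAR_TOKENS.get(text[pos])
    if kind is None:
        return None
    return kind, pos + 1
```

Returning `None` lets the caller fall through to the other token rules (literals, strings, numbers).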

Literals

The literals are fixed sequences of characters that must always match exactly (null, true, false). As each of these tokens starts with a different character, the lexer can check whether the current character is n, t, or f, and then check whether the following characters also match. If they do, the corresponding token can be emitted.

Note

If the first character matches but the rest does not, the lexer can throw an error immediately, as no other token starts with that character.
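Both the first-character dispatch and the error case from the note can be sketched like this (names are illustrative, not from the wiki):

```python
# First character → (full literal, token kind). The first characters are distinct.
LITERALS = {
    "n": ("null", "NULL"),
    "t": ("true", "TRUE"),
    "f": ("false", "FALSE"),
}

def lex_literal(text: str, pos: int):
    """Return (token_kind, new_pos) for null/true/false, None if not a literal,
    and raise if the literal starts correctly but does not match fully."""
    entry = LITERALS.get(text[pos])
    if entry is None:
        return None
    word, kind = entry
    if text.startswith(word, pos):
        return kind, pos + len(word)
    # First character matched but the rest did not: no other token can start here.
    raise ValueError(f"invalid literal at position {pos}")
```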

EOF

If the input ends and there is nothing left to lex, the EOF token must be emitted. This should be checked first, and the cursor should not move past the end of the input, as this could lead to unwanted exceptions.
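As a sketch, this check is a simple bounds test that deliberately does not advance the cursor:

```python
def lex_eof(text: str, pos: int):
    """Emit EOF without moving the cursor when the input is exhausted."""
    if pos >= len(text):
        return "EOF", pos  # position stays put; repeated calls keep yielding EOF
    return None
```

Because the position is unchanged, indexing `text[pos]` in the later rules is only ever done after this check has failed, so it cannot go out of bounds.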

String

A string always starts with a " character, ends with another ", and in between can contain any character except the quotation mark, the reverse solidus, and the control characters (U+0000 through U+001F). Some characters can be escaped with a reverse solidus (\):

| Escaped | Meaning | Unicode value |
| --- | --- | --- |
| `\"` | Quotation Mark | U+0022 |
| `\\` | Reverse Solidus | U+005C |
| `\/` | Solidus | U+002F |
| `\b` | Backspace | U+0008 |
| `\f` | Form Feed | U+000C |
| `\n` | Line Feed | U+000A |
| `\r` | Carriage Return | U+000D |
| `\t` | Tab | U+0009 |
| `\uXXXX` | Unicode Value | U+XXXX |

The lexer will emit a string token with the value of the string where the escape sequences are already parsed.
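The rules above can be sketched as one loop that copies plain characters and resolves escape sequences as it goes (a sketch; surrogate pairs for `\uXXXX` above U+FFFF are ignored here for brevity):

```python
# The single-character escapes from the table above.
ESCAPES = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}

def lex_string(text: str, pos: int):
    """Lex a string starting at an opening quote; return (value, new_pos)."""
    assert text[pos] == '"'
    pos += 1
    chars = []
    while True:
        if pos >= len(text):
            raise ValueError("unterminated string")
        ch = text[pos]
        if ch == '"':
            return "".join(chars), pos + 1      # closing quote ends the token
        if ch == "\\":
            esc = text[pos + 1]
            if esc == "u":
                # \uXXXX: four hex digits naming a code point
                chars.append(chr(int(text[pos + 2:pos + 6], 16)))
                pos += 6
            elif esc in ESCAPES:
                chars.append(ESCAPES[esc])
                pos += 2
            else:
                raise ValueError(f"invalid escape \\{esc}")
        elif ord(ch) < 0x20:
            raise ValueError("unescaped control character in string")
        else:
            chars.append(ch)
            pos += 1
```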

Number

The number syntax is a little bit more complicated:

[Diagram: parsing of a JSON number]
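As a sketch, the JSON number grammar (an optional minus sign, an integer part without leading zeros, an optional fraction, and an optional exponent) can be matched with a regular expression; the regex and helper names here are illustrative, not from the wiki:

```python
import re

# -? int frac? exp?  — mirrors the JSON number grammar
NUMBER_RE = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")
INT_RE = re.compile(r"-?(?:0|[1-9][0-9]*)")

def lex_number(text: str, pos: int):
    """Match a number at pos; return (value, new_pos) or None."""
    m = NUMBER_RE.match(text, pos)
    if m is None:
        return None
    literal = m.group()
    # A number without fraction or exponent can be kept as an integer.
    value = int(literal) if INT_RE.fullmatch(literal) else float(literal)
    return value, m.end()
```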