02. What is a tokenizer?

A tokenizer (aka scanner, lexical analyzer, lexer) is a component that performs the lexical analysis of the input source text.
That is, it:

  1. Reads the input string of characters and groups these characters into atomic units that are meaningful in the language under consideration. These units are called tokens,
  2. Leaves aside some input characters such as spaces, tabs and newlines, or some text parts (e.g. comment text),
  3. Attaches data (like the position of the piece of source text) to tokens so that the parser can report user-friendly error messages,
  4. Performs some normalization/formatting of the input values.

The last point is overlooked in most compiler/interpreter textbooks. To make it clear, consider the following Ruby variable assignments:

some_var = 0377       # Octal notation
some_var = 0b11111111 # Binary notation
some_var = 0xFF       # Hexadecimal notation

Although each line is lexically distinct from its siblings (i.e. the characters that make up the assignment expressions are different), these expressions are semantically equivalent since the assigned value is the same (255 in decimal notation). It is the role of the tokenizer to cope with the input variants and to deliver an output in a standard internal representation.
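
To make this point concrete, here is a minimal sketch of such a normalization step (a hypothetical helper, not part of Rley). It relies on Kernel#Integer, which understands the 0 (octal), 0b and 0x prefixes:

# Hypothetical helper: map an integer literal to its numeric value.
def normalize_integer(lexeme)
  Integer(lexeme)
end

normalize_integer('0377')       # => 255
normalize_integer('0b11111111') # => 255
normalize_integer('0xFF')       # => 255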
As another example, consider the case of a case-insensitive language (i.e. one that doesn't distinguish between uppercase and lowercase letters, like Basic or Pascal). In these languages the variable myvar can equally be written as Myvar, MyVar, MYVAR or any lower/uppercase mix.
In such a scenario, it makes sense for a tokenizer to convert every identifier it encounters into one single format (say, all lowercase). This simplifies the subsequent processing stages (simpler string comparisons or lookups in a Hash with string keys).
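
Here too, a sketch of the case-folding step (a hypothetical helper, for illustration only):

# Hypothetical helper: fold every identifier to one canonical spelling.
def normalize_identifier(lexeme)
  lexeme.downcase
end

%w[myvar Myvar MyVar MYVAR].map { |id| normalize_identifier(id) }.uniq
# => ["myvar"]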

What, then, is a token?

A token is a data structure that represents a group of one or more characters from the input source that has an atomic meaning in the language at hand.
That data structure typically holds:

  • The category (token type) to which the raw text belongs,
  • The raw substring from the input (often called the lexeme in compiler books),
  • Optionally, the position of that raw substring within the source (for instance, the line and column numbers),
  • Optionally, the normalized value of the raw substring (if the language attaches a value to the substring).

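As an illustration only (this is not Rley's own token class), such a data structure could be modeled in Ruby with a Struct:

Token = Struct.new(:type, :lexeme, :position, :value)

Token.new(:INTEGER, '0b11111111', [1, 12], 255)
# => #<struct Token type=:INTEGER, lexeme="0b11111111", position=[1, 12], value=255>
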
Let's assume that we wrote a scanner for the Ruby language in Ruby itself. For the expression:

some_var = 0b11111111

a simple tokenizer could return a sequence of tokens similar to:

[
  [:IDENTIFIER, 'some_var'],
  [:EQUAL, '='],
  [:INTEGER, '0b11111111']
]

Here, the tokens are implemented as couples (two-element arrays): the first element is the token type, implemented with a Ruby Symbol; the second is the raw text counterpart (the "lexeme"). Of course, other, more sophisticated tokenizers could be devised.
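
For instance, here is a minimal sketch of such a tokenizer, built on StringScanner from Ruby's standard library (it only recognizes the three token types of this example):

require 'strscan'

def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/)                      # ignore whitespace
    if (lexeme = scanner.scan(/0b[01]+|0x\h+|\d+/))  # integer literals
      tokens << [:INTEGER, lexeme]
    elsif (lexeme = scanner.scan(/[a-z_]\w*/))       # identifiers
      tokens << [:IDENTIFIER, lexeme]
    elsif (lexeme = scanner.scan(/=/))
      tokens << [:EQUAL, lexeme]
    else
      raise "Unexpected character: #{scanner.getch}"
    end
  end
  tokens
end

tokenize('some_var = 0b11111111')
# => [[:IDENTIFIER, "some_var"], [:EQUAL, "="], [:INTEGER, "0b11111111"]]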
In fact, Rley expects the tokenizer to return a sequence of Rley::Lexical::Token objects.
So the expected tokenizer output for the above expression can be depicted as follows:

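Sketched in a loose textual notation (the attribute names are those described below; the exact class layout is Rley's own):

Token(terminal: 'IDENTIFIER', lexeme: 'some_var',   position: line 1, column 1)
Token(terminal: 'EQUAL',      lexeme: '=',          position: line 1, column 10)
Token(terminal: 'INTEGER',    lexeme: '0b11111111', position: line 1, column 12, value: 255)
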
In a Token instance, the lexeme* attribute holds the raw text from the input that matches the token "category" mentioned in the terminal attribute. The adjective 'terminal' is a shorthand for 'terminal symbol', a terminology in use when defining context-free grammars.
Token instances also have a position attribute. Integer literals have an additional attribute, named value.

* To add to the confusion, in linguistics and in natural language processing the word lexeme has a meaning quite different from the one used in the computer language community.