02. What is a tokenizer?
A tokenizer (aka scanner, lexical analyzer, lexer) is a component that performs the lexical analysis of the input source text.
That is, it:
- Reads the input string of characters and groups them into atomic units that are meaningful in the language under consideration. These units are called tokens.
- Leaves aside some input characters, such as spaces, tabs and newlines, or some text parts (e.g. comment text).
- Attaches data (like the position of the piece of source text) to tokens so that the parser can report user-friendly error messages.
- Performs some normalization/formatting of the input values.
The last point is overlooked in most compiler/interpreter textbooks. To make it concrete, consider the following Ruby variable assignments:
```ruby
some_var = 0377        # Octal notation
some_var = 0b11111111  # Binary notation
some_var = 0xFF        # Hexadecimal notation
```
Although each line is distinct from its siblings from a lexical point of view (i.e. the characters that make up the assignment expressions are different), these expressions are equivalent since the assigned value stays the same (255 in decimal notation). It is the role of the tokenizer to cope with these input variants and to deliver an output in a standard internal representation.
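As an illustration, here is a minimal sketch in plain Ruby (the method name `normalize_integer` is purely illustrative); it relies on the fact that `Kernel#Integer` understands the octal, binary and hexadecimal prefixes:

```ruby
# Normalize a Ruby-style integer literal into a plain Integer value.
# Kernel#Integer recognizes the 0 (octal), 0b (binary) and 0x (hex) prefixes.
def normalize_integer(lexeme)
  Integer(lexeme)
end

normalize_integer('0377')       # => 255
normalize_integer('0b11111111') # => 255
normalize_integer('0xFF')       # => 255
```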
As another example, consider the case of a case-insensitive language (i.e. one that doesn't differentiate capital and small letters, like Basic or Pascal). In these languages the variable `myvar` can equally be written `Myvar`, `MyVar`, `MYVAR` or any other lower/uppercase mix.
In such a scenario, it makes sense for the tokenizer to convert every identifier it encounters into one single format (say, all lowercase). This simplifies the subsequent processing stages (e.g. simpler string comparisons or lookups in a Hash with String keys).
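Such a normalization can be as simple as the following sketch (again, the method name is purely illustrative):

```ruby
# Map every spelling of an identifier from a case-insensitive language
# onto one canonical form (here: all lowercase).
def normalize_identifier(lexeme)
  lexeme.downcase
end

%w[myvar Myvar MyVar MYVAR].map { |id| normalize_identifier(id) }
# => ["myvar", "myvar", "myvar", "myvar"]
```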
A token is a data structure that represents a group of one or more characters from the input source that has an atomic meaning in the language at hand.
That data structure typically holds:
- The category (token type) to which the raw text belongs,
- The raw substring from the input (often called the *lexeme* in compiler books),
- Optionally, the position of that raw substring within the source (for instance, the line and column numbers),
- Optionally, the *normalized* value of the raw substring (if the language attaches a value to the substring).
Let's assume that we wrote a scanner for the Ruby language in Ruby itself. For the expression:
```ruby
some_var = 0b11111111
```
a simple tokenizer could return a sequence of tokens similar to:
```ruby
[
  [:IDENTIFIER, 'some_var'],
  [:EQUAL, '='],
  [:INTEGER, '0b11111111']
]
```
Here, the tokens are implemented as couples (two-element arrays): the first element is the token type, implemented with a Ruby Symbol; the second element is the raw text counterpart (the "lexeme").
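For the curious reader, one possible (and deliberately naive) implementation of such a tokenizer is sketched below, built on the `StringScanner` class from the Ruby standard library. The regular expressions are simplified and only cover the three token types of the example:

```ruby
require 'strscan'

# A naive tokenizer returning [token_type, lexeme] couples.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/) # leave whitespace aside

    if (lexeme = scanner.scan(/0[bxBX][0-9a-fA-F]+|\d+/))
      tokens << [:INTEGER, lexeme]
    elsif (lexeme = scanner.scan(/[a-z_]\w*/))
      tokens << [:IDENTIFIER, lexeme]
    elsif (lexeme = scanner.scan(/=/))
      tokens << [:EQUAL, lexeme]
    else
      raise "Unrecognized character: #{scanner.getch}"
    end
  end
  tokens
end

tokenize('some_var = 0b11111111')
# => [[:IDENTIFIER, "some_var"], [:EQUAL, "="], [:INTEGER, "0b11111111"]]
```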
Of course, other, more sophisticated tokenizers could be devised.
In fact, Rley expects the tokenizer to return a sequence of `Rley::Lexical::Token` objects.
So the expected tokenizer output for the above expression is a sequence of three `Token` objects, one for each token.
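In code form, that sequence could be built along the following lines. Caveat: this is only a sketch; the exact signatures of the `Rley::Lexical::Token` and `Rley::Lexical::Position` constructors may differ between Rley versions, so check the API documentation of the version you use:

```ruby
require 'rley'

include Rley::Lexical

# Assumed constructor: Token.new(lexeme, terminal_name, position),
# where a Position is a line/column pair within the source text.
tokens = [
  Token.new('some_var',   'IDENTIFIER', Position.new(1, 1)),
  Token.new('=',          'EQUAL',      Position.new(1, 10)),
  Token.new('0b11111111', 'INTEGER',    Position.new(1, 12))
]

tokens.first.lexeme   # => "some_var"
tokens.first.terminal # => "IDENTIFIER"
```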
In a `Token` instance, the `lexeme`* attribute holds the raw text from the input that matches the token "category" given in the `terminal` attribute. The adjective 'terminal' is a shorthand for 'terminal symbol', a terminology in use when defining context-free grammars.
`Token` instances also have a `position` attribute. Integer literals have an additional attribute, named `value`.
* To add to the confusion, in linguistics and in natural language processing, the word *lexeme* has a very different meaning than in the computer language community.