Tutorial 1c: our first tokenizer
Yes, I have to admit: for years, I wrote tokenizers (lexers, scanners, ...) by hand.
In Ruby, this is a relatively easy task thanks to the standard StringScanner
library (more on this later).
Therefore, I decided not to include lexical analysis functionality in the grammar DSL.
What? No built-in support for lexical rules in Rley?!!!
Yep.
This might come as a surprise to those who have used a PEG (Parsing Expression Grammar) library like Treetop, Parslet, Citrus, ...
But Rley is not the only kid in town that makes a strict delineation between tokenizing and parsing.
For instance, the most downloaded TOML library in Ruby is the tomlrb
gem. Internally, it uses the Racc parsing library, and this parser generator assumes that the lexical analysis is performed elsewhere.
(By the way, the tokenizer, or scanner, written in the tomlrb gem is very clean and demonstrates that there's nothing magic in writing your own tokenizer.)
While the lack of lexical analysis capability with RGN could be seen as a shortcoming, keeping tokenizing and parsing as separate components has a number of advantages:
- Modularity,
- No "whitespace noise" in grammar rules,
- Scalability.
Consider the following scenario: suppose TOML accepted additional string delimiters (say, like the %q or %Q string literals in Ruby). By keeping the tokenizer separate, one just has to change that part and leave the grammar and parser unchanged.
In the same vein, imagine that TOML started accepting C-style block comments (/* Some comment... */). As the tokenizer is typically designed to leave out comments and non-significant whitespace, that new comment style would affect neither the grammar nor the parser, as the sketch below suggests.
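To make this concrete, here is a sketch of the only change such a hypothetical extension would require (the pattern name is made up for illustration):

```ruby
# Hypothetical: if TOML ever accepted C-style block comments,
# only the tokenizer would gain one extra "skip" pattern;
# the grammar and the parser would stay untouched.
PATT_C_COMMENT = /\/\*.*?\*\//m.freeze

# ... and, inside the scanning loop:
# next if scanner.skip(PATT_C_COMMENT)
```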
Remember the grammar rule from the official TOML documentation that specified the syntax of key-value pairs?
Here it is again:
```
;; Key-Value pairs
keyval = key keyval-sep val
```
One may wonder: what's behind the keyval-sep symbol?
Well, the following rule:
```
keyval-sep = ws %x3D ws ; =
```
The %x3D entry is used in ABNF to represent the character with ASCII code 0x3D, that is, the '=' sign.
But what's more interesting are the two 'ws' symbols: they refer to yet another rule that stands for "zero or more whitespaces".
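For the record, here is how the official TOML grammar defines these whitespace rules:

```
ws = *wschar
wschar =  %x20  ; Space
wschar =/ %x09  ; Horizontal tab
```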
This kind of noise also appears in most PEG parsers because their "lexical" rules recognize whitespace and happily propagate it to higher-level rules. The end result is a grammar spoiled by noisy entries in many places.
Indeed, think of a programming language and ask yourself where whitespaces could occur in the source code.
Well, the answer is: almost everywhere. The situation is even worse if that language supports block comments, since they could occur in many places too...
With its capability to digest any context-free language, Rley was designed with potential use in NLP (Natural Language Processing) in mind.
Given the sheer complexity of lexical analysis in that domain, it would be hopeless to combine the syntax rules with the ones from (morpho)lexical analysis.
Enough self-justification; let's hit the road to tokenizer implementation.
The StringScanner
class is part of the standard Ruby library. For unclear reasons, it is relatively unknown within the community.
That's rather unfortunate since it is very helpful when one has to perform analysis of formatted text files.
The irony is that communities of other programming languages, such as JavaScript and PHP, are implementing similar libraries explicitly based on StringScanner. Python folks are still missing it.
Besides the standard documentation, here are some intros on StringScanner:
- https://blog.appsignal.com/2019/03/05/stringscanner.html
- https://dev.to/appsignal/rubys-hidden-gems-stringscanner-44i1
- https://runebook.dev/en/docs/ruby/stringscanner
- https://www.geeksforgeeks.org/ruby-stringscanner-scan-function/
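For the impatient, here is a small irb-style appetizer (the input string is arbitrary):

```ruby
require 'strscan'

scanner = StringScanner.new('title = "TOML Example"')
scanner.scan(/[A-Za-z0-9\-_]+/) # => "title"
scanner.skip(/[ \t]+/)          # => 1 (count of characters skipped)
scanner.peek(1)                 # => "=" (lookahead without consuming)
scanner.scan(/=/)               # => "="
scanner.eos?                    # => false (the string value is still ahead)
```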
In the previous installment, we defined the grammar for a very limited subset of TOML.
In that grammar the following terminal symbols were declared: 'UNQUOTED-KEY', 'EQUAL', 'STRING', and 'BOOLEAN'.
So these must be recognized by the tokenizer. First requirement.
As a second requirement, comments and non-significant whitespaces should obviously be left out.
Last but not least, the parser in the Rley gem expects from the tokenizer a sequence of Token or Literal objects:
- The Lexical::Literal class is a specialization of the Lexical::Token class. It embeds the additional attribute value. Literal objects must be used when a token carries a data value (e.g. a string literal) or has a variable form ("spelling").
- Lexical::Token objects correspond to lexical elements that have a fixed form (they consist of a fixed sequence of characters).
So for the following TOML input:
```
# TOML document
title = "TOML Example"
```
The tokenizer is expected to generate a sequence of tokens along these lines (the comment line and the whitespace are skipped):
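```
UNQUOTED-KEY  'title'          (a Literal object)
EQUAL         '='              (a Token object)
STRING        'TOML Example'   (a Literal object)
```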
As StringScanner uses regular expressions, it is convenient to recast the requirements in a tabular format:
| token | regex | action |
| --- | --- | --- |
| newline | `/(?:\r\n)\|\r\|\n/` | increment lineno; skip |
| whitespace | `/[ \t\f]+/` | skip |
| comment | `/#[^\r\n]*/` | skip |
| EQUAL | `/[=]/` | make a Token object |
| STRING | `/"[^"]*"/` | make a Literal object |
| BOOLEAN | `/true\|false/` | make a Literal object |
| UNQUOTED-KEY | `/[A-Za-z0-9\-_]+/` | make a Literal object |
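A remark on the UNQUOTED-KEY pattern: the hyphen inside the character class is escaped on purpose. Left unescaped between '9' and '_', Ruby would read it as the character range 0x39..0x5F, which wrongly covers characters such as '=' or '@'. A quick sanity check:

```ruby
PATT_UNQUOTED_KEY = /[A-Za-z0-9\-_]+/

PATT_UNQUOTED_KEY.match?('some-key_1') # => true
PATT_UNQUOTED_KEY.match?('=')          # => false, as desired
/[A-Za-z0-9-_]+/.match?('=')           # => true: the unescaped form is buggy
```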
Some of the above regexes were derived from the official TOML specification.
The source code can be found here.
The class implementation begins unsurprisingly as follows:
```ruby
require 'strscan'

class TOMLTokenizer
  PATT_BOOLEAN = /true|false/.freeze
  PATT_COMMENT = /#[^\r\n]*/.freeze
  PATT_NEWLINE = /(?:\r\n)|\r|\n/.freeze
  PATT_SINGLE_CHAR = /[=]/.freeze # Single delimiter or separator character
  PATT_UNQUOTED_KEY = /[A-Za-z0-9\-_]+/.freeze
  PATT_WHITESPACE = /[ \t\f]+/.freeze
```
After these regex constants, a number of instance variables are declared. They keep track of the StringScanner object and of the line and column numbers.
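The constructor itself isn't shown here; a minimal sketch of what it must set up could look like this (attribute names are assumptions, refer to the linked source for the real code):

```ruby
def initialize(source)
  @scanner = StringScanner.new(source) # wraps the input text
  @lineno = 1                          # current line number (1-based)
  @line_start = 0                      # scanner position of the current line's first character
end
```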
The core processing of a TOMLTokenizer object lies in the _next_token private method.
Let's dive into the initial part of the method's body:
```ruby
def _next_token
  token = nil
  until scanner.eos? || token
    nl_found = scanner.skip(PATT_NEWLINE)
    if nl_found
      next_line
      next
    end
```
A couple of comments here:
- The token variable records the next token found in the input; it will be returned to the parser.
- The until loop is necessary in order to process combinations of newlines, whitespaces and comments.
- When a newline is detected, we skip it, increment the line number and restart a loop iteration.
Now come the real nitty-gritty details:
```ruby
    token = begin
      next if scanner.skip(PATT_WHITESPACE) # Skip whitespaces

      curr_ch = scanner.peek(1)
      if curr_ch == '#'
        # Start of comment detected...
        scanner.skip(PATT_COMMENT) # Skip line comment
        next
      elsif curr_ch == '"'
        # Start of string detected...
        build_string_token
      elsif (lexeme = scanner.scan(PATT_SINGLE_CHAR))
        build_token(@@lexeme2name[curr_ch], lexeme)
      elsif (lexeme = scanner.scan(PATT_BOOLEAN))
        build_token('BOOLEAN', lexeme)
      elsif (lexeme = scanner.scan(PATT_UNQUOTED_KEY))
        build_token('UNQUOTED-KEY', lexeme)
      else # Unknown token
        col = scanner.pos - @line_start + 1
        _erroneous = curr_ch.nil? ? '' : scanner.scan(/./)
        raise ScanError, "Error: [line #{lineno}:#{col}]: Unexpected character."
      end
    end # begin
  end # until
```
All in all, the code is a relatively straightforward translation of the table from the requirements section.
Only the case of the EQUAL token requires a short explanation: the class variable @@lexeme2name refers to a Hash that translates the '=' character into the 'EQUAL' string. This may sound like overkill, but as TOML defines other single-character tokens, it makes the tokenizer ready for the coming extensions.
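For illustration, the Hash and the build_token helper used above could look roughly as follows. This is a sketch only: it assumes that Rley's Lexical::Token constructor accepts a lexeme, a terminal name and a Position object; check the linked source for the exact code.

```ruby
# Maps single-character lexemes to their terminal names;
# more entries will join '=' as the TOML subset grows.
@@lexeme2name = {
  '=' => 'EQUAL'
}.freeze

def build_token(aSymbolName, aLexeme)
  # Compute the (1-based) column where the lexeme begins.
  col = scanner.pos - aLexeme.size - @line_start + 1
  pos = Rley::Lexical::Position.new(lineno, col)
  Rley::Lexical::Token.new(aLexeme.dup, aSymbolName, pos)
end
```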
This concludes the tokenizer part; now it's time to create our parser.