
TUTO~1c: our first tokenizer

famished-tiger edited this page Feb 15, 2022 · 1 revision

Keep it separate...

Yes, I have to admit it: for years, I wrote tokenizers (lexers, scanners, ...) by hand.
In Ruby, this is a relatively easy task thanks to the standard StringScanner library (more on this later).
Therefore, I decided not to include lexical analysis functionality in the grammar DSL.

What? No built-in support for lexical rules in Rley?!!!
Yep.
This might come as a surprise to those who have used a PEG (Parsing Expression Grammar) library like Treetop, Parslet, Citrus, ...
But Rley is not the only kid in town that makes a strict delineation between tokenizing and parsing.
For instance, the most downloaded TOML library in Ruby is the tomlrb gem. Internally, it uses the Racc parsing library, and this parser generator assumes that the lexical analysis is performed elsewhere.
(By the way, the tokenizer -scanner- written in the tomlrb gem is very clean and demonstrates that there's nothing magical about writing your own tokenizer.)

While the lack of lexical analysis capability in RGN could be seen as a shortcoming, keeping tokenizing and parsing as separate components has a number of advantages:

  • Modularity,
  • No "whitespace noise" in grammar rules,
  • Scalability.

Modularity

Consider the following scenario: suppose TOML accepted additional string delimiters (say, something like the %q or %Q string literals in Ruby). With the tokenizer kept separate, one only has to change that component and leave the grammar and parser untouched.
In the same vein, imagine that TOML now accepted C-style block comments /* Some comment... */. Since the tokenizer is typically designed to discard comments and non-significant whitespace, this new comment style would affect neither the grammar nor the parser.

No "whitespace noise" issue

Remember the grammar rule from the official TOML documentation that specified the syntax of key-value pairs?

Here it is again:

;; Key-Value pairs

keyval = key keyval-sep val

One may wonder what's behind the keyval-sep symbol?
Well, the following rule:

 keyval-sep = ws %x3D ws ; =

The %x3D entry is ABNF notation for the character with ASCII code 0x3D, that is, the '=' sign.
But what's more interesting are the two 'ws' symbols: they refer to yet another rule that stands for "zero or more whitespace characters". This kind of noise also appears in most PEG parsers, because their "lexical" rules recognize whitespace and happily propagate it to higher-level rules. The result is a grammar littered with these noisy entries in many places.
Indeed, think of a programming language and ask yourself where whitespaces could occur in the source code.
Well, the answer is: almost everywhere. The situation is even worse if the language supports block comments, since those can occur in many places too...

Scalability

With its capability to digest any context-free language, Rley was designed with potential use in NLP (Natural Language Processing) in mind.
Given the sheer complexity of lexical analysis in that domain, it would be hopeless to combine the syntax rules with the ones from (morpho)lexical analysis.

Enough self-justification; let's get on the road to the tokenizer implementation.

The StringScanner library

The StringScanner class is part of the standard Ruby library. For unclear reasons, it is relatively unknown within the community. That's rather unfortunate, since it is very helpful when one has to analyze formatted text files. The irony is that communities of other programming languages, such as JavaScript and PHP, have implemented similar libraries explicitly modeled on StringScanner. Python people are missing it.

Besides the standard documentation, here are some intros on StringScanner:
https://blog.appsignal.com/2019/03/05/stringscanner.html
https://dev.to/appsignal/rubys-hidden-gems-stringscanner-44i1
https://runebook.dev/en/docs/ruby/stringscanner
https://www.geeksforgeeks.org/ruby-stringscanner-scan-function/
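To give a quick taste of the API, here is a minimal StringScanner session (an illustration of the library, not code from the tutorial's source):

```ruby
require 'strscan'

# Walk through a TOML-like line piece by piece.
scanner = StringScanner.new('title = "TOML Example"')
scanner.scan(/[A-Za-z0-9\-_]+/) # => "title" (returns the matched text)
scanner.skip(/[ \t]+/)          # consumes the blank; returns the number of chars skipped
scanner.scan(/=/)               # => "="
scanner.skip(/[ \t]+/)
scanner.scan(/"[^"]*"/)         # => "\"TOML Example\""
scanner.eos?                    # => true (the whole input was consumed)
```

The key idea: the scanner keeps a cursor position, and each `scan`/`skip` call tries its regex at that position only, advancing the cursor on success.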

Requirements for the tokenizer

In the previous installment, we defined the grammar for a very limited subset of TOML.
In that grammar the following terminal symbols were declared: 'UNQUOTED-KEY', 'EQUAL', 'STRING', and 'BOOLEAN'.
So these must be recognized by the tokenizer. First requirement.
As a second requirement, comments and non-significant whitespaces should obviously be left out.
Last but not least, the parser in the Rley gem expects the tokenizer to deliver a sequence of Token or Literal objects.

  • The Lexical::Literal class is a specialization of the Lexical::Token class. It adds a value attribute. Literal objects must be used when a token carries a data value (e.g. a string literal) or has a variable form ("spelling").
  • Lexical::Token objects correspond to lexical elements that have a fixed form (they consist of a fixed sequence of characters).
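To make the Token/Literal distinction concrete, here is a stripped-down sketch of the two classes (our own illustration of the idea, not Rley's actual class definitions or constructor signatures):

```ruby
# Illustrative sketch only -- not Rley's actual implementation.
# A Token is identified by its terminal name and its exact text (lexeme).
class Token
  attr_reader :lexeme, :terminal

  def initialize(lexeme, terminal)
    @lexeme = lexeme
    @terminal = terminal
  end
end

# A Literal additionally carries a (possibly converted) data value.
class Literal < Token
  attr_reader :value

  def initialize(value, lexeme, terminal)
    super(lexeme, terminal)
    @value = value
  end
end

# 'EQUAL' always reads '=', so a plain Token suffices...
equal = Token.new('=', 'EQUAL')
# ...whereas a string has a variable spelling and carries a value:
title = Literal.new('TOML Example', '"TOML Example"', 'STRING')
```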

So for the following TOML input:

  # TOML document

  title = "TOML Example"

The tokenizer is expected to generate a sequence of tokens like this (terminal name followed by lexeme):

  UNQUOTED-KEY 'title'
  EQUAL '='
  STRING 'TOML Example'

As StringScanner uses regular expressions it may be convenient to recast the requirements in a tabular format:

| token | regex | action |
| ------------ | ----------------------- | ----------------------- |
| newline | `/(?:\r\n)\|\r\|\n/` | increment lineno; skip |
| whitespace | `/[ \t\f]+/` | skip |
| comment | `/#[^\r\n]*/` | skip |
| EQUAL | `/[=]/` | make a Token object |
| STRING | `/"[^"]*"/` | make a Literal object |
| BOOLEAN | `/true\|false/` | make a Literal object |
| UNQUOTED-KEY | `/[A-Za-z0-9\-_]+/` | make a Literal object |

Some of the above regexes were derived from the official TOML specification.
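The token patterns from the table can be exercised directly; the following quick sanity check (our own snippet, not part of the tutorial's source) verifies that each pattern matches a representative lexeme in full:

```ruby
# Patterns copied from the requirements table.
patterns = {
  'EQUAL'        => /[=]/,
  'STRING'       => /"[^"]*"/,
  'BOOLEAN'      => /true|false/,
  'UNQUOTED-KEY' => /[A-Za-z0-9\-_]+/
}

samples = { 'EQUAL' => '=', 'STRING' => '"TOML Example"',
            'BOOLEAN' => 'true', 'UNQUOTED-KEY' => 'title' }

samples.each do |name, lexeme|
  # The match must cover the whole sample lexeme.
  raise "#{name} failed" unless lexeme.match(patterns[name])[0] == lexeme
end
```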

The TOMLTokenizer class

The source code can be found here.

The class implementation begins unsurprisingly as follows:

require 'strscan'

class TOMLTokenizer
  PATT_BOOLEAN = /true|false/.freeze
  PATT_COMMENT = /#[^\r\n]*/.freeze
  PATT_NEWLINE = /(?:\r\n)|\r|\n/.freeze
  PATT_SINGLE_CHAR = /[=]/.freeze # Single delimiter or separator character
  PATT_UNQUOTED_KEY = /[A-Za-z0-9\-_]+/.freeze
  PATT_WHITESPACE = /[ \t\f]+/.freeze

After these regex constants, a number of instance variables are declared. They are used for keeping track of the StringScanner object, line and column numbers.
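The constructor itself is not reproduced here; a plausible shape, assuming instance variables named @scanner, @lineno and @line_start (these names are guesses, chosen only to be consistent with the code excerpts below), would be:

```ruby
require 'strscan'

# Hypothetical sketch of the tokenizer's state management;
# not copied from the actual source.
class TOMLTokenizer
  attr_reader :scanner, :lineno, :line_start # readers added for illustration

  def initialize(source)
    @scanner = StringScanner.new(source)
    @lineno = 1       # current line number (1-based)
    @line_start = 0   # value of scanner.pos at the start of the current line
  end

  # Called each time a newline has just been consumed
  def next_line
    @lineno += 1
    @line_start = scanner.pos
  end
end
```

Keeping @line_start around makes column computation trivial: the column of the current token is `scanner.pos - @line_start + 1`, which is exactly the expression used in the error-reporting branch below.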

The core processing of a TOMLTokenizer object lies in the _next_token private method.
Let's dive into the initial part of the method's body:

  def _next_token
    token = nil

    until scanner.eos? || token
      nl_found = scanner.skip(PATT_NEWLINE)
      if nl_found
        next_line
        next
      end

A couple of comments here:

  • the token variable records the next token found in the input; it will be returned to the parser.
  • the until loop is necessary in order to process any combination of newlines, whitespaces and comments.
  • when a newline is detected, we skip it, increment the line number and restart a loop iteration.

Now come the real nitty-gritty details:

      token = begin
        next if scanner.skip(PATT_WHITESPACE) # Skip whitespaces

        curr_ch = scanner.peek(1)

        if curr_ch == '#'
          # Start of comment detected...
          scanner.skip(PATT_COMMENT) # Skip line comment
          next
        elsif curr_ch == '"'
          # Start of string detected...
          build_string_token
        elsif (lexeme = scanner.scan(PATT_SINGLE_CHAR))
          build_token(@@lexeme2name[curr_ch], lexeme)
        elsif (lexeme = scanner.scan(PATT_BOOLEAN))
          build_token('BOOLEAN', lexeme)
        elsif (lexeme = scanner.scan(PATT_UNQUOTED_KEY))
          build_token('UNQUOTED-KEY', lexeme)
        else # Unknown token
          col = scanner.pos - @line_start + 1
          _erroneous = curr_ch.nil? ? '' : scanner.scan(/./)
          raise ScanError, "Error: [line #{lineno}:#{col}]: Unexpected character."
        end
      end # begin
    end # until

All in all, the code is a relatively straightforward translation of the table in the requirements section.
Only the case of the EQUAL token requires a short explanation: the class variable @@lexeme2name refers to a Hash that translates the '=' character into the 'EQUAL' string. This may sound like overkill, but since TOML defines other single-character tokens, it makes the tokenizer ready for upcoming extensions.
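The presumable shape of that lookup table (only '=' is needed so far; the exact contents are a guess, but other TOML single-character tokens would slot in naturally):

```ruby
class TOMLTokenizer
  # Hypothetical reconstruction of the lookup table:
  # maps a single-character lexeme to its terminal name.
  @@lexeme2name = {
    '=' => 'EQUAL'
    # ',' => 'COMMA', '[' => 'LBRACKET', ... (future extensions)
  }.freeze

  # Accessor added for illustration only.
  def self.lexeme2name
    @@lexeme2name
  end
end
```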

This concludes the tokenizer part, now it's time to create our parser.