Skip to content

TUTO~1a: our first grammar

famished-tiger edited this page Feb 15, 2022 · 10 revisions

What's in the box?

“There are no solutions. There are only trade-offs.”
Thomas Sowell

In this first iteration, we'll learn how:

  • Create a grammar in Rley,
  • Write a first tokenizer (aka lexer, scanner); then,
  • Build a parser and visualize its results.

As this is quite some ground to cover, something has to yield. The trade-off is that we'll deal first with a very limited subset of TOML.

Our objective regarding TOML in this iteration will be quite modest: to be able to parse the (very) simple TOML document:

# This is a TOML document

title = "TOML Example"
enabled = true

In other words, we retain for the moment a couple of TOML features and leave the rest - ahem ... that's most of the language - out:
✅ Comments
✅ Key-value pairs (single)
✅ Boolean type
✅ Basic strings without escape chars
❌ Key-value pairs (dotted)
❌ Table sections (single and dotted)
❌ Integer type
❌ Float type
❌ String type (escaped, multi-line,...)
❌ Float type
❌ Simple Arrays
❌ Arrays: nested; multiline; of Different Types
❌ Inline Tables
❌ Array of Tables
❌ Offset Date-Time, Local: Date-Time; Date; Time

Before we start...

... The source code for this first iteration is located here.

Initial grammar

On paper, one could build a Rley grammar by hand with provided classes like Grammar, Production, NonTerminal and Terminal.
In practice, it is much more convenient to rely on the Builder design pattern to create our first grammar.
With that option in the pocket, the grammar construction becomes essentially a two-step process:

  1. Create a builder object with the Rley::grammar_builder method and provide it a block code that specifies the grammar to generate...
  2. ... then ask the builder object to generate the desired grammar.

So here's our first TOML grammar (construction):

require 'rley'

builder = Rley::grammar_builder do
  # Define first the terminal symbols...
  add_terminals('UNQUOTED-KEY', 'EQUAL', 'STRING')
  add_terminals('BOOLEAN')

  # ... then with syntax rules
  # First found rule is considered to be the top-level rule
  rule('toml' => 'expr-list')
  rule('expr-list' => 'expr-list expression')
  rule('expr-list' => '')
  rule('expression' => 'keyval')
  rule('keyval' => 'key EQUAL val')
  rule('key' => 'UNQUOTED-KEY')
  rule('val' => 'STRING')
  rule('val' => 'BOOLEAN')
end

# Let's build a TOML grammar object
TOMLGrammar = builder.grammar.freeze

The DSL in the block relies on two principles:

  1. Terminal symbols are declared first with the add_terminals method before one specifies syntax rules;
  2. The first rule is considered as the main (i.e. top-level) rule.

Side note: we gave the DSL a pompuous name: the Rley Grammar Notation, RGN in short.

Declaring terminal symbols

Confusion Alert: Do not confuse the phrase "(non)terminal symbol" with a Ruby symbol!
Here, the terminology comes from the formal language theory and a symbol is a basic unit appearing in grammar rules.

The add_terminals method can take a variable number of arguments.
So, the following line code:

  add_terminals('UNQUOTED-KEY', 'EQUAL', 'STRING')

could be equally be rewritten like this:

  add_terminals('UNQUOTED-KEY')
  add_terminals('EQUAL')
  add_terminals('STRING')

Good to know:

  • The arguments of add_terminals method are String objects, each one being a name of a terminal grammar symbol.
  • Terminal symbols can be declared in any order (i.e. Rley doesn't impose some ordering)
  • Terminal symbols can be any combination of upper or lower case characters. In the above example, we took the convention to write the terminal symbols with capital letters. There is nothing compulsory there, the purpose of this convention is to create a visual cue that helps us to distinguish between terminals and non-terminals. Feel free to follow any other style. Note however, like Ruby, Rley is case sensitive.
  • Any symbol that occurs in a grammar rule and that was not declared previously with add_terminals, is categorized as a non-terminal.
  • I tried to use as much as possible the symbol names from the official TOML specification.

Specifying grammar rules

Grammar/syntax rules are declared with the rule method which takes a Hash, and more precisely, a key-value pair.
The convention to use here is that:

  • The key (a String) represents the head of a production rule,
  • The value (a String) represents the body of that production rule. It consists of a (possibly empty) sequence of grammar symbols.

To make things clearer, let's take an example. In the TOML grammar from the official specification, there is a rule that determines the syntax of a TOML key/value pair:

;; Key-Value pairs

keyval = key keyval-sep val

Note: The official documentation uses the ABNF format defined in RFC 5234.
How would that same rule be recast in RGN (Rley Grammar Notation)?

  # Key-Value pairs defined in RGN
  rule('keyval' => 'key keyval-sep val')

In fact, that rule is present in our grammar (with one symbol name difference):

  rule('keyval' => 'key EQUAL val')

That rule can be translated informally in English as follows:
The syntax of a key/value pair in TOML consists of a key followed by an equal sign followed by a value.

What, no or-bar (|) !!!?

Looking back to the two last syntax rules from our grammar:

  rule('val' => 'STRING')
  rule('val' => 'BOOLEAN')

one detects that they have the same head ('val').
This is the way in RGN to specify alternative rules. Here, for instance, the two rules can be read as "a value can be a basic string or a boolean".
In other words, RGN doesn't have the or-bar (|) shorthand notation.
This results in less compact, more verbose grammar specifications than for other parsing tools.
But as we will learn later, there are very good reasons to stick to a simple, uniform syntax for rule declarations.

Recursive grammar rules

The alert reader had surely spotted another occurrence of two OR-ed rules:

  rule('expr-list' => 'expr-list expression')
  rule('expr-list' => '')

That's correct. What makes these rules interesting are the following facts:

  • The first rule is a left-recursive rule: the symbol being defined (expr-list) appears in the rule body.
  • The second rule indicates that it succeeds without consuming an input token.

These two rules form a frequent pattern used to implement a repetition of zero or more elements.
Therefore, they can be translated in plain English as "an expression list consists of zero or more expressions".
As we will discover later, RGN provides a convenient shorthand notation for repetition.

Recap

To create a grammar, use the Rley::grammar_builder method. It takes a block code as argument and returns a GrammarBuilder object.

builder = Rley::grammar_builder do
  add_terminals('BOOLEAN', ...)
  ...

  rule('toml' => 'expr-list')
  ...
end

Use the RGN (Rley Grammar Notation) DSL inside the block code.
Basically, RGN consists of two methods: add_terminals and rule:

  1. add_terminals takes one or more String arguments. Each argument represents the name of a terminal symbol.
  2. rule accepts a key/value pair. The key (a String) is the head of the rule (i.e. the symbol being defined) and the value is a String consisting of a sequence of grammar symbols.

In addition, RGN relies on two conventions:

  1. Terminal symbols are declared first with add_terminals before any call to the rule method.
  2. The first rule is the top-level (main) rule.

Once, the GrammarBuilder object is instantiated, it can be used to create, in turn, the grammar (object):

  my_grammar = builder.grammar # Returns a Rley::Syntax::Grammar object