-
Notifications
You must be signed in to change notification settings - Fork 3
TUTO~1a: our first grammar
“There are no solutions. There are only trade-offs.”
Thomas Sowell
In this first iteration, we'll learn how:
- Create a grammar in Rley,
- Write a first tokenizer (aka lexer, scanner); then,
- Build a parser and visualize its results.
As this is quite some ground to cover, something has to yield. The trade-off is that we'll deal first with a very limited subset of TOML.
Our objective regarding TOML in this iteration will be quite modest: to be able to parse the (very) simple TOML document:
# This is a TOML document
title = "TOML Example"
enabled = true
In other words, we retain for the moment a couple of TOML features and leave the rest - ahem ... that's most of the language - out:
✅ Comments
✅ Key-value pairs (single)
✅ Boolean type
✅ Basic strings without escape chars
❌ Key-value pairs (dotted)
❌ Table sections (single and dotted)
❌ Integer type
❌ Float type
❌ String type (escaped, multi-line,...)
❌ Float type
❌ Simple Arrays
❌ Arrays: nested; multiline; of Different Types
❌ Inline Tables
❌ Array of Tables
❌ Offset Date-Time, Local: Date-Time; Date; Time
... The source code for this first iteration is located here.
On paper, one could build a Rley
grammar by hand with provided classes like Grammar
, Production
, NonTerminal
and Terminal
.
In practice, it is much more convenient to rely on the Builder design pattern to create our first grammar.
With that option in the pocket, the grammar construction becomes essentially a two-step process:
- Create a builder object with the
Rley::grammar_builder
method and provide it a block code that specifies the grammar to generate... - ... then ask the builder object to generate the desired grammar.
So here's our first TOML grammar (construction):
require 'rley'
builder = Rley::grammar_builder do
# Define first the terminal symbols...
add_terminals('UNQUOTED-KEY', 'EQUAL', 'STRING')
add_terminals('BOOLEAN')
# ... then with syntax rules
# First found rule is considered to be the top-level rule
rule('toml' => 'expr-list')
rule('expr-list' => 'expr-list expression')
rule('expr-list' => '')
rule('expression' => 'keyval')
rule('keyval' => 'key EQUAL val')
rule('key' => 'UNQUOTED-KEY')
rule('val' => 'STRING')
rule('val' => 'BOOLEAN')
end
# Let's build a TOML grammar object
TOMLGrammar = builder.grammar.freeze
The DSL in the block relies on two principles:
- Terminal symbols are declared first with the
add_terminals
method before one specifies syntax rules; - The first rule is considered as the main (i.e. top-level) rule.
Side note: we gave the DSL a pompuous name: the Rley Grammar Notation, RGN in short.
Confusion Alert: Do not confuse the phrase "(non)terminal symbol" with a Ruby symbol!
Here, the terminology comes from the formal language theory and a symbol is a basic unit appearing in grammar rules.
The add_terminals
method can take a variable number of arguments.
So, the following line code:
add_terminals('UNQUOTED-KEY', 'EQUAL', 'STRING')
could be equally be rewritten like this:
add_terminals('UNQUOTED-KEY')
add_terminals('EQUAL')
add_terminals('STRING')
Good to know:
- The arguments of
add_terminals
method are String objects, each one being a name of a terminal grammar symbol. - Terminal symbols can be declared in any order (i.e. Rley doesn't impose some ordering)
- Terminal symbols can be any combination of upper or lower case characters. In the above example, we took the convention to write the terminal symbols with capital letters. There is nothing compulsory there, the purpose of this convention is to create a visual cue that helps us to distinguish between terminals and non-terminals. Feel free to follow any other style. Note however, like Ruby, Rley is case sensitive.
- Any symbol that occurs in a grammar rule and that was not declared previously with
add_terminals
, is categorized as a non-terminal. - I tried to use as much as possible the symbol names from the official TOML specification.
Grammar/syntax rules are declared with the rule
method which takes a Hash, and more precisely, a key-value pair.
The convention to use here is that:
- The key (a String) represents the head of a production rule,
- The value (a String) represents the body of that production rule. It consists of a (possibly empty) sequence of grammar symbols.
To make things clearer, let's take an example. In the TOML grammar from the official specification, there is a rule that determines the syntax of a TOML key/value pair:
;; Key-Value pairs
keyval = key keyval-sep val
Note: The official documentation uses the ABNF format defined in RFC 5234.
How would that same rule be recast in RGN (Rley Grammar Notation)?
# Key-Value pairs defined in RGN
rule('keyval' => 'key keyval-sep val')
In fact, that rule is present in our grammar (with one symbol name difference):
rule('keyval' => 'key EQUAL val')
That rule can be translated informally in English as follows:
The syntax of a key/value pair in TOML consists of a key followed by an equal sign followed by a value.
Looking back to the two last syntax rules from our grammar:
rule('val' => 'STRING')
rule('val' => 'BOOLEAN')
one detects that they have the same head ('val').
This is the way in RGN to specify alternative rules. Here, for instance, the two rules can be read as "a value can be a basic string or a boolean".
In other words, RGN doesn't have the or-bar (|) shorthand notation.
This results in less compact, more verbose grammar specifications than for other parsing tools.
But as we will learn later, there are very good reasons to stick to a simple, uniform syntax for rule declarations.
The alert reader had surely spotted another occurrence of two OR-ed rules:
rule('expr-list' => 'expr-list expression')
rule('expr-list' => '')
That's correct. What makes these rules interesting are the following facts:
- The first rule is a left-recursive rule: the symbol being defined (expr-list) appears in the rule body.
- The second rule indicates that it succeeds without consuming an input token.
These two rules form a frequent pattern used to implement a repetition of zero or more elements.
Therefore, they can be translated in plain English as "an expression list consists of zero or more expressions".
As we will discover later, RGN provides a convenient shorthand notation for repetition.
To create a grammar, use the Rley::grammar_builder
method. It takes a block code as argument and returns a GrammarBuilder
object.
builder = Rley::grammar_builder do
add_terminals('BOOLEAN', ...)
...
rule('toml' => 'expr-list')
...
end
Use the RGN (Rley Grammar Notation) DSL inside the block code.
Basically, RGN consists of two methods: add_terminals
and rule
:
-
add_terminals
takes one or more String arguments. Each argument represents the name of a terminal symbol. -
rule
accepts a key/value pair. Thekey
(a String) is the head of the rule (i.e. the symbol being defined) and thevalue
is a String consisting of a sequence of grammar symbols.
In addition, RGN relies on two conventions:
- Terminal symbols are declared first with
add_terminals
before any call to therule
method. - The first rule is the top-level (main) rule.
Once, the GrammarBuilder
object is instantiated, it can be used to create, in turn, the grammar (object):
my_grammar = builder.grammar # Returns a Rley::Syntax::Grammar object