-
Notifications
You must be signed in to change notification settings - Fork 3
TUTO~1d: our first parser
Don't tell anyone. Contrary to what you might think, the Parser::GFGEarleyParser
class isn't a parser!
It implements the Earley parsing algorithm which is basically a recognizer. A recognizer, tells you whether or not a given input is syntactically valid.
(By the way, most introductions on Earley parsing on the Internet or in compiler textbooks describe the recognizer part only).
A parser goes a step beyond, not only it checks whether the input complies to the grammar rules, it builds a useful data representation of the parse result (mostly a parse tree).
Here is the link to the source file of a demo recognizer. It reads as follows:
require 'rley'
require_relative 'toml_grammar'
require_relative 'toml_tokenizer'
# Sample TOML document to parse
toml_doc = <<-TOML
# This is a TOML document
title = "TOML Example"
enabled = true
TOML
recognizer = Rley::Parser::GFGEarleyParser.new(TOMLGrammar)
tokenizer = TOMLTokenizer.new(toml_doc)
result = recognizer.parse(tokenizer.tokens)
p result.success?
If you run the script in the same directory where the files toml_grammar.rb
and toml_tokenizer.rb
reside, you'll get a minimal output:
$ ruby recognizer_demo.rb
$ true
For the curious, the result
variable in the above script is assigned a Parser::GFGParsing
instance.
That, in turn, contains a Parser::GFGChart
object; such a chart object can become very complex for long input and its manipulation requires a deep understanding of the Earley algorithm.
What we need to get a real parser is a component that translates the chart object into a parse tree.
This is a responsibility of the class Rley::Engine
: it acts as a facade interface that simplifies the use of the lower-level Rley classes.
So we are ready to implement the parser.
The TOMLParser
class defines two methods only.
First, the constructor:
def initialize
# Create a Rley facade object
@engine = Rley::Engine.new do |cfg|
cfg.diagnose = true
end
# Step 1. Load Lox grammar
@engine.use_grammar(TOMLGrammar)
end
It merely initializes an Engine
object, then it tells which grammar to use...
The second method, does the much demanded parsing:
def parse(source)
tokenizer = TOMLTokenizer.new(source)
result = engine.parse(tokenizer.tokens)
unless result.success?
# Stop if the parse failed...
line1 = "Parsing failed\n"
line2 = "Reason: #{result.failure_reason.message}"
raise SyntaxError, line1 + line2
end
engine.convert(result)
end
Except for the error handling, the method bears some resemblance with the demo script of the recognizer.
But the real magic occurs with the call engine.convert(result)
: this is the step where the GFGParsing
object is translated into a parse tree.
The file toml_demo.rb
is little script that visualizes the generated parse tree:
require_relative 'toml_parser'
# Sample TOML document to parse
toml_doc = <<-TOML
# This is a TOML document
title = "TOML Example"
enabled = true
TOML
parser = TOMLParser.new
ptree = parser.parse(toml_doc)
# Let's create a parse tree visitor
visitor = parser.engine.ptree_visitor(ptree)
# Let's create a formatter that will render the parse tree with characters
renderer = Rley::Formatter::Asciitree.new($stdout)
# Subscribe the formatter to the visitor's event and launch the visit
renderer.render(visitor)
The last three expressions in the script are all that is needed to visualize the parse tree:
- One creates a parse tree visitor;
- A formatter object is told to emit its ASCII rendering output to
stdout
; - The formatter generates the output as a side-effect of the visit of the parse tree by the visitor.
This time, the output should look as follows:
toml
+-- expr-list
+-- expr-list
| +-- expr-list
| +-- expression
| +-- keyval
| +-- key
| | +-- UNQUOTED-KEY: 'title'
| +-- EQUAL: '='
| +-- val
| +-- STRING: '"TOML Example"'
+-- expression
+-- keyval
+-- key
| +-- UNQUOTED-KEY: 'enabled'
+-- EQUAL: '='
+-- val
+-- BOOLEAN: 'true'
The parse trees that Rley generates by default (i.e. without customization) is called a Concrete Syntax Tree (CST).
A CST is a tree-like data structure. It is a representation of the input string structured along the grammar rules used by the parser during its syntactic analysis.
The following class diagram displays the classes with their relationships that are used in implementing CSTs.
In plain English:
- The
ParseTree
class acts as a facade object that merely points to the root node of the CST parse tree. - The
TerminalNode
class represents the leaf nodes of the tree. As its name implies, this kind of node is associated with terminal symbols - The
NonTerminalNode
class represents the parent nodes of the tree. This kind of node is associated with non-terminal symbols. The child nodes - subnodes - correspond to the right-hand symbols of the rule used to parse the parent node. - Both
TerminalNode
andNonTerminalNode
are both subclasses of theParseTreeNode
class. - All the displayed classes have an
accept
method that is used to implement the Visitor design pattern.
After a successful parse, Rley walks over the raw parse results embodied in a Parser::GFGParsing
object.
The "walking" proceeds backwards from the success end state and examines the predecessors of that state, and so on...
While walking from one entry to another, a walk event is emitted and delivered to one (or more) subscribers (Publisher-Subscriber pattern).
One such subscriber - selected by default by Rley - is a ParseRep::CSTBuilder
instance.
The mission of this object is to construct the Concrete Syntax Tree and the construction process is explained now.
As mentioned in the previous paragraph, Rley examines the raw parse results backwards, that is, from the end state.
This rearward walking direction has two implications for the CSTBuilder
object:
- The right-hand side of production rules are processed first, then the left-hand side;
- The members of a right-hand side are visited from right-to-left.
The CST construction algorithm can be outlined with the following pseudo-code:
# Here starts the CST construction
# Assumption: variable `top_rule` refers to the top-rule of the grammar
subnodes := process_rhs_of(top_rule)
root := NonTerminalNode.new(top_rule.lhs, subnodes)
# Given a production rule, convert each member from right-hand side to a CST node...
# ... and return an array of CST nodes
function process_rhs_of(rule)
result := []
for_each member from rule.rhs do
if member corresponds to a terminal symbol
node := TerminalNode.new(member.symbol, member.token)
else # ... parse entry corresponding to a non-terminal symbol
subnodes := process_rhs_of(element.production) # Recursive call
node := NonTerminalNode.new(element.production.lhs, subnodes)
end
result := prepend node to result # 'prepend' because we walk right-to-left
end for
return result
end function
For the sake of conciseness, we'll trim down our TOML grammar to a tiny subset:
builder = Rley::grammar_builder do
add_terminals('UNQUOTED-KEY', 'EQUAL', 'STRING')
rule 'keyval' => 'key EQUAL val'
rule 'key' => 'UNQUOTED-KEY'
rule 'val' => 'STRING'
end
...
Let's assume that the parsed input string is:
title = "TOML Example"
The top-level rule in use by the algorithm is:
rule 'keyval' => 'key EQUAL val'
Inside the first call to process_rhs_of()
, one iterates over the members of the rhs of the rule (that is, key EQUAL val
).
The rightmost member corresponds to the non-terminal symbol val
.
Looking at the pseudo-code, one sees that a call to process_rhs_of(rule val => 'STRING')
is initiated.
Let's dive into that nested call:
- An iteration loop over the right-hand side members begins. There is only one member:
STRING
, a terminal symbol. - Therefore the function
process_rhs_of
returns a single element array, aTerminalNode
that refers to theSTRING
symbol and has the string value "TOML Example".
We're back inside the first call to process_rhs_of()
.\
The next step of the algorithm is to construct a NonTerminalNode for the val
symbol with the STRING terminal as its sole sub-node.
SO, the following CST-subtree input in the result
array:
The iteration loop over the right-side members proceeds further: it reaches the symbol EQUAL
, which is a terminal.
Hence a TerminalNode
is created.
The last (= leftmost) member is the key
symbol, a non-terminal. Looking back to the tiny grammar, one sees that we have a similar situation than the processing of the val
symbol.
In other words, after the loop, the result
array contains the three following elements:
Finally, the last step: a NonTerminalNode is created for the keyval
symbol and the elements of the result array are treated as its subnodes.
The resulting CST is depicted below:
On the positive side:
- Rley constructs automatically Concrete Syntax Trees
- Limited number of classes: CSTs consist essentialy of
TerminalNode
andNonTerminalNode
instances. - Out-of-the box parse tree visualization
On the negative side:
- No customization of generated CSTs
- CSTs follow very closely the grammar structure. This leads to deeply nested tree structure that is also brittle against grammar changes.
So, what are the CST good for? I use them in early phases of a parser implementation for testing purposes.
This concludes the first iteration. We learned how create a grammar (of a small subset of TOML), a tokenizer (written by hand), then a parser.
In this installment, we learned about Concreted Syntax Trees. Rley is capable to generate them automatically from the parse results.
In the next iteration, we will expand our parser to cover more features of TOML language.