Skip to content

TUTO~1d: our first parser

famished-tiger edited this page Feb 15, 2022 · 1 revision

From recognizer to parser

It's a recognizer!

Don't tell anyone. Contrary to what you might think, the Parser::GFGEarleyParser class isn't a parser!
It implements the Earley parsing algorithm which is basically a recognizer. A recognizer, tells you whether or not a given input is syntactically valid.
(By the way, most introductions on Earley parsing on the Internet or in compiler textbooks describe the recognizer part only).
A parser goes a step beyond, not only it checks whether the input complies to the grammar rules, it builds a useful data representation of the parse result (mostly a parse tree).

Running the recognizer

Here is the link to the source file of a demo recognizer. It reads as follows:

require 'rley'
require_relative 'toml_grammar'
require_relative 'toml_tokenizer'

# Sample TOML document to parse
toml_doc = <<-TOML
  # This is a TOML document

  title = "TOML Example"
  enabled = true
TOML

recognizer = Rley::Parser::GFGEarleyParser.new(TOMLGrammar)
tokenizer = TOMLTokenizer.new(toml_doc)
result = recognizer.parse(tokenizer.tokens)
p result.success?

If you run the script in the same directory where the files toml_grammar.rb and toml_tokenizer.rb reside, you'll get a minimal output:

$ ruby recognizer_demo.rb
$ true

For the curious, the result variable in the above script is assigned a Parser::GFGParsing instance.
That, in turn, contains a Parser::GFGChart object; such a chart object can become very complex for long input and its manipulation requires a deep understanding of the Earley algorithm.

What we need to get a real parser is a component that translates the chart object into a parse tree.
This is a responsibility of the class Rley::Engine: it acts as a facade interface that simplifies the use of the lower-level Rley classes.
So we are ready to implement the parser.

At last, a parser

The TOMLParser class defines two methods only.

First, the constructor:

  def initialize
    # Create a Rley facade object
    @engine = Rley::Engine.new do |cfg|
      cfg.diagnose = true
    end

    # Step 1. Load Lox grammar
    @engine.use_grammar(TOMLGrammar)
  end

It merely initializes an Engine object, then it tells which grammar to use...

The second method, does the much demanded parsing:

  def parse(source)
    tokenizer = TOMLTokenizer.new(source)
    result = engine.parse(tokenizer.tokens)

    unless result.success?
      # Stop if the parse failed...
      line1 = "Parsing failed\n"
      line2 = "Reason: #{result.failure_reason.message}"
      raise SyntaxError, line1 + line2
    end

    engine.convert(result)
  end

Except for the error handling, the method bears some resemblance with the demo script of the recognizer.
But the real magic occurs with the call engine.convert(result): this is the step where the GFGParsing object is translated into a parse tree.

Running the parser

The file toml_demo.rb is little script that visualizes the generated parse tree:

require_relative 'toml_parser'

# Sample TOML document to parse
toml_doc = <<-TOML
# This is a TOML document

title = "TOML Example"
enabled = true
TOML

parser = TOMLParser.new
ptree = parser.parse(toml_doc)

# Let's create a parse tree visitor
visitor = parser.engine.ptree_visitor(ptree)

# Let's create a formatter that will render the parse tree with characters
renderer = Rley::Formatter::Asciitree.new($stdout)

# Subscribe the formatter to the visitor's event and launch the visit
renderer.render(visitor)

The last three expressions in the script are all that is needed to visualize the parse tree:

  1. One creates a parse tree visitor;
  2. A formatter object is told to emit its ASCII rendering output to stdout;
  3. The formatter generates the output as a side-effect of the visit of the parse tree by the visitor.

This time, the output should look as follows:

toml
+-- expr-list
    +-- expr-list
    |   +-- expr-list
    |   +-- expression
    |       +-- keyval
    |           +-- key
    |           |   +-- UNQUOTED-KEY: 'title'
    |           +-- EQUAL: '='
    |           +-- val
    |               +-- STRING: '"TOML Example"'
    +-- expression
        +-- keyval
            +-- key
            |   +-- UNQUOTED-KEY: 'enabled'
            +-- EQUAL: '='
            +-- val
                +-- BOOLEAN: 'true'

The parse trees that Rley generates by default (i.e. without customization) is called a Concrete Syntax Tree (CST).

About CSTs

A CST is a tree-like data structure. It is a representation of the input string structured along the grammar rules used by the parser during its syntactic analysis.

The key CST ingredients

The following class diagram displays the classes with their relationships that are used in implementing CSTs.

In plain English:

  • The ParseTree class acts as a facade object that merely points to the root node of the CST parse tree.
  • The TerminalNode class represents the leaf nodes of the tree. As its name implies, this kind of node is associated with terminal symbols
  • The NonTerminalNode class represents the parent nodes of the tree. This kind of node is associated with non-terminal symbols. The child nodes - subnodes - correspond to the right-hand symbols of the rule used to parse the parent node.
  • Both TerminalNode and NonTerminalNode are both subclasses of the ParseTreeNode class.
  • All the displayed classes have an accept method that is used to implement the Visitor design pattern.

How does Rley build CSTs ?

After a successful parse, Rley walks over the raw parse results embodied in a Parser::GFGParsing object.
The "walking" proceeds backwards from the success end state and examines the predecessors of that state, and so on...
While walking from one entry to another, a walk event is emitted and delivered to one (or more) subscribers (Publisher-Subscriber pattern).
One such subscriber - selected by default by Rley - is a ParseRep::CSTBuilder instance.
The mission of this object is to construct the Concrete Syntax Tree and the construction process is explained now.

Outlining the CST construction process

As mentioned in the previous paragraph, Rley examines the raw parse results backwards, that is, from the end state.
This rearward walking direction has two implications for the CSTBuilder object:

  1. The right-hand side of production rules are processed first, then the left-hand side;
  2. The members of a right-hand side are visited from right-to-left.

The CST construction algorithm can be outlined with the following pseudo-code:

  # Here starts the CST construction
  # Assumption: variable `top_rule` refers to the top-rule of the grammar
  subnodes := process_rhs_of(top_rule)
  root := NonTerminalNode.new(top_rule.lhs, subnodes)



  # Given a production rule, convert each member from right-hand side to a CST node...
  # ... and return an array of CST nodes
  function process_rhs_of(rule)
    result := []
    for_each member from rule.rhs do
      if member corresponds to a terminal symbol
        node := TerminalNode.new(member.symbol, member.token)
      else # ... parse entry corresponding to a non-terminal symbol
        subnodes := process_rhs_of(element.production) # Recursive call
        node := NonTerminalNode.new(element.production.lhs, subnodes)
      end
      result := prepend node to result # 'prepend' because we walk right-to-left
    end for

    return result
  end function

CST construction: an example

For the sake of conciseness, we'll trim down our TOML grammar to a tiny subset:

builder = Rley::grammar_builder do
  add_terminals('UNQUOTED-KEY', 'EQUAL', 'STRING')

  rule 'keyval' => 'key EQUAL val'
  rule 'key' => 'UNQUOTED-KEY'
  rule 'val' => 'STRING'
end
...

Let's assume that the parsed input string is:

  title = "TOML Example"

The top-level rule in use by the algorithm is:

  rule 'keyval' => 'key EQUAL val'

Inside the first call to process_rhs_of(), one iterates over the members of the rhs of the rule (that is, key EQUAL val).
The rightmost member corresponds to the non-terminal symbol val.
Looking at the pseudo-code, one sees that a call to process_rhs_of(rule val => 'STRING') is initiated.
Let's dive into that nested call:

  • An iteration loop over the right-hand side members begins. There is only one member: STRING, a terminal symbol.
  • Therefore the function process_rhs_of returns a single element array, a TerminalNode that refers to the STRING symbol and has the string value "TOML Example".

We're back inside the first call to process_rhs_of().\ The next step of the algorithm is to construct a NonTerminalNode for the val symbol with the STRING terminal as its sole sub-node. SO, the following CST-subtree input in the result array:

The iteration loop over the right-side members proceeds further: it reaches the symbol EQUAL, which is a terminal.
Hence a TerminalNode is created. The last (= leftmost) member is the key symbol, a non-terminal. Looking back to the tiny grammar, one sees that we have a similar situation than the processing of the val symbol.
In other words, after the loop, the result array contains the three following elements:

Finally, the last step: a NonTerminalNode is created for the keyval symbol and the elements of the result array are treated as its subnodes.
The resulting CST is depicted below:

Pros and cons of using CST

On the positive side:

  • Rley constructs automatically Concrete Syntax Trees
  • Limited number of classes: CSTs consist essentialy of TerminalNode and NonTerminalNode instances.
  • Out-of-the box parse tree visualization

On the negative side:

  • No customization of generated CSTs
  • CSTs follow very closely the grammar structure. This leads to deeply nested tree structure that is also brittle against grammar changes.

So, what are the CST good for? I use them in early phases of a parser implementation for testing purposes.

What's next, doctor?

This concludes the first iteration. We learned how create a grammar (of a small subset of TOML), a tokenizer (written by hand), then a parser.
In this installment, we learned about Concreted Syntax Trees. Rley is capable to generate them automatically from the parse results.
In the next iteration, we will expand our parser to cover more features of TOML language.