Skip to content

TUTO~4b: Ingredients for an AST

famished-tiger edited this page Mar 3, 2022 · 13 revisions

Abstraction is selective ignorance
(Andrew Koening, Barbara Moo: "Ruminations on C++";
Addison-Wesley, IBSN: 978-0-201-42339-1, page 9)

A Concrete Syntax Tree (CST) example

If one jumps to the iteration 3 folder, and tries from there the command-line:

  ruby toml_demo.rb

The script emits a lengthy tree representation of a CST (aka "parse tree"). Here we limit to the 40 first lines:

toml
+-- expr-list
    +-- expr-list
    |   +-- expr-list
    |   |   +-- expr-list
    |   |   |   +-- expr-list
    |   |   |   |   +-- expr-list
    |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   +-- expr-list
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   +-- expression
    |   |   |   |   |   |   |   |   |   |   |   |   |   |       +-- keyval
    |   |   |   |   |   |   |   |   |   |   |   |   |   |           +-- key
    |   |   |   |   |   |   |   |   |   |   |   |   |   |           |   +-- simple-key
    |   |   |   |   |   |   |   |   |   |   |   |   |   |           |       +-- UNQUOTED-KEY: 'title'
    |   |   |   |   |   |   |   |   |   |   |   |   |   |           +-- EQUAL: '='
    |   |   |   |   |   |   |   |   |   |   |   |   |   |           +-- val
    |   |   |   |   |   |   |   |   |   |   |   |   |   |               +-- STRING: '"TOML Example"'
    |   |   |   |   |   |   |   |   |   |   |   |   |   +-- expression
    |   |   |   |   |   |   |   |   |   |   |   |   |       +-- table
    |   |   |   |   |   |   |   |   |   |   |   |   |           +-- std-table
    |   |   |   |   |   |   |   |   |   |   |   |   |               +-- LBRACKET: '['
    |   |   |   |   |   |   |   |   |   |   |   |   |               +-- key
    |   |   |   |   |   |   |   |   |   |   |   |   |               |   +-- simple-key
    |   |   |   |   |   |   |   |   |   |   |   |   |               |       +-- UNQUOTED-KEY: 'owner'
    |   |   |   |   |   |   |   |   |   |   |   |   |               +-- RBRACKET: ']'
    |   |   |   |   |   |   |   |   |   |   |   |   +-- expression
    |   |   |   |   |   |   |   |   |   |   |   |       +-- keyval
    |   |   |   |   |   |   |   |   |   |   |   |           +-- key
    |   |   |   |   |   |   |   |   |   |   |   |           |   +-- simple-key
    |   |   |   |   |   |   |   |   |   |   |   |           |       +-- UNQUOTED-KEY: 'name'
    |   |   |   |   |   |   |   |   |   |   |   |           +-- EQUAL: '='
    |   |   |   |   |   |   |   |   |   |   |   |           +-- val
    |   |   |   |   |   |   |   |   |   |   |   |               +-- STRING: '"Tom Preston-Werner"'

It doesn't take long to grasp the following details:

  1. The root of the tree is the start symbol (here: toml)
  2. To every left-hand side of a rule, a parent node (i.e. a Rley::PTree::NonTerminalNode) is constructed
  3. To each right-side member a child node is created
  4. A child node that corresponds to a non-terminal symbol gets its own children by applying the rule
  5. The tree is built from the application of the syntax rules required to match the input string to the start symbol.
  6. A child node that corresponds to a terminal symbol is a leaf node and is an Rley::PTree::TerminalNode instance.

The advantage, of CST (parse trees) is that they can automatically be constructed.
On the downside, these tend to be lengthy and deeply nested, making them unwieldy for subsequent processing.
Furthermore, being derived directly from the rules, they are brittle to grammar changes.

On AST

An abstract syntax tree, is a custom tree data structure aimed to represent the parse results in a convenient way for the semantic analysis phase.
Semantic analysis covers topics such as type checking, name resolution, and other prepatory steps before the code generation (or execution).

To simplify matters, an AST could be built from a parse tree (CST) with the following changes:

  1. Remove nodes corresponding to separator and delimiter tokens
  2. Replace node nesting due to recursive rules by arrays
  3. Remove unnecessary intermediate nodes
  4. Use custom tree nodes to represent key features of the language being parsed

A CST to AST example

Let's zoom-in a part of the previous parse tree:

+-- expression
    +-- keyval
        +-- key
        |   +-- simple-key
        |       +-- UNQUOTED-KEY: 'title'
        +-- EQUAL: '='
        +-- val
            +-- STRING: '"TOML Example"'

Step: Remove nodes corresponding to separator and delimiter tokens

The node corresponding to the equal sign (=) is useless: the equal sign was used to separate the key from its associated value in a key-value pair. But since both sides of the equal sign have their own node, there is no more need for a separator node.

Step: Remove unnecessary intermediate nodes

The node labelled simple-key can be dropped because it is redundant: an UNQUOTED-KEY terminal is one type of simple key.

Thus our previous partial parse tree can be simplified to something like:

+-- expression
    +-- keyval
        +-- key
        |   +-- UNQUOTED-KEY: 'title'
        +-- val
            +-- STRING: '"TOML Example"'

But that's not the end of the story yet.

Step: Use custom tree nodes to represent key features of the language being parsed

The concept of key-value pair is central in TOML, so it makes sense to create a dedicated node class.
This node class will have two child nodes: one for the key part, a second for the value part.
A specific node class has also the advantage to have specialized methods that can help in later phase, something that a generic off-the-shelf Rley::PTree::NonTerminalNode cannot do.

The AST building blocks

The next class diagram lists the types of AST nodes and the relationships between these classes. TOML AST node class hierarchy

The topmost class, the abstract class TOMLASTNode is the generalization of every node in an AST tree. Its immediate descendent, the TOMLLiteralNode and CompositeNode classes are worth mentioning:

  • A TOMLLiteralNode is used to implement a leaf node of an AST. Its value attribute is initialized with a TOML datatype value (e.g. a date-time offset) or a simple key (e.g. an unquoted key).
  • The CompositeNode class is a generalization of parent AST nodes. The class defines the attribute subnodes which refers to the child nodes. For instance, for an TOMLArrayNode its sub-nodes are the elements of the array.

All the concrete classes exhibits at least two methods: to_text and accept. While the purpose of the first method, to_text, is rather obvious: to emit a text representation of the node; the accept methods requires a little explanation.
This accept method is part of the Visitor design pattern implementation.
Many interpreter or compiler require an ability to walk over the nodes of an AST, there the Visitor pattern becomes handy.
All the accept methods follows the same structure, therefore we'll list only one implementation:

  # Implementation of TOMLTableNode#accept method
  def accept(visitor)
    visitor.visit_table_node(self)
  end

In fact, each accept method boils down to a call to a visit_xxxx method, where the xxxx suffix is related to the name of the class hosting the accept method. The purpose of these visit_xxxx methods is to trigger the relevant behavior when visiting a 'xxxx' node. The visited node is passed as an argument to the visit_xxxx method (that's the self pseudo-veriable in the last code snippet).

But before visiting an AST, it might preferable to learn how Rley builds an AST.
That will be the subject of the next installment.