-
Notifications
You must be signed in to change notification settings - Fork 3
TUTO~4b: Ingredients for an AST
Abstraction is selective ignorance
(Andrew Koening, Barbara Moo: "Ruminations on C++";
Addison-Wesley, IBSN: 978-0-201-42339-1, page 9)
If one jumps to the iteration 3 folder, and tries from there the command-line:
ruby toml_demo.rb
The script emits a lengthy tree representation of a CST (aka "parse tree"). Here we limit to the 40 first lines:
toml
+-- expr-list
+-- expr-list
| +-- expr-list
| | +-- expr-list
| | | +-- expr-list
| | | | +-- expr-list
| | | | | +-- expr-list
| | | | | | +-- expr-list
| | | | | | | +-- expr-list
| | | | | | | | +-- expr-list
| | | | | | | | | +-- expr-list
| | | | | | | | | | +-- expr-list
| | | | | | | | | | | +-- expr-list
| | | | | | | | | | | | +-- expr-list
| | | | | | | | | | | | | +-- expr-list
| | | | | | | | | | | | | | +-- expr-list
| | | | | | | | | | | | | | +-- expression
| | | | | | | | | | | | | | +-- keyval
| | | | | | | | | | | | | | +-- key
| | | | | | | | | | | | | | | +-- simple-key
| | | | | | | | | | | | | | | +-- UNQUOTED-KEY: 'title'
| | | | | | | | | | | | | | +-- EQUAL: '='
| | | | | | | | | | | | | | +-- val
| | | | | | | | | | | | | | +-- STRING: '"TOML Example"'
| | | | | | | | | | | | | +-- expression
| | | | | | | | | | | | | +-- table
| | | | | | | | | | | | | +-- std-table
| | | | | | | | | | | | | +-- LBRACKET: '['
| | | | | | | | | | | | | +-- key
| | | | | | | | | | | | | | +-- simple-key
| | | | | | | | | | | | | | +-- UNQUOTED-KEY: 'owner'
| | | | | | | | | | | | | +-- RBRACKET: ']'
| | | | | | | | | | | | +-- expression
| | | | | | | | | | | | +-- keyval
| | | | | | | | | | | | +-- key
| | | | | | | | | | | | | +-- simple-key
| | | | | | | | | | | | | +-- UNQUOTED-KEY: 'name'
| | | | | | | | | | | | +-- EQUAL: '='
| | | | | | | | | | | | +-- val
| | | | | | | | | | | | +-- STRING: '"Tom Preston-Werner"'
It doesn't take long to grasp the following details:
- The root of the tree is the start symbol (here: toml)
- To every left-hand side of a rule, a parent node (i.e. a
Rley::PTree::NonTerminalNode
) is constructed - To each right-side member a child node is created
- A child node that corresponds to a non-terminal symbol gets its own children by applying the rule
- The tree is built from the application of the syntax rules required to match the input string to the start symbol.
- A child node that corresponds to a terminal symbol is a leaf node and is an
Rley::PTree::TerminalNode
instance.
The advantage, of CST (parse trees) is that they can automatically be constructed.
On the downside, these tend to be lengthy and deeply nested, making them unwieldy for subsequent processing.
Furthermore, being derived directly from the rules, they are brittle to grammar changes.
An abstract syntax tree, is a custom tree data structure aimed to represent the parse results in a convenient way for the semantic analysis phase.
Semantic analysis covers topics such as type checking, name resolution, and other prepatory steps before the code generation (or execution).
To simplify matters, an AST could be built from a parse tree (CST) with the following changes:
- Remove nodes corresponding to separator and delimiter tokens
- Replace node nesting due to recursive rules by arrays
- Remove unnecessary intermediate nodes
- Use custom tree nodes to represent key features of the language being parsed
Let's zoom-in a part of the previous parse tree:
+-- expression
+-- keyval
+-- key
| +-- simple-key
| +-- UNQUOTED-KEY: 'title'
+-- EQUAL: '='
+-- val
+-- STRING: '"TOML Example"'
The node corresponding to the equal sign (=) is useless: the equal sign was used to separate the key from its associated value in a key-value pair. But since both sides of the equal sign have their own node, there is no more need for a separator node.
The node labelled simple-key
can be dropped because it is redundant: an UNQUOTED-KEY terminal is one type of simple key.
Thus our previous partial parse tree can be simplified to something like:
+-- expression
+-- keyval
+-- key
| +-- UNQUOTED-KEY: 'title'
+-- val
+-- STRING: '"TOML Example"'
But that's not the end of the story yet.
The concept of key-value pair is central in TOML, so it makes sense to create a dedicated node class.
This node class will have two child nodes: one for the key part, a second for the value part.
A specific node class has also the advantage to have specialized methods that can help in later phase, something that a generic off-the-shelf Rley::PTree::NonTerminalNode
cannot do.
The next class diagram lists the types of AST nodes and the relationships between these classes.
The topmost class, the abstract class TOMLASTNode
is the generalization of every node in an AST tree. Its immediate descendent, the TOMLLiteralNode
and CompositeNode
classes are worth mentioning:
- A
TOMLLiteralNode
is used to implement a leaf node of an AST. Itsvalue
attribute is initialized with a TOML datatype value (e.g. a date-time offset) or a simple key (e.g. an unquoted key). - The
CompositeNode
class is a generalization of parent AST nodes. The class defines the attributesubnodes
which refers to the child nodes. For instance, for anTOMLArrayNode
its sub-nodes are the elements of the array.
All the concrete classes exhibits at least two methods: to_text
and accept
.
While the purpose of the first method, to_text
, is rather obvious: to emit a text representation of the node; the accept
methods requires a little explanation.
This accept
method is part of the Visitor design pattern implementation.
Many interpreter or compiler require an ability to walk over the nodes of an AST, there the Visitor pattern becomes handy.
All the accept methods follows the same structure, therefore we'll list only one implementation:
# Implementation of TOMLTableNode#accept method
def accept(visitor)
visitor.visit_table_node(self)
end
In fact, each accept
method boils down to a call to a visit_xxxx
method, where the xxxx suffix is related to the name of the class hosting the accept method. The purpose of these visit_xxxx
methods is to trigger the relevant behavior when visiting a 'xxxx' node. The visited node is passed as an argument to the visit_xxxx
method (that's the self
pseudo-veriable in the last code snippet).
But before visiting an AST, it might preferable to learn how Rley builds an AST.
That will be the subject of the next installment.