Skip to content

Latest commit

 

History

History
144 lines (94 loc) · 9.8 KB

README.md

File metadata and controls

144 lines (94 loc) · 9.8 KB

IDNI parser library

This C++ parsing library was created because other available parsers were missing features required for development of our non-turing but still very expressive languages:

  • TML (Tau Meta Language), a logical programming language
  • Tau, a software specification language...

This library additionally provides a parser for TGF (Tau Grammar form), an EBNF-like language for describing boolean grammars.

It also provides a tgf tool which takes a grammar described in a TGF file and it can help debug the grammar or generate C++ code of the parser.

Features

  • Boolean (including conjunctive) grammars
  • Earley based parser with garbage collection producing parse forest containing all the parsed trees
  • scannerless (lexerless) parsing
  • forest traversal with callbacks and extraction of trees and graphs to deal with transformations and with ambiguity
  • tgf tool to test grammars in TGF file or to generate a C++ code representing a parser for a given TGF file.
  • predefined or custom classes of characters defined by a function (<cctype>-like functions)
  • optional recoding of an input into another type which can can be used for lexing, reencoding or accessing binary data
  • own UTF-8 support
  • no dependencies on other libraries

Basic terminology

Literal refers to a specific value or sequence of characters that appears exactly as it is in the input or in the grammar rules. It represents a fixed, concrete element that must be matched in order for a particular rule or pattern to be satisfied. Literal can be terminal or non-terminal.

A terminal literal represents a specific token or value in the input language. It corresponds to the actual elements that appear in the input stream during parsing. Terminal literals are also referred to as terminals or terminal symbols.

A non-terminal literal represents a syntactic category or a placeholder that can be replaced by a sequence of terminals and/or non-terminals. Non-terminal literals are used to define the structure of the language and serve as intermediate building blocks for generating valid sentences or expressions.

A production rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS). The left-hand side contains a single non-terminal symbol, and the right-hand side contains a sequence of terminals, non-terminals, or both. The production rule indicates that the non-terminal on the left-hand side can be rewritten or expanded into the sequence of symbols on the right-hand side.

Grammar is a formal system that defines the syntax and structure of a language. It consists of terminals, non-terminals, production rules and a starting symbol (which is usually a non-terminal).

Conjunctive Grammar is a grammar which allows usage of conjunctions of literal sequences in production rules. The conjunction acts as a logical "AND" operation, indicating that all the conjuncted literal sequences must be matched for the whole rule to be satisfied.

Boolean Grammar is a conjunctive grammar which allows also usage of negation of literal sequences. Negation acts as a logical "NOT" operation, indicating that if the negated literal sequence is matched the whole rule fails to be satisfied.

DNF or Disjunctive Normal form is a specific form of representing logical formulas, Boolean expressions or production rules in our case. In DNF, a logical formula is represented as a disjunction (OR) of one or more conjunctions (AND) of literals.

EBNF or Extended Backus–Naur form is a family of metasyntax notations, any of which can be used to express a context-free grammar.

TGF or Tau Grammar Form is an EBNF based form to describe grammars which are usable by this library. TGF understands EBNF syntax (+, *, [] and {}) so it is easier to define repetition or optionality.

Namespace

Library uses an idni namespace for its declarations.

For simplicity, all examples in library documentation expect declaration using namespace idni;.

Templating

Most of the structures provided by this library are templates to enable generic parsing of any input data while parsing any type of terminals. Usually they are declared like this:

template <typename C = char, typename T = C> ... where

  • C is an input type of input data (usually characters).
  • T is a type of a terminal.

Fully supported and tested types are char and char32_t (Unicode support). Though it is possible to supply encoder and decoder functions to your parser and these can convert input elements of any type into terminals of any type.

If T differs from C it is required to provide decoder and encoder functions in parser::options. parser use these recoders to convert input data into terminals and back. See recoders for more information.

Classes and structs

Click on the type name in the following list to see more detailed documentation of its API:

  • nonterminals - id map of non-terminals to their names
  • lit, lits, conjs, disjs, prod and prods - basic building blocks for creating a grammar programatically. prods represents a grammar containing production rules as disjunctions of conjunctions of sequences of literals.
  • char_class_fn, char_class_fns - Character class function is an alternative way to create a production rule. It is a function returning true if a terminal character is a member of a character class. It can be custom or there are predefined cctype-like functions (isalnum, isdigit...) provided. char_class_fns is a container for such functions.
  • shaping_options - options for shaping a parsed tree
  • grammar::options - options for grammar
  • grammar - represents a grammar created from production rules prods and character class functions char_class_fns.
  • parser::options - plain struct of options changing behavior of the parser
  • parser - uses a grammar for parsing an input and produces a result
  • parser::result - result of parsing containing the parsed forest
  • parser::error - detailed information about a parse error
  • parser::input - wraper for an input data pointer, an input stream or a file. It provides an API to access it from a parser or a decoder
  • parser::pnode - parser's forest node
  • forest - a result of parsing containing all the parsed trees. Provides means of traversal or extraction of trees or graphs.
  • forest::graph - a graph extracted from a forest
  • forest::tree - a tree extracted from a graph
  • tgf - Tau Grammar Form parser - reads a grammar in TGF format from a string or a file. An alternative way to describe a grammar instead of creating it programatically.
  • traverser - struct for traversing and accessing rewriter trees
  • rewriting - API for rewriting a resulting parse tree
  • measure - simple struct for measuring time

Classes and structs usable for building command line interfaces

  • cli - provides CLI arguments management and processing
  • repl - provides terminal (currently only Linux is supported) REPL UI
  • term_colors - provides terminal color escape codes

Functions

There are several areas covered by functions provided by this library

  • predefined character class functions - cctype-like predefined character class functions
  • recoders - decoder and encoder for UTF-8 in char <-> Unicode in char32_t
  • UTF-8 - UTF-8 and Unicode support
  • devhelpers - helper forest transformations to various formats (TML facts, TML rules, DOT) useful when developing a parser

TGF - Tau Grammar Form

TGF is an EBNF-based form to describe grammars. Specification of the form can be found on page Tau Grammar Form

TGF tool

This library comes with a CLI executable tgf which features viewing, testing and debugging grammars written in TGF and it can also generate a C++ code from TGF grammar which is then usable in a C++ project.

More detailed information about this CLI can be found on page TGF tool

Tutorials

CSV parser

Beginner tutorial to quick start using this library.

You can learn how to use this library in a CSV parser tutorial.

Part 1 shows a very simple parser with all it's required. Each following part expands the previous one. Read comments since they are always related to a new or a changed code.

  • part 1 - minimal parser of positive integers
  • part 2 - using predefined character classes
  • part 3 - refactor parser into its struct
  • part 4 - parse also negative integers
  • part 5 - parse strings and nulls, traverse
  • part 6 - parse comma separated values
  • part 7 - parse new line separated rows
  • part 8 - replace programmatically created grammar with a grammar in TGF
  • part 9 - using EBNF syntax in TGF
  • part 10, csv.tgf - using tgf tool to generate a parser from a TGF file