Skip to content

arithy/packcc

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PackCC

Overview

PackCC is a parser generator for C. Its main features are as follows:

  • Generates your parser in C from a grammar described in a PEG,
  • Gives your parser great efficiency by packrat parsing,
  • Supports direct and indirect left-recursive grammar rules.

The grammar of your parser can be described in a PEG (Parsing Expression Grammar). The PEG is a top-down parsing language, and is similar to the regular-expression grammar. Compared with a bottom-up parsing language, like Yacc's one, the PEG is much more intuitive and cannot be ambiguous. The PEG does not require tokenization to be a separate step, and tokenization rules can be written in the same way as any other grammar rules.

Your generated parser can parse inputs very efficiently by packrat parsing. The packrat parsing is the recursive descent parsing algorithm that is accelerated using memoization. By using packrat parsing, any input can be parsed in linear time. Without it, however, the resulting parser could exhibit exponential time performance in the worst case due to the unlimited look-ahead capability.

Unlike common packrat parsers, PackCC can support direct and indirect left-recursive grammar rules. This powerful feature enables you to describe your language grammar in a much simpler way. (The algorithm is based on the paper "Packrat Parsers Can Support Left Recursion" authored by A. Warth, J. R. Douglass, and T. Millstein.)

Some additional features are as follows:

  • Thread-safe and reentrant,
  • Supports UTF-8 multibyte characters (version 1.4.0 or later),
  • Generates more ease-of-understanding parser source codes,
  • Consists of just a single compact source file,
  • Under MIT license. (not under a certain contagious license!)

The generated code is beautified and as ease-of-understanding as possible. Actually, it uses lots of goto statements, but the control flows are much more traceable than goto spaghetti storms generated by Yacc or other parser generators. This feature is irrelevant to common users, but helpful for PackCC developers to debug it.

PackCC itself is under MIT license, but you can distribute your generated code under any license you like.

Installation

Building Directly

You can obtain the executable packcc by compiling src/packcc.c using your favorite C compiler. An example of the compile command is shown below.

cc -o packcc src/packcc.c

Using CMake

For your convenience, a build environment using CMake is prepared. You can build and test packcc by the following commands.

Setup: First, you have to execute the commands shown below.

mkdir build
cd build
cmake ..

Build: You can build the executable packcc by the command shown below.

# Execute this in the directory `build`.
cmake --build . --config Release

If using a build system for a Unix-like OS, MinGW, or Cygwin, which is typically make, the executable packcc will be created in the current directory build. If using Visual Studio on Windows, it will be created in the directory build\Release.

Test: Optionally, you can test packcc by the command shown below (Visual Studio is not supported). Bats-core and Uncrustify are required (see tests/README.md for details).

# Execute this in the directory `build`.
cmake --build . --config Release --target check

Install: If you want to install packcc with the import files in your system, you can use the command shown below.

# Execute this in the directory `build`.
cmake --install . --config Release

For a Unix-like OS, MinGW, or Cygwin, they will be installed in the directory /usr/local. If using Visual Studio, they will be installed in the directory C:\Program Files (x86)\packcc by default. You can change the installation directory by using the option --prefix.

# Execute this in the directory `build`.
cmake --install . --config Release --prefix /path/to/install/directory

Here, /path/to/install/directory denotes the installation directory.

Note that you have to execute the installation command with the administrative permission, for example, by becoming 'root' or by using sudo command, unless your user account is the owner of the installation directory.

Usage

Command

You must prepare a PEG source file in advance. For details of the PEG source syntax, see the section "Syntax". Here, let the file name example.peg for example.

packcc example.peg

By running this, the parser source example.h and example.c are generated.

If no PEG file name is specified, the PEG source is read from the standard input, and -.h and -.c will be generated.

The base name of the parser source files can be changed by -o option.

packcc -o parser example.peg

By running this, the parser source parser.h and parser.c are generated. This option can be specified only once.

A directory to search for import files can be added by -I option (version 2.0.0 or later). This option can be specified as many times as needed. The firstly specified directory will be searched first, the secondly specified directory will be searched next, and so on.

packcc -I foo -I bar/baz example.peg

By running this, the directory foo is searched first, and the directory bar/baz is searched next. The directories specified by this option have higher priority than those specified in the environment variable PCC_IMPORT_PATH and the default directories. For more details of import, see the explanation of %import written in the section "Syntax".

If you want to disable UTF-8 support, specify the command line option -a or --ascii (version 1.4.0 or later).

If you want to insert #line directives in the generated source and header files, specify the command line option -l or --lines (version 1.7.0 or later). It is helpful to trace compilation errors of the generated source and header files back to the codes written in the PEG source file.

If you want to confirm the version of the packcc command, execute the below.

packcc -v

Syntax

A grammar consists of a set of named rules. A rule definition can be split into multiple lines.

rulename <- pattern

The rulename is the name of the rule to define. The pattern is a text pattern that contains one or more of the following elements.

rulename

The element stands for the entire pattern in the rule with the name given by rulename.

variable:rulename

The element stands for the entire pattern in the rule with the name given by rulename. The variable is an identifier of a rule variable associated with the semantic value returned from the rule by assigning to $$ in its action. The identifier can be referred to in subsequent actions as a variable. The example is shown below.

term <- l:term _ '+' _ r:factor { $$ = l + r; }

A rule variable identifier must consist of alphabets (both uppercase and lowercase letters), digits, and underscores. The first letter must be an alphabet. The reserved keywords in C cannot be used.

sequence1 / sequence2 / ... / sequenceN

Each sequence is tried in turn until one of them matches, at which time matching for the overall pattern succeeds. If no sequence matches then the matching for the overall pattern fails. The operator slash (/) has the least priority. The example is shown below.

'foo' rule1 / 'bar'+ [0-9]? / rule2

This pattern tries matching of the first sequence ('foo' rule1). If it succeeds, then the overall pattern matching succeeds and ends without evaluating the subsequent sequences. Otherwise, it tries matching of the next sequence ('bar'+ [0-9]?). If it succeeds, then the overall pattern matching succeeds and ends without evaluating the subsequent sequence. Finally, it tries matching of the last sequence (rule2). If it succeeds, then the overall pattern matching succeeds. Otherwise, the overall pattern matching fails.

'string'

A character or string enclosed in single quotes is matched literally. The ANSI C escape sequences are recognized within the characters. The UNICODE escape sequences (ex. \u20AC) are also recognized including surrogate pairs, if the command line option -a is not specified (version 1.4.0 or later). The example is shown below.

'foo bar'

"string"

A character or string enclosed in double quotes is matched literally. The ANSI C escape sequences are recognized within the characters. The UNICODE escape sequences (ex. \u20AC) are also recognized including surrogate pairs, if the command line option -a is not specified (version 1.4.0 or later). The example is shown below.

"foo bar"

[character class]

A set of characters enclosed in square brackets matches any single character from the set. The ANSI C escape sequences are recognized within the characters. The UNICODE escape sequences (ex. \u20AC) are also recognized including surrogate pairs, if the command line option -a is not specified (version 1.4.0 or later). If the set begins with a caret (^), the set is negated (the element matches any character not in the set). Any pair of characters separated with a dash (-) represents the range of characters from the first to the second, inclusive. The examples are shown below.

[abc]
[^abc]
[a-zA-Z0-9_]

.

A dot (.) matches any single character. Note that the only time this fails is at the end of the input, where there is no character to match.

^

A caret (^) matches the beginning of the input (version 2.1.0 or later).

element ?

It specifies that the element is optional. If present on the input, it is consumed and the match succeeds. If not present on the input, no text is consumed and the match succeeds anyway.

element *

It specifies that the element is optional and repeatable. If present on the input, one or more occurrences of the element are consumed and the match succeeds. If no occurrence of the element is present on the input, the match succeeds anyway.

element +

It specifies that the element is repeatable. If present on the input, one or more occurrences of the element are consumed and the match succeeds. If no occurrence of the element is present on the input, the match fails.

& element

The predicate succeeds only if the element can be matched. The input text scanned while matching element is not consumed from the input and remains available for subsequent matching.

! element

The predicate succeeds only if the element cannot be matched. The input text scanned while matching element is not consumed from the input and remains available for subsequent matching.

A popular idiom is the following, which matches the end of the input, after the last character of the input has already been consumed.

!.

( pattern )

Parentheses are used for grouping (modifying the precedence of the pattern).

< pattern >

Angle brackets are used for grouping (modifying the precedence of the pattern) and text capturing. The captured text is numbered in evaluation order, and can be referred to later using $1, $2, etc.

$n

A dollar ($) followed by a positive integer represents a text previously captured. The positive integer corresponds to the order of capturing. A $1 represents the first captured text. The examples are shown below.

< [0-9]+ > 'foo' $1

This matches 0foo0, 123foo123, etc.

'[' < '='* > '[' ( !( ']' $1 ']' ) . )* ( ']' $1 ']' )

This matches [[...]], [=[...]=], [==[...]==], etc.

{ c source code }

Curly braces surround an action. The action is arbitrary C source code to be executed at the end of matching.

The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source code are replaced with the prefix and the uppercased one respectively (version 2.1.0 or later). For the details, see the explanation of %prefix.

Any braces within the action must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored.

One or more actions can be inserted in any places between elements, at the start, and at the end in the pattern. Actions are not executed where matching fails.

[0-9]+ 'foo' { puts("OK"); } 'bar' / [0-9]+ 'foo' 'baz'

In this example, if the input is 012foobar, the action { puts("OK"); } is to be executed, but if the input is 012foobaz, the action is not to be executed. All matched actions are guaranteed to be executed only once.

In the action, the C source code can use the predefined variables below.

  • $$ : The output variable, to which the result of the rule is to be stored. The data type is the one specified by %value. The default data type is int.
  • auxil : The user-defined data that has been given via the API function pcc_create(). The data type is the one specified by %auxil. The default data type is void *.
  • variable : The rule variable to store the result of another rule that has already been evaluated. If the rule has not been evaluated, it is ensured that the value is zero-cleared (version 1.7.1 or later). The data type is the one specified by %value. The default data type is int.
  • $n : The string of the captured text. The n is the positive integer that corresponds to the order of capturing. The variable $1 holds the string of the first captured text. It is read-only.
  • $ns : The start position in the input of the captured text, inclusive. The n is the positive integer that corresponds to the order of capturing. The variable $1s holds the start position of the first captured text. It is read-only.
  • $ne : The end position in the input of the captured text, exclusive. The n is the positive integer that corresponds to the order of capturing. The variable $1e holds the end position of the first captured text. It is read-only.
  • $0 : The string of the text between the start position in the input at which the rule pattern begins to match and the current position in the input at which the element immediately before the action ends to match. It is read-only.
  • $0s : The start position in the input at which the rule pattern begins to match. It is read-only.
  • $0e : The current position in the input at which the element immediately before the action ends to match. It is read-only.
  • @variable : The marker variable (version 3.0.0 or later). For the details, see the explanation of %marker. It is read-only.

An example is shown below.

term <- l:term _ '+' _ r:factor { $$ = l + r; }
factor <- < [0-9]+ >            { $$ = atoi($1); }
_ <- [ \t]*

Note that the string data held by $n and $0 are discarded immediately after evaluation of the action. If the string data are needed after the action, they must be copied in $$ or auxil. If they are required to be copied in $$, it is recommended to define a structure as the type of output data using %value, and to copy the necessary string data in its member variable. Similarly, if they are required to be copied in auxil, it is recommended to define a structure as the type of user-defined data using %auxil, and to copy the necessary string data in its member variable.

The variable @@ to store a result of the programmable predicate cannot be used in actions (the programmable predicate is available since version 3.0.0).

The position values are 0-based; that is, the first position is 0. The data type is size_t (before version 1.4.0, it was int).

element ~ { c source code }

Curly braces following tilde (~) surround an error action. The error action is arbitrary C source code to be executed at the end of matching only if the preceding element matching fails.

The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source code are replaced with the prefix and the uppercased one respectively (version 2.1.0 or later). For the details, see the explanation of %prefix.

Any braces within the error action must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored.

One or more error actions can be inserted in any places after elements in the pattern. The operator tilde (~) binds less tightly than any other operator except alternation (/) and sequencing.

The error action is intended to make error handling and recovery code easier to write. In the error action, all predefined variables described above are available as well.

The examples are shown below.

rule1 <- e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
rule2 <- (e1 e2 e3) ~{ error("one of e[123] has failed"); }

& { c source code }

Curly braces following ampersand (&) surround a programmable predicate (version 3.0.0 or later). The programmable predicate is arbitrary C source code to be executed during matching. The matching continues only if a nonzero value is assigned to the output variable @@. The initial value of @@ is 1. No input text is not consumed from the input and remains available for subsequent matching.

The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source code are replaced with the prefix and the uppercased one respectively. For the details, see the explanation of %prefix.

Any braces within the programmable predicate must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored.

One or more programmable predicates can be inserted in any places between elements, at the start, and at the end in the pattern. All values of marker variables are rolled back when subsequent matching fails.

In the programmable predicate, the C source code can use the predefined variables below.

  • @@ : The output variable, to which the result of the predicate is to be stored. The matching continues only if a nonzero value is assigned.
  • auxil : The user-defined data that has been given via the API function pcc_create(). The data type is the one specified by %auxil. The default data type is void *.
  • $n : The string of the captured text. The n is the positive integer that corresponds to the order of capturing. The variable $1 holds the string of the first captured text. It is read-only.
  • $ns : The start position in the input of the captured text, inclusive. The n is the positive integer that corresponds to the order of capturing. The variable $1s holds the start position of the first captured text. It is read-only.
  • $ne : The end position in the input of the captured text, exclusive. The n is the positive integer that corresponds to the order of capturing. The variable $1e holds the end position of the first captured text. It is read-only.
  • $0 : The string of the text between the start position in the input at which the rule pattern begins to match and the current position in the input at which the element immediately before the programmable predicate. It is read-only.
  • $0s : The start position in the input at which the rule pattern begins to match. It is read-only.
  • $0e : The current position in the input at which the element immediately before the programmable predicate. It is read-only.
  • @variable : The marker variable. For the details, see the explanation of %marker. It is writable.

Rule variables and $$ are not accessible in programmable predicates.

Caution: The result of the programmable predicate MUST always be the same in the situation where the combination of the values of the following variables is the same: all of $n, $ns, and $ne (n is a non-negative integer), and all marker variables @variable. Otherwise, the generated parser may work incorrectly.

An example is shown below.

%marker @count
rule <- &{ @count = 0 } ( '#' &{ @count++ } )* &{ @@ = (@count >= 5 && @count <= 10); }

This matches the string that is a sequence of '#' and its length is either of 5 to 10.

Note that the string data held by $n and $0 can be discarded immediately after evaluation of the programmable predicate. If the string data are needed after the programmable predicate execution, they must be copied in any of marker variables or auxil. If they are required to be copied in auxil, it is recommended to define a structure as the type of user-defined data using %auxil, and to copy the necessary string data in its member variable.

The position values are 0-based; that is, the first position is 0. The data type is size_t.

! { c source code }

Curly braces following exclamation (!) surround a negative programmable predicate (version 3.0.0 or later). The negative programmable predicate is arbitrary C source code to be executed during matching. The matching continues only if 0 is assigned to the output variable @@. The initial value of @@ is 0. No input text is not consumed from the input and remains available for subsequent matching.

The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source code are replaced with the prefix and the uppercased one respectively. For the details, see the explanation of %prefix.

Any braces within the negative programmable predicate must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored.

One or more negative programmable predicates can be inserted in any places between elements, at the start, and at the end in the pattern. All values of marker variables are rolled back when subsequent matching fails.

In the negative programmable predicate, the C source code can use the predefined variables shown in the explanation of the programmable predicate (& { c source code }).

Caution: The result of the programmable predicate MUST always be the same in the situation where the combination of the values of the following variables is the same: all of $n, $ns, and $ne (n is a non-negative integer), and all marker variables @variable. Otherwise, the generated parser may work incorrectly.

%header { c source code }

The specified C source code is copied basically verbatim to the C header file before the generated parser API function declarations. There are the following exceptions in verbatim copying.

  • The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source codes are replaced with the prefix and the uppercased one respectively (version 2.1.0 or later). For the details, see the explanation of %prefix.
  • The escape sequences \$ and \@ anywhere in C source codes are replaced with $ and @ respectively (version 3.0.0 or later).

Any braces in the C source code must be properly nested. Note that braces in directive lines, comments (/*...*/ and //...), character literals, and string literals are appropriately ignored.

When %header is used multiple times, the respective C source codes are copied in order of their appearance.

%source { c source code }

The specified C source code is copied basically verbatim to the C source file before the generated parser implementation code. There are the following exceptions in verbatim copying.

  • The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source codes are replaced with the prefix and the uppercased one respectively (version 2.1.0 or later). For the details, see the explanation of %prefix.
  • The escape sequences \$ and \@ anywhere in C source codes are replaced with $ and @ respectively (version 3.0.0 or later).

Any braces in the C source code must be properly nested. Note that braces in directive lines, comments (/*...*/ and //...), character literals, and string literals are appropriately ignored.

When %source is used multiple times, the respective C source codes are copied in order of their appearance.

%common { c source code }

This has the same effect as %header { c source code } %source { c source code }.

The specified C source code is copied basically verbatim to both of the C header file and the C source file before the generated parser API function declarations and the implementation code respectively. There are the following exceptions in verbatim copying.

  • The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source codes are replaced with the prefix and the uppercased one respectively (version 2.1.0 or later). For the details, see the explanation of %prefix.
  • The escape sequences \$ and \@ anywhere in C source codes are replaced with $ and @ respectively (version 3.0.0 or later).

Any braces in the C source code must be properly nested. Note that braces in directive lines, comments (/*...*/ and //...), character literals, and string literals are appropriately ignored.

%earlyheader { c source code }

%earlysource { c source code }

%earlycommon { c source code }

Same as %header, %source and %common, respectively. The only difference is that these directives place the code at the very beginning of the generated file, before any code or includes generated by PackCC. This can be useful for example when it is necessary to modify behavior of standard libraries via a macro definition.

%value "output data type"

The type of output data, which is output as $$ in each action and can be retrieved from the parser API function pcc_parse(), is changed to the specified one from the default int. This can be used only once and cannot be used in imported files.

%auxil "user-defined data type"

The type of user-defined data, which is passed to the parser API function pcc_create(), is changed to the specified one from the default void *. This can be used only once and cannot be used in imported files.

%prefix "prefix"

The prefix of the parser API functions is changed to the specified one from the default pcc. This can be used only once and cannot be used in imported files.

To make an identifier unique to the generated parser, the intrinsic macros ${prefix} and ${PREFIX} are useful (version 2.1.0 or later). Anywhere in C source codes in any parts, the text ${prefix} is replaced with the prefix itself, and the text ${PREFIX} is replaced with the uppercased prefix. An example is shown below.

%prefix "my"

%source {
#define ${PREFIX}_CONSTANT 1
void ${prefix}_func(void);
}

The C source code above will become as follows.

#define MY_CONSTANT 1
void my_func(void);

Note that the intrinsic macro replacement is in effect even in C preprocessor macros, comments, and string literals.

%marker @identifier1 [ @identifier2 ... ]

The marker variables with the respective specified identifiers are declared (version 3.0.0 or later). This can be used multiple times and can be used also in imported files. A marker variable identifier must consist of alphabets (both uppercase and lowercase letters), digits, and underscores. The first letter must be an alphabet. All identifiers of the declared marker variables must be different from each other.

Marker variables are intended to be used as markers of the input text to be parsed, so that the parser can confine the text range where a specific pattern matches. The values of marker variables are associated with each text position.

A marker variable can be dealt as if it were an integer variable. The data type is ptrdiff_t, which is a signed integer type with the same bit length as pointer types. Usually, the bit length is 32 bits in 32-bit programs, and 64 bits in 64-bit programs. The initial integer value is 0. An example is shown below.

%marker @hashes_l @hashes_r

title <- &{ @hashes_l = 0; } ( '#' &{ @hashes_l++; } )+ [^#]+ &{ @hashes_r = 0; } ( '#' &{ @hashes_r++; } )* &{ @@ = (@hashes_l >= @hashes_r); }

Here, @hashes_l and @hashes_r are marker variables. This rule matches the following text for instance.

# xxx
# xxx #
### xxx
### xxx #
### xxx ##
### xxx ###

However, it fails to match following text for instance.

# xxx ###
### xxx ####

A marker variable can also have a string value, without affecting the existing integer value. The initial string value is NULL.

%marker @var_0 @var_1

rule_0 <- 'abc'* &{ @var_0.set_string("foo"); }
rule_1 <- < '012'+ > &{ @var_1.set_string($1); @var_1.append_string(@var_0.get_string()); }

The methods of marker variables are listed below.

  • @variable.get_string() : returns the string (const char *) stored in the marker variable; can be NULL; read-only.
  • @variable.set_string(str) : stores a string (const char *) in the marker variable; the argument str can be NULL.
  • @variable.append_string(str) : appends a string (const char *) to the string stored in the marker variable; the argument str can be NULL, which has no effect.
  • @variable.save() : saves the current integer and string values by pushing them into the variable-specific stack.
  • @variable.restore() : restores the previous integer and string values by popping them from the variable-specific stack.

Note that the integer and string values are 0 and NULL respectively if restore() is done more times than save() was done.

Marker variables are read-only in actions while they are writable in programmable predicates. In the C source codes at the other parts, all marker variables are inaccessible.

%import "import file name"

The content of the specified import file is expanded at the text location of %import (version 2.0.0 or later). This can be used multiple times anywhere and can be used also in imported files. The import file name can be a relative path to the current directory or an absolute path. If it is a relative path, the directories listed below are searched for the import file in the listed order.

  1. the directory where the file that imports the import file is located
  2. the directories specified with -I options
    • They are prioritized in order of their appearance in the command line.
  3. the directories specified by the environment variable PCC_IMPORT_PATH
    • They are prioritized in order of their appearance in the value of this variable.
    • The character used as a delimiter between directory names is the colon ':' if PackCC is built for a Unix-like platform such as Linux, macOS, and MinGW. The character is the semicolon ';' if PackCC is built as a native Windows executable. (This is exactly the same manner as the environment variable PATH.)
  4. the per-user default directory
    • This is the subdirectory .packcc/import in the home directory if PackCC is built for a Unix-like platform, and in the user profile directory, "C:\Users\username" for example, if PackCC is built as a native Windows executable.
  5. the system-wide default directory
    • This is the directory /usr/share/packcc/import if PackCC is built for a Unix-like platform, and is the subdirectory packcc/import in the common application data directory, "C:\ProgramData" for example.

Note that the file imported once is silently ignored when it is attempted to be imported again.

#comment

A comment can be inserted between # and the end of the line.

%%

A double percent %% terminates the section for rule definitions of the grammar. All text following %% is copied basically verbatim to the C source file after the generated parser implementation code. There are the following exceptions in verbatim copying.

  • The intrinsic macros ${prefix} and ${PREFIX} anywhere in C source codes are replaced with the prefix and the uppercased one respectively (version 2.1.0 or later). For the details, see the explanation of %prefix.
  • The escape sequences \$ and \@ anywhere in C source codes are replaced with $ and @ respectively (version 3.0.0 or later).

(The specification is determined by referring to peg/leg developed by Ian Piumarta.)

Import Files

The following import files are currently bundled.

For details, see here.

Macros

Some macros are prepared to customize the parser. The macro definition should be in %source section in the PEG source.

%source {
#define PCC_GETCHAR(auxil) get_character((auxil)->input)
#define PCC_BUFFERSIZE 1024
}

The following macros are available.

PCC_GETCHAR(auxil)

The function macro to get a character from the input. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if no user-defined data. This macro must return a character code as an int type, or -1 if the input ends.

The default is defined as below.

#define PCC_GETCHAR(auxil) getchar()

PCC_ERROR(auxil)

The function macro to handle a syntax error. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if no user-defined data. This macro need not return a value. It may abort the process (by using exit() for example) when a fatal error occurs, and can also return normally to deal with warnings.

The default is defined as below.

#define PCC_ERROR(auxil) pcc_error()
static void pcc_error(void) {
    fprintf(stderr, "Syntax error\n");
    exit(1);
}

PCC_MALLOC(auxil,size)

The function macro to allocate a memory block. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if no user-defined data. The argument size is the number of bytes to allocate. This macro must return a pointer to the allocated memory block, or NULL if no sufficient memory is available.

The default is defined as below.

#define PCC_MALLOC(auxil, size) pcc_malloc_e(size)
static void *pcc_malloc_e(size_t size) {
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(1);
    }
    return p;
}

PCC_REALLOC(auxil,ptr,size)

The function macro to reallocate the existing memory block. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if no user-defined data. The argument ptr is the pointer to the previously allocated memory block. The argument size is the new number of bytes to reallocate. This macro must return a pointer to the reallocated memory block, or NULL if no sufficient memory is available. The contents of the memory block should be left unchanged in any case even if the reallocation fails.

The default is defined as below.

#define PCC_REALLOC(auxil, ptr, size) pcc_realloc_e(ptr, size)
static void *pcc_realloc_e(void *ptr, size_t size) {
    void *p = realloc(ptr, size);
    if (p == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(1);
    }
    return p;
}

PCC_FREE(auxil,ptr)

The function macro to free the existing memory block. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if no user-defined data. The argument ptr is the pointer to the previously allocated memory block. This macro need not return a value.

The default is defined as below.

#define PCC_FREE(auxil, ptr) free(ptr)

PCC_DEBUG(auxil,event,rule,level,pos,buffer,length)

The function macro for debugging (version 1.5.0 or later). Sometimes, especially for complex parsers, it is useful to see how exactly the parser processes the input. This macro is called on important events and allows to log or display the current state of the parser. The argument rule is a string that contains the name of the currently evaluated rule. The non-negative integer level specifies how deep in the rule hierarchy the parser currently is. The argument pos holds the position from the start of the current context in bytes. In case of event == PCC_DBG_MATCH, the argument buffer holds the matched input and length is its size. For other events, buffer and length indicate a part of the currently loaded input, which is used to evaluate the current rule.

Caution: Since version 1.6.0, the first argument auxil is added to this macro. The user-defined data passed to the API function pcc_create() can be retrieved from this argument.

There are currently three supported events:

  • PCC_DBG_EVALUATE (= 0) - called when the parser starts to evaluate rule
  • PCC_DBG_MATCH (= 1) - called when rule is matched, at which point buffer holds entire matched string
  • PCC_DBG_NOMATCH (= 2) - called when the parser determines that the input does not match currently evaluated rule

A very simple implementation could look like this:

static const char *dbg_str[] = { "Evaluating rule", "Matched rule", "Abandoning rule" };
#define PCC_DEBUG(auxil, event, rule, level, pos, buffer, length) \
    fprintf(stderr, "%*s%s %s @%zu [%.*s]\n", (int)((level) * 2), "", dbg_str[event], rule, pos, (int)(length), buffer)

The default is to do nothing:

#define PCC_DEBUG(auxil, event, rule, level, pos, buffer, length) ((void)0)

PCC_BUFFERSIZE

The initial size (the number of characters) of the text buffer. The text buffer is expanded as needed. The default is 256.

PCC_ARRAYSIZE

The initial size (the number of elements) of the internal arrays other than the text buffer. The arrays are expanded as needed. The default is 2.

API

The parser API has only 3 simple functions below.

pcc_context_t *pcc_create(void *auxil);

Creates a parser context. This context needs to be passed to the functions below. The auxil can be used to pass user-defined data to be bound to the context. NULL can be specified if no user-defined data.

int pcc_parse(pcc_context_t *ctx, int *ret);

Parses an input text (from the standard input by default) and returns the result in ret. The ret can be NULL if no output data is needed. This function returns 0 if no text is left to be parsed, or a nonzero value otherwise.

void pcc_destroy(pcc_context_t *ctx);

Destroys the parser context. All resources allocated in the parser context are released.

The type of output data ret can be changed. If you want change it to char *, specify %value "char *" in the PEG source. The default is int.

The type of user-defined data auxil can be changed. If you want change it to long, specify %auxil "long" in the PEG source. The default is void *.

The prefix pcc can be changed. If you want change it to foo, specify %prefix "foo" in the PEG source. The default is pcc.

After the above settings, the API functions change like below.

foo_context_t *foo_create(long auxil);
int foo_parse(foo_context_t *ctx, char **ret);
void foo_destroy(foo_context_t *ctx);

The typical usage of the API functions is shown below.

int ret;
pcc_context_t *ctx = pcc_create(NULL);
while (pcc_parse(ctx, &ret));
pcc_destroy(ctx);

Examples

Desktop Calculator

A simple example which provides interactive four arithmetic operations of integers is shown here. Note that left-recursive grammar rules are defined in this example.

%prefix "calc"

%source {
#include <stdio.h>
#include <stdlib.h>
}

statement <- _ e:expression _ EOL { printf("answer=%d\n", e); }
           / ( !EOL . )* EOL      { printf("error\n"); }

expression <- e:term { $$ = e; }

term <- l:term _ '+' _ r:factor { $$ = l + r; }
      / l:term _ '-' _ r:factor { $$ = l - r; }
      / e:factor                { $$ = e; }

factor <- l:factor _ '*' _ r:unary { $$ = l * r; }
        / l:factor _ '/' _ r:unary { $$ = l / r; }
        / e:unary                  { $$ = e; }

unary <- '+' _ e:unary { $$ = +e; }
       / '-' _ e:unary { $$ = -e; }
       / e:primary     { $$ = e; }

primary <- < [0-9]+ >               { $$ = atoi($1); }
         / '(' _ e:expression _ ')' { $$ = e; }

_      <- [ \t]*
EOL    <- '\n' / '\r\n' / '\r' / ';'

%%
int main() {
    calc_context_t *ctx = calc_create(NULL);
    while (calc_parse(ctx, NULL));
    calc_destroy(ctx);
    return 0;
}

An execution example is as follows.

$ ./calc↵
1+2*(3+4*(5+6))↵
answer=95
5*6*7*8/(1*2*3*4)↵
answer=70

Simple AST builder

An example which builds an AST (abstract syntax tree) and dumps it is shown here. This example accepts the same inputs as Desktop Calculator shown above.

%prefix "calc"

%value "calc_ast_node_t *"    # <-- must be set

%auxil "calc_ast_manager_t *" # <-- must be set

%header {
#define CALC_AST_NODE_CUSTOM_DATA_DEFINED /* <-- enables node custom data */

typedef struct text_data_tag { /* <-- node custom data type */
    char *text;
} calc_ast_node_custom_data_t;
}

%source {
#include <stdio.h>
#include <string.h>
}

statement <- _ e:expression _ EOL { $$ = e; }
           / ( !EOL . )* EOL      { $$ = NULL; }

expression <- e:term { $$ = e; }

term <- l:term _ '+' _ r:factor { $$ = calc_ast_node__create_2(l, r); $$->custom.text = strdup("+"); }
      / l:term _ '-' _ r:factor { $$ = calc_ast_node__create_2(l, r); $$->custom.text = strdup("-"); }
      / e:factor                { $$ = e; }

factor <- l:factor _ '*' _ r:unary { $$ = calc_ast_node__create_2(l, r); $$->custom.text = strdup("*"); }
        / l:factor _ '/' _ r:unary { $$ = calc_ast_node__create_2(l, r); $$->custom.text = strdup("/"); }
        / e:unary                  { $$ = e; }

unary <- '+' _ e:unary { $$ = calc_ast_node__create_1(e); $$->custom.text = strdup("+"); }
       / '-' _ e:unary { $$ = calc_ast_node__create_1(e); $$->custom.text = strdup("-"); }
       / e:primary     { $$ = e; }

primary <- < [0-9]+ >               { $$ = calc_ast_node__create_0(); $$->custom.text = strdup($1); }
         / '(' _ e:expression _ ')' { $$ = e; }

_      <- [ \t]*
EOL    <- '\n' / '\r\n' / '\r' / ';'

%import "code/pcc_ast.peg"   # <-- provides AST build functions

%%
void calc_ast_node_custom_data__initialize(calc_ast_node_custom_data_t *obj) { /* <-- must be implemented when enabling node custom data */
    obj->text = NULL;
}

void calc_ast_node_custom_data__finalize(calc_ast_node_custom_data_t *obj) {   /* <-- must be implemented when enabling node custom data */
    free(obj->text);
}

static void dump_ast(const calc_ast_node_t *obj, int depth) {
    if (obj) {
        switch (obj->type) {
        case CALC_AST_NODE_TYPE_NULLARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "nullary", obj->custom.text);
            break;
        case CALC_AST_NODE_TYPE_UNARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "unary", obj->custom.text);
            dump_ast(obj->data.unary.node, depth + 1);
            break;
        case CALC_AST_NODE_TYPE_BINARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "binary", obj->custom.text);
            dump_ast(obj->data.binary.node[0], depth + 1);
            dump_ast(obj->data.binary.node[1], depth + 1);
            break;
        case CALC_AST_NODE_TYPE_TERNARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "ternary", obj->custom.text);
            dump_ast(obj->data.ternary.node[0], depth + 1);
            dump_ast(obj->data.ternary.node[1], depth + 1);
            dump_ast(obj->data.ternary.node[2], depth + 1);
            break;
        case CALC_AST_NODE_TYPE_VARIADIC:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "variadic", obj->custom.text);
            {
                size_t i;
                for (i = 0; i < obj->data.variadic.len; i++) {
                    dump_ast(obj->data.variadic.node[i], depth + 1);
                }
            }
            break;
        default:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "(unknown)", obj->custom.text);
            break;
        }
    }
    else {
        printf("%*s(null)\n", 2 * depth, "");
    }
}

int main(int argc, char **argv) {
    calc_ast_manager_t mgr;
    calc_ast_manager__initialize(&mgr);
    {
        calc_context_t *ctx = calc_create(&mgr);
        calc_ast_node_t *ast = NULL;
        while (calc_parse(ctx, &ast)) {
            dump_ast(ast, 0);
            calc_ast_node__destroy(ast);
        }
        calc_destroy(ctx);
    }
    calc_ast_manager__finalize(&mgr);
    return 0;
}

The key point is the line %import "code/pcc_ast.peg". The import file code/pcc_ast.peg makes it easier to build ASTs. For more details, see here.

An execution example is as follows.

$ ./ast-calc↵
1+2*(3+4*(5+6))↵
binary: "+"
  nullary: "1"
  binary: "*"
    nullary: "2"
    binary: "+"
      nullary: "3"
      binary: "*"
        nullary: "4"
        binary: "+"
          nullary: "5"
          nullary: "6"
5*6*7*8/(1*2*3*4)↵
binary: "/"
  binary: "*"
    binary: "*"
      binary: "*"
        nullary: "5"
        nullary: "6"
      nullary: "7"
    nullary: "8"
  binary: "*"
    binary: "*"
      binary: "*"
        nullary: "1"
        nullary: "2"
      nullary: "3"
    nullary: "4"

AST Builder for Tiny-C

You can find the more practical example in the directory examples/ast-tinyc. It builds an AST from an input source file written in Tiny-C and dumps the AST.