Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symbol tables #31

Open
ChrisHixon opened this issue Jul 7, 2022 · 16 comments
Open

Symbol tables #31

ChrisHixon opened this issue Jul 7, 2022 · 16 comments

Comments

@ChrisHixon
Copy link
Owner

Proposed syntax for symbol table extension

# chpeg proposed symbol table extension grammar (work in progress...)

# Hierarchical syntax
Grammar     {S} <- S (Definition / TableInit)+ !.
Definition      <- Identifier S
                    Scopes?
                    ('{' S Options S '}' S)?
                    '<-' S Choice
Choice          <- Sequence ('/' S Sequence)*
Sequence        <- Element+
Element         <- Predicate / Trim
Predicate       <- (PredOp S)? Repeat
Repeat          <- Primary (RepOp S / '{' S MinMax S '}' S )?
Primary         <- Identifier S !Keyword !('{' S ![0-9]) !'<-'
                 / '(' S Choice ')' S
                 / Reference S
                 / Mark S
                 / TableDecl S
                 / TableRef S
                 / Literal S
                 / CharClass S
                 / NCharClass S
                 / Dot S

# SYM extension
TableInit       <- '%' Identifier S '=' S Literal S (Literal S)*
Scopes          <- ('SCOPES' S '(' S '%' Identifier S (',' S '%' Identifier S)* ')' ) S
TableDecl   {S} <- '%' Identifier '<' S Choice '>'
TableRef    {S} <- '%' Identifier !(S '=')

# REFS extension
Mark        {S} <- '$' Identifier '<' S Choice '>'
Reference   {S} <- '$' Identifier !'<'

# TRIM extension
Trim        {S} <- '<' S Choice '>' S

# MINMAX extension
MinMax      {S} <- MinMaxVal (S Comma S MinMaxVal?)?
                 / Comma S MinMaxVal
MinMaxVal   {L} <- [1-9][0-9]* / '0'

# Lexical syntax
Keyword         <- 'SCOPES'
Options         <- [CSILTPR]+
Identifier  {L} <- !Keyword [a-zA-Z_][a-zA-Z_0-9]*
Literal     {S} <- ['] (!['] Char)* [']
                 / ["] (!["] Char)* ["]
CharClass   {S} <- '[' !'^' (!']' CharRange)+ ']'
NCharClass  {S} <- '[^' (!']' CharRange)+ ']'
CharRange       <- Char '-' !']' Char / Char
Char            <- PlainChar / HexChar / EscChar / OctChar
EscChar     {L} <- '\\' [ntr'"\[\]\\fvab]
OctChar     {L} <- '\\' [0-3][0-7][0-7]
                 / '\\' [0-7][0-7]?
HexChar     {L} <- '\\x' [0-9a-fA-F][0-9a-fA-F]?
PlainChar   {L} <- !'\\' .
PredOp      {L} <- [&!]
RepOp       {L} <- [*+?]
Dot         {L} <- '.'
Comma       {L} <- ','

S           {I} <- ([ \t\r\n]+ / '#' (![\r\n] .)* [\r\n]? )*

Usage example:

Start <- _ (Ref / Decl)+ !.
Block SCOPES(%table) <- '{ ... }' # create a new lexical scope for %table (see Scopes)
Decl <- 'decl' %table<[a-z]+> _   # declare a table value (see TableDecl)
Ref <- 'ref' %table _             # match a value in %table (see TableRef)
%table = 'foo' 'bar'              # initialize a symbol table (see TableInit)
_ <- [ \t\n\r]*

# decl foo # Error: 'foo' already declared

# ref foo  # ok

# decl a   # ok (%table = 'foo' 'bar' 'a')
# ref a    # ok
# ref b    # Error: 'b' is not declared
@ChrisHixon
Copy link
Owner Author

@mingodad, any thoughts on this topic?

@ChrisHixon
Copy link
Owner Author

Note that since symbol tables can be initialized, they can be used as a 'dictionary' as well, for keyword lists, etc.; use anywhere where you want to match a list of literals more efficiently than listing them out as choices: init: %table = 'foo' 'bar', match: %table; rather than table <- 'foo' / 'bar' and match table.

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

I need some time to digest it but at first it seems that it'll cover a lot of use cases.

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

It seems that now there is only one literal keyword in the proposed grammar SCOPE, somehow it come to my mind that maybe a different way to add new extensions without need a bare literal/keyword can be useful.
Like now that the grammar will receive several improvements we could also add the inline AST label/name with such a general mechanism.
Also the cut operator and label for error handling like in lpeglabel https://github.com/sqmedeiros/lpeglabel/#relabel-syntax .

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

A full working grammar with good error recovering like in lpeglabel will end up awful but at the end somehow to use the grammar we'll need that functionality, with a tool to extract naked grammar other purposes like railroad diagram/debug/...

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

And just playing around I add a terminator (; but any other can be used) for definitions to eliminate the ambiguity of when it ends.

Definition      <- Identifier S
                    Scopes?
                    ('{' S Options S '}' S)?
                    '<-' S Choice ';' S

Then Primary:

Primary         <- Identifier S !Keyword #<< not needed anymore !('{' S ![0-9]) !'<-'

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

Here is the proposed grammar with definition terminator and SCOPE replaced by @:

# chpeg proposed symbol table extension grammar (work in progress...)

# Hierarchical syntax
Grammar     {S} <- S (Definition / TableInit)+ !.
Definition      <- Identifier S
                    Scopes?
                    ('{' S Options S '}' S)?
                    '<-' S Choice Terminator S
Choice          <- Sequence ('/' S Sequence)*
Sequence        <- Element+
Element         <- Predicate / Trim
Predicate       <- (PredOp S)? Repeat
Repeat          <- Primary (RepOp S / '{' S MinMax S '}' S )?
Primary         <- Identifier S
                 / '(' S Choice ')' S
                 / Reference S
                 / Mark S
                 / TableDecl S
                 / TableRef S
                 / Literal S
                 / CharClass S
                 / NCharClass S
                 / Dot S

# SYM extension
TableInit       <- '%' Identifier S '=' S Literal S (Literal S)* Terminator S
Scopes          <- ('@' S '(' S '%' Identifier S (',' S '%' Identifier S)* ')' ) S
TableDecl   {S} <- '%' Identifier '<' S Choice '>'
TableRef    {S} <- '%' Identifier !(S '=')

# REFS extension
Mark        {S} <- '$' Identifier '<' S Choice '>'
Reference   {S} <- '$' Identifier !'<'

# TRIM extension
Trim        {S} <- '<' S Choice '>' S

# MINMAX extension
MinMax      {S} <- MinMaxVal (S Comma S MinMaxVal?)?
                 / Comma S MinMaxVal
MinMaxVal   {L} <- [1-9][0-9]* / '0'

# Lexical syntax
Options         <- [CSILTPR]+
Identifier  {L} <- [a-zA-Z_][a-zA-Z_0-9]*
Literal     {S} <- ['] (!['] Char)* [']
                 / ["] (!["] Char)* ["]
CharClass   {S} <- '[' !'^' (!']' CharRange)+ ']'
NCharClass  {S} <- '[^' (!']' CharRange)+ ']'
CharRange       <- Char '-' !']' Char / Char
Char            <- PlainChar / HexChar / EscChar / OctChar
EscChar     {L} <- '\\' [ntr'"\[\]\\fvab]
OctChar     {L} <- '\\' [0-3][0-7][0-7]
                 / '\\' [0-7][0-7]?
HexChar     {L} <- '\\x' [0-9a-fA-F][0-9a-fA-F]?
PlainChar   {L} <- !'\\' .
PredOp      {L} <- [&!]
RepOp       {L} <- [*+?]
Dot         {L} <- '.'
Comma       {L} <- ','
Terminator  {L} <- ';'

S           {I} <- ([ \t\r\n]+ / '#' (![\r\n] .)* [\r\n]? )*

And here the sample usage:

Start <- _ (Ref / Decl)+ !. ;
Block @(%table) <- '{ ... }' ; # create a new lexical scope for %table (see Scopes)
Decl <- 'decl' %table<[a-z]+> _ ;  # declare a table value (see TableDecl)
Ref <- 'ref' %table _ ;            # match a value in %table (see TableRef)
%table = 'foo' 'bar' ;             # initialize a symbol table (see TableInit)
_ <- [ \t\n\r]* ;

# decl foo # Error: 'foo' already declared

# ref foo  # ok

# decl a   # ok (%table = 'foo' 'bar' 'a')
# ref a    # ok
# ref b    # Error: 'b' is not declared

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

It's not clear to me the SCOPES concept that you are using, what it mean ?
Create a copy of %table identifiers and use then only for that definition and destroyed after it ?

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

Another common construction like Options <- [CSILTPR]+ meaning one or more unique, is encountered in grammars and if we could have a way to specify it at the grammar level would be nice (https://github.com/Lercher/CocoR#useonce-and-useall-scopes-for-symbols or something similar)

@ChrisHixon
Copy link
Owner Author

It's not clear to me the SCOPES concept that you are using, what it mean ? Create a copy of %table identifiers and use then only for that definition and destroyed after it ?

Lexical scopes, the same idea as in SCOPES in https://github.com/Lercher/CocoR#scoped-symbols

I like the ideas there but not sure I like how everything is defined in their syntax. They have this on a rule:

// Coco/R grammar
Production = ident [FormalAttributes] [ScopesDecl] [ UseOnceDecl ] [ UseAllDecl ] [LocalDecl] '=' Expression '.'.
ScopesDecl  = "SCOPES"  '(' ident { ',' ident } ')' .
UseOnceDecl = "USEONCE" '(' ident { ',' ident } ')' .
UseAllDecl  = "USEALL"  '(' ident { ',' ident } ')' .

And then another area where they initialize and set if a table is strict or not (which I suppose could be added on to chpeg's TableInit.) Why is that part on the table init/definition, but USEONCE and USEALL are on the rule definition? STRICT to me seems like something similar to those two concepts; why would it be global for a table but USEONCE and USEALL are local to a rule?

I figured I would add this kind of stuff later but perhaps the concepts and syntax should be ironed out first.

@ChrisHixon
Copy link
Owner Author

ChrisHixon commented Jul 7, 2022

And just playing around I add a terminator (; but any other can be used) for definitions to eliminate the ambiguity of when it ends.

Definition      <- Identifier S
                    Scopes?
                    ('{' S Options S '}' S)?
                    '<-' S Choice ';' S

Then Primary:

Primary         <- Identifier S !Keyword #<< not needed anymore !('{' S ![0-9]) !'<-'

I wish PEG had a terminator in the first place... I'd rather not break PEG compatibility like this though.

Another thought I had was to put every pre-<- rule option into Options, so inside { .. }...

Block {SCOPES(%table) S} <- '{' ... '}' # 'S' here is the S for Stop, original Options.

Which would simplify that Identifier check to something like:

Primary         <- Identifier S !('{' S [A-Za-z]) !'<-'
# or maybe
Primary         <- Identifier S !('{' S ![,0-9]) !'<-'

Somehow, it has to differentiate the rule Options in {} from MinMax {} syntax. I just realized that Comma has to be in there as it could be first char.

@ChrisHixon
Copy link
Owner Author

It seems that now there is only one literal keyword in the proposed grammar SCOPE, somehow it come to my mind that maybe a different way to add new extensions without need a bare literal/keyword can be useful. Like now that the grammar will receive several improvements we could also add the inline AST label/name with such a general mechanism. Also the cut operator and label for error handling like in lpeglabel https://github.com/sqmedeiros/lpeglabel/#relabel-syntax .

Not sure I follow completely, and I get a bit lost looking at LPEG syntax (not really familiar with it). Those are other extensions to consider, but I think the main syntax focus/issue here is related to extended rule-level options on the left of <-.

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

And that's why I gave an example of replacing SCOPE by @ (or any other symbol instead of a keyword) :

Scopes          <- ('@' S '(' S '%' Identifier S (',' S '%' Identifier S)* ')' ) S

@ChrisHixon
Copy link
Owner Author

ChrisHixon commented Jul 7, 2022

It's not clear to me the SCOPES concept that you are using, what it mean ? Create a copy of %table identifiers and use then only for that definition and destroyed after it ?

A new lexical scope would: create an empty symbol table for that scope; on set it would set the value in the nearest scope. On lookup it would search the nearest scope and then search outer scopes, nearest first. At least this is how I understand it, and how I implemented the similar concept of reference scopes ({R} option.)

To further expand on this, symbol tables will not be destroyed, except on backtracking in a way that 'undoes' them. This is the nature of PEG and doing semantic/context-sensitive things-mid processing (vs waiting until parse complete). I believe that 'destroyed' explanation applies to their LL(k) implementation. Like ref-scopes (which store the back-reference values), I plan to place the symbol table storage inside the parse-tree nodes to solve many issues and make it packrat compatible, issues I've discussed previously in a few places. Packrat will probably make sense to not undo/repeat a bunch of symbol table work on backtracking, but this really depends on grammar. It seems like the grammars like C that would use such symbol table features do a lot of backtracking.

@ChrisHixon
Copy link
Owner Author

I just thought of something else related to backtracking: are symbol table related errors like parse errors, i.e. not actually errors until backtracking tries all options? I'm thinking it is probably possible to cause a symbol table error (e.g. "'foo' not defined") that is actually backtracked and discarded. Sometimes PEG is maddening.

@mingodad
Copy link
Contributor

mingodad commented Jul 7, 2022

As you've said the important thing here is to find a proper way to describe then in the grammar but it would be interesting to play with some initial implementation to test with a few grammars and see how it feel/work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants