Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Earley parser produces non-deterministic results even with explicit terminal weights #201

Closed
shawnanastasio opened this issue Jul 31, 2018 · 7 comments
Milestone

Comments

@shawnanastasio
Copy link

Perhaps related to #191

When using the earley parser with weighted terminals, the result is still non-deterministic in certain cases.

Example:

from lark import Lark

grammar = """
root: A | B

A.1: ("0".."9")+
B.2: ("0".."9")+

%import common.WS
%ignore WS
"""

parser = Lark(grammar, start="root", parser="earley")
print(parser.parse("10"))

It is my understanding that using weights was a workaround for this non-determinism (as indicated in #167), but this does not seem to be the case.

I am using the latest lark version available on pip (0.6.2) on Fedora Linux 27 x86_64.

@erezsh
Copy link
Member

erezsh commented Jul 31, 2018

Earley currently does not support weights on terminals. But it should work as you expect it to with weights on rules. Such as:

root: a | b 

a.1: ("0".."9")+
b.2: ("0".."9")+

This is probably not the best behavior. Hopefully I'll get the chance to fix it soon.

@shawnanastasio
Copy link
Author

I see. Perhaps you could help me work around this issue with a project I'm working on.

In the project I have terminals similar to the following for unquoted strings, decimal numbers, and hexadecimal numbers:

root: STR | DEC | HEX

STR: /\S/+
DEC: ("0".."9")+
HEX: ("0x" ("0".."9" | "A".."F" | "a".."f")+)

I'd like to be able to have things like "10" always parse as a decimal and "0xFF" as a hexadecimal, but the non-deterministic behavior is causing them to sometimes be interpreted as strings.

Is there an easy way I can refactor this to work around the unexpected behavior?

@erezsh
Copy link
Member

erezsh commented Jul 31, 2018

Of course. Just add intermediate rules.

root: STR | DEC | hex

hex.2: HEX

etc.

Hope this helps!

@shawnanastasio
Copy link
Author

Thanks for the advice. It seems to work in some cases but not in others.

For example, with the grammar and text below, the hexadecimal literal seems to be interpreted as a string unless I remove the string terminal entirely, in which case it is interpreted as a hexadecimal.

from lark import Lark

grammar = """
root: "BEGIN" statement+ "END"
statement: assignment

assignment: IDENTIFIER "=" rvalue
?rvalue: decimal
      | hexadecimal
      | STRING

IDENTIFIER: ("A".."z" | "0".."9")+

// basic types
STRING: /\S/+
DECIMAL: ("0".."9")+
HEXADECIMAL: ("0x" ("0".."9" | "A".."F" | "a".."f")+)

// wrappers for ambiguous types above
decimal.2: DECIMAL
hexadecimal.2: HEXADECIMAL

%import common.WS
%ignore WS
"""

text = """
BEGIN
    MY_VAR = 0xDEADBEEF
END
"""

parser = Lark(grammar, start="root", parser="earley")
print(parser.parse(text))

I'm not sure if this is an issue with my grammar or with the parser. The lalr parser is able to produce the correct behavior every time, but it is unfortunately unsuitable for the project I'm currently working on.

Any assistance would be greatly appreciated.

@erezsh
Copy link
Member

erezsh commented Aug 1, 2018

Hmm yeah, this is an actual bug with the current implementation.

I'm currently working on a new earley implementation that should solve this problem. You can check out the branch earley_sppf and try it for yourself. It's still under work, but it's fairly stable. Let me know how it works out for you!

@shawnanastasio
Copy link
Author

I was able to switch over to lalr for the time being. Thank you very much for your help.

If you feel that this issue falls under the scope of #191, please feel free to close it.

@erezsh
Copy link
Member

erezsh commented Mar 28, 2019

0.7 published

@erezsh erezsh closed this as completed Mar 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants