How to tokenize correctly when missing whitespace? #726
Comments
Hmm, should it? This seems like an unintuitive thing for the lexer to do. Is there a particular reason that you want to sometimes lex …
Yeah, I agree this is a bit unintuitive. This particular language spec isn't one I control; rather, it's a standard academic toy language spec designed to test whoever is implementing it with weird parsing edge cases. The spec requires that integer literals fit within signed (two's complement) 32 bits, as a syntactic requirement. So for example, if …

So basically, the lexer removes whitespace, making it difficult to determine whether an integer literal does or doesn't conform to the language spec. I guess the lexer could include whitespace/comment tokens in the output token stream? Though that would mean the parser has to deal with whitespace as well, which might be a downside. Overall, the current solution I have works, but I have a hunch there is a better method/pattern to achieve this with this library; I just haven't come across it yet.
In this case, whenever you parse a number, allow there to be a leading `-`.
I'm a little bit confused about which stage I should parse a leading `-` in. However, if it's in the parsing stage, e.g. …
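The parser-side approach suggested above can be sketched in plain Rust (this is a hand-rolled illustration, not the chumsky API; `literal_in_range` is a hypothetical helper): lex only unsigned digit runs, and let the parser apply any leading sign before checking the two's-complement 32-bit range, so that -2147483648 is accepted even though 2147483648 on its own is out of range.

```rust
// Parser-side sign handling (a hand-rolled sketch, not the chumsky API):
// the lexer yields unsigned digit runs; the parser applies a leading '-'
// and only then checks the signed 32-bit range.
fn literal_in_range(negative: bool, digits: &str) -> Option<i32> {
    // Parse into a wider type first so the magnitude check cannot overflow
    // for any in-range literal.
    let magnitude: i64 = digits.parse().ok()?;
    let value = if negative { -magnitude } else { magnitude };
    i32::try_from(value).ok()
}
```

The payoff is the boundary case: the magnitude 2147483648 is only legal when the parser has seen a leading minus.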
Problem
If I have a grammar for a language like this:
then it's clear that lexing the characters '+' and '-' can be ambiguous. So I adopted the maximal munch resolution strategy: on encountering a plus/minus, try to parse an integer literal first; otherwise parse it as the symbol itself.
So for instance if my token type is:
then
3 + -32
gets tokenized to `[IntLiteral(3), Plus, IntLiteral(-32)]`, which is fine. However, this breaks when the whitespace is missing. The grammar of the language is such that this: … should tokenize to
`[IntLiteral(3), Plus, IntLiteral(2), IntLiteral(7), Minus, IntLiteral(8)]`
however the maximal munch strategy produces the (incorrect) token stream
`[IntLiteral(3), IntLiteral(2), IntLiteral(7), IntLiteral(-8)]`
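For concreteness, here is a hand-rolled sketch of the maximal-munch behaviour described above (plain Rust rather than chumsky; the input `3+2 7-8` is an assumed example consistent with the token streams quoted in this issue):

```rust
// Hand-rolled maximal munch (a sketch, not the chumsky implementation):
// on '+'/'-', if digits follow, greedily consume a signed integer literal;
// otherwise emit the bare operator token.
#[derive(Debug, PartialEq)]
enum Token {
    IntLiteral(i32),
    Plus,
    Minus,
}

fn lex_maximal_munch(src: &str) -> Vec<Token> {
    let bytes = src.as_bytes();
    let mut i = 0;
    let mut out = Vec::new();
    while i < bytes.len() {
        let c = bytes[i] as char;
        if c.is_ascii_whitespace() {
            i += 1;
        } else if c == '+' || c == '-' {
            if i + 1 < bytes.len() && bytes[i + 1].is_ascii_digit() {
                // Maximal munch: sign plus digits becomes one literal.
                let start = i;
                i += 1;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                out.push(Token::IntLiteral(src[start..i].parse().unwrap()));
            } else {
                out.push(if c == '+' { Token::Plus } else { Token::Minus });
                i += 1;
            }
        } else if c.is_ascii_digit() {
            let start = i;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            out.push(Token::IntLiteral(src[start..i].parse().unwrap()));
        } else {
            panic!("unexpected character {c:?}");
        }
    }
    out
}
```

With whitespace present (`3 + -32`) this yields the intended stream, but on `3+2 7-8` the `+2` and `-8` are swallowed as literals, reproducing the incorrect stream above.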
and I'm not really sure how to fix this elegantly.

My current solution
Currently my strategy is to define an intermediate token type, along with an extra lexing "pass":
where I create a disambiguation parser inside `unflattened` that works as follows: it re-lexes an ambiguous run into a `Vec<Token>` accumulator and emits it as an `UnflattenedToken::ToFlatten`.
Then the final token parser is a choice parser whose branches, in order of precedence, each produce an `UnflattenedToken` (the disambiguating branch yielding `ToFlatten`, the ordinary branches yielding `Token`), and that is padded and repeated to produce a `Vec<UnflattenedToken>`.
Then the `lexer` internally calls `unflattened` to get a `Vec<UnflattenedToken>` and maps it to a `Vec<Token>` by building up an output vector: for an `UnflattenedToken::Token`, just push to the output vector directly; for an `UnflattenedToken::ToFlatten`, append the whole nested token stream to the output vector.

This works, but now I have two lexing passes, and I'm not sure how this all interacts with error recovery or error reporting.
Ideally, the solution would be??
The most ideal solution would be if I could just parse ahead and decide, based on that, how many output tokens I want to produce. So if the parser comes across the "exception pattern", it can keep consuming and emitting tokens for as long as that pattern is not broken. And if it doesn't come across that pattern, it can just emit one integer literal token as normal.
And at the end, when the padded token parser is repeated, somehow the final iter-parser could flatten any nested iter-parser output, which gets "flattened" to

[Tok1, Tok2, Tok3, Tok4, Tok5, Tok6, Tok7, Tok8, Tok9, Tok10]

with some kind of `.flatten()`? Or perhaps there could be something like an `iter_choice((parser1, parser2, ...))`
which accepts normal parsers and iter-parsers, and produces an iter-parser that flattens all of this? Not sure. Either way, I could not find anything that looks like this in the current library, and I imagine the underlying implementations of iter-parsers and parsers are very different, making a combinator like this hard to implement. So the "ideal" solution sounds like a bit of a no-go for now.

I also thought about using contexts, because I notice there is clearly some kind of left-sensitive context here. Namely, the lexing of '+'/'-' depends on whether an integer literal was (or wasn't) parsed right before it (modulo whitespace/comments). However, I just can't seem to wrap my head around how the current parser-context API works well enough to achieve this with it. There's gotta be some combination of
`.rewind`, `.and_is`, `.configure`, and `.with_ctx` that can cobble together a context-sensitive '+'/'-' parser. I just don't know how I would propagate that context from invocation to invocation, since I'm `.repeat`-ing a `choice` parser, and one of the choices is sensitive to what came before it.

Speculations
Or maybe there's a completely different approach to this altogether, such that this is not really an issue? Or maybe my current working approach is the best there is for now (at least for the foreseeable future)?
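One classic pattern for exactly this kind of left-context, used for instance by JavaScript lexers to decide whether `/` starts a regex or means division, is a single bit of lexer state: did the previous token end a value? A hand-rolled sketch (plain Rust, not chumsky; the inputs are assumed examples from this issue) of a single-pass lexer using that flag:

```rust
// Single-pass, context-sensitive sketch (not the chumsky API): '+'/'-' may
// begin a signed literal only when the previous token was NOT a value.
// Skipped whitespace deliberately does not reset the flag.
#[derive(Debug, PartialEq)]
enum Token {
    IntLiteral(i32),
    Plus,
    Minus,
}

fn lex_stateful(src: &str) -> Vec<Token> {
    let bytes = src.as_bytes();
    let mut i = 0;
    let mut out = Vec::new();
    let mut prev_was_value = false; // the one bit of left-context
    while i < bytes.len() {
        let c = bytes[i] as char;
        if c.is_ascii_whitespace() {
            i += 1;
        } else if (c == '+' || c == '-')
            && !prev_was_value
            && i + 1 < bytes.len()
            && bytes[i + 1].is_ascii_digit()
        {
            // Signed-literal position: consume the sign and the digits.
            let start = i;
            i += 1;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            out.push(Token::IntLiteral(src[start..i].parse().unwrap()));
            prev_was_value = true;
        } else if c == '+' || c == '-' {
            // Operator position: a value just ended to our left.
            out.push(if c == '+' { Token::Plus } else { Token::Minus });
            prev_was_value = false;
            i += 1;
        } else if c.is_ascii_digit() {
            let start = i;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            out.push(Token::IntLiteral(src[start..i].parse().unwrap()));
            prev_was_value = true;
        } else {
            panic!("unexpected character {c:?}");
        }
    }
    out
}
```

On `3+2 7-8` this produces the stream the spec wants, and `3 + -32` still lexes the `-32` as one literal, because the `+` reset the flag.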
Any replies would be greatly appreciated.