
How to tokenize correctly when missing whitespace? #726

Open
AndreiCravtov opened this issue Feb 4, 2025 · 4 comments

Comments

@AndreiCravtov

Problem

If I have a grammar for a language like this:

...
<int-literal> ::= ( '+' | '-' )? ('0'-'9')+
<unary>       ::= '-' | '!'
<binary>      ::= '+' | '-'
...

then it's clear that lexing the characters '+' and '-' is ambiguous. So I adopted the maximal-munch resolution strategy: on encountering a plus/minus, try to parse an integer literal first, and otherwise parse the symbol itself.

So for instance if my token type is:

enum Tok {
    IntLiteral(i32),
    Plus,
    Minus,
    ...
}

then 3 + -32 gets tokenized to [IntLiteral(3), Plus, IntLiteral(-32)], which is fine. However, this breaks when whitespace is missing. The grammar of the language is such that this:

3+2
7       # indefinite amount of comments, newlines, and whitespace
      -8

should tokenize to [IntLiteral(3), Plus, IntLiteral(2), IntLiteral(7), Minus, IntLiteral(8)]; however, the maximal-munch strategy produces the (incorrect) token stream [IntLiteral(3), IntLiteral(2), IntLiteral(7), IntLiteral(-8)], and I'm not really sure how to fix it elegantly.
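To make the failure concrete, here is a minimal hand-rolled sketch of the maximal-munch rule (plain Rust, not chumsky; the Tok variants mirror the ones above, and all other tokens are elided):

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    IntLiteral(i32),
    Plus,
    Minus,
}

fn lex(src: &str) -> Vec<Tok> {
    let bytes = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let c = bytes[i] as char;
        if c.is_whitespace() {
            i += 1;
        } else if c == '+' || c == '-' {
            // Maximal munch: a sign immediately followed by digits is one literal.
            if i + 1 < bytes.len() && bytes[i + 1].is_ascii_digit() {
                let start = i;
                i += 1;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                out.push(Tok::IntLiteral(src[start..i].parse().unwrap()));
            } else {
                out.push(if c == '+' { Tok::Plus } else { Tok::Minus });
                i += 1;
            }
        } else if c.is_ascii_digit() {
            let start = i;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            out.push(Tok::IntLiteral(src[start..i].parse().unwrap()));
        } else {
            i += 1; // other tokens elided for brevity
        }
    }
    out
}

fn main() {
    // With whitespace, maximal munch does the right thing:
    assert_eq!(
        lex("3 + -32"),
        vec![Tok::IntLiteral(3), Tok::Plus, Tok::IntLiteral(-32)]
    );
    // Without whitespace, the Plus is swallowed into the second literal:
    assert_eq!(lex("3+2"), vec![Tok::IntLiteral(3), Tok::IntLiteral(2)]);
}
```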

My current solution

Currently my strategy is to define an intermediate token type, along with an extra lexing "pass":

enum UnflattenedToken {
    Token(Token),
    ToFlatten(Vec<Token>)
}

fn unflattened() -> impl Parser<..., Vec<UnflattenedToken>,...> { ... }
fn lexer() -> impl Parser<..., Vec<Token>,...> { ... }

where I create a disambiguation parser inside unflattened that works as follows:

  1. parse an integer literal, then any amount of whitespace and comments, then plus/minus, then integer literal again
  2. this means we have begun the "exception pattern", so map those three to a Vec<Token>
  3. repeatedly parse: any amount of whitespace and comments, then plus/minus, then integer literal again, folding each repetition by pushing the tokens that match to the Vec<Token> accumulator
  4. at the end, map it to UnflattenedToken::ToFlatten

Then the final token parser is a choice parser with the following precedence:

  1. the disambiguation parser as described above
  2. integer literal parser, mapped to UnflattenedToken::Token
  3. plus/minus parser, mapped to UnflattenedToken::Token
  4. all other parsers for other tokens, mapped to UnflattenedToken::Token

and then that choice is padded and repeated to produce a Vec<UnflattenedToken>.

Then the lexer internally calls unflattened to get a Vec<UnflattenedToken> and maps it to Vec<Token> by building up an output vector:

  1. if UnflattenedToken::Token then just push to output vector directly
  2. if UnflattenedToken::ToFlatten then append the whole nested token stream to output vector.

This works, but now I have two lexing passes, and I'm not sure how this all interacts with error recovery or error reporting.
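For reference, the second (flattening) pass boils down to something like this; a sketch with a stand-in Token type, not my actual code:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Int(i32),
    Plus,
    Minus,
}

enum UnflattenedToken {
    Token(Token),
    ToFlatten(Vec<Token>),
}

// Second pass: push single tokens directly, splice nested streams in whole.
fn flatten(input: Vec<UnflattenedToken>) -> Vec<Token> {
    let mut out = Vec::new();
    for t in input {
        match t {
            UnflattenedToken::Token(tok) => out.push(tok),
            UnflattenedToken::ToFlatten(toks) => out.extend(toks),
        }
    }
    out
}

fn main() {
    let input = vec![
        UnflattenedToken::ToFlatten(vec![Token::Int(3), Token::Plus, Token::Int(2)]),
        UnflattenedToken::Token(Token::Minus),
    ];
    assert_eq!(
        flatten(input),
        vec![Token::Int(3), Token::Plus, Token::Int(2), Token::Minus]
    );
}
```

Equivalently, this is `input.into_iter().flat_map(...)` over one-element and many-element batches.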

What would an ideal solution look like?

The most ideal solution would be if I could just parse ahead and decide, based on that, how many output tokens to produce. If the parser comes across the "exception pattern", it can keep consuming and emitting tokens for as long as that pattern is not broken; and if it doesn't come across the pattern, it can just emit one integer literal token as normal.
Then at the end, when the padded token parser is repeated, the final iter-parser could somehow flatten any nested iter-parser output:

Tok1, // token produced by a single-output parser
Tok2, Tok3, Tok4, // nested token stream produced by iter-parser
Tok5,
Tok6,
Tok7,
Tok8, Tok9, Tok10

which gets "flattened" to [Tok1, Tok2, Tok3, Tok4, Tok5, Tok6, Tok7, Tok8, Tok9, Tok10] with some kind of .flatten()? Or perhaps there could be an iter_choice((parser1, parser2, ...)) combinator which accepts both normal parsers and iter-parsers, and produces an iter-parser that flattens the output? I'm not sure. Either way, I could not find anything like this in the current library, and I imagine the underlying implementations of iter-parsers and parsers are different enough that a combinator like this would be hard to implement. So the "ideal" solution sounds like a bit of a no-go for now.
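In plain iterator terms (outside of any parser library), the behaviour I'm hoping for is just this: each "parser invocation" yields a batch of tokens, where a single-output parser yields a one-element batch and the exception pattern yields a longer one, and a single flatten produces the final stream. A toy sketch:

```rust
fn main() {
    // One batch per parser invocation; batch length varies.
    let batches: Vec<Vec<&str>> = vec![
        vec!["Tok1"],                 // single-output parser
        vec!["Tok2", "Tok3", "Tok4"], // exception pattern (nested stream)
        vec!["Tok5"],
        vec!["Tok6"],
        vec!["Tok7"],
        vec!["Tok8", "Tok9", "Tok10"],
    ];
    // A single flatten produces the final token stream.
    let flat: Vec<&str> = batches.into_iter().flatten().collect();
    assert_eq!(flat.len(), 10);
    assert_eq!(flat[1], "Tok2");
}
```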

I also thought about using contexts, because there is clearly some kind of (left-)sensitive context here: whether '+'/'-' begins a literal depends on whether an integer literal was (or wasn't) parsed right before it (modulo whitespace/comments). However, I can't seem to wrap my head around how the current parser-context API works well enough to use it for this. There's got to be some combination of .rewind, .and_is, .configure and .with_ctx that could be cobbled together into a context-sensitive '+'/'-' parser; I just don't know how I would propagate that context from invocation to invocation, since I'm .repeated()-ing a choice parser, and one of the choices is sensitive to what came before it.

Speculations

Or maybe there's a completely different approach to this altogether, such that this is not really an issue? Or maybe my current working approach is the best there is for now (at least for the foreseeable future)?

Any replies would be greatly appreciated.

@zesterer
Owner

zesterer commented Feb 4, 2025

> should tokenize to [IntLiteral(3), Plus, IntLiteral(2), IntLiteral(7), Minus, IntLiteral(8)]

Hmm, should it? This seems like an unintuitive thing for the lexer to do.

Is there a particular reason that you want to sometimes lex -42 as [Int(-42)] instead of [Minus, Int(42)]? Differentiating between prefix and infix operators is already unambiguous when you come to actually perform parsing, and a post-parse pass can always simplify to the former if required.

@AndreiCravtov
Author

Yeah, I agree this is a bit unintuitive. This particular language spec isn't one I control; rather, it's a standard academic toy-language spec designed to test whoever is implementing it with weird parsing edge cases.

The spec requires, as a syntactic constraint, that integer literals fit within a signed (two's complement) 32-bit range. So, for example, -2147483648 should be parsed as [Int(-2147483648)]; parsing it as [Minus, Int(2147483648)] makes the literal appear to exceed the allowed range even though it doesn't.
Conversely, - 2147483648 (with whitespace) should be parsed as [Minus, Int(2147483648)] and a compiler error should be emitted; merging that token stream into [Int(-2147483648)] is incorrect according to the grammar, and the error would be silently suppressed.

So basically, the lexer removes whitespace, making it difficult to determine whether an integer literal conforms to the language spec. I guess the lexer could include whitespace/comment tokens in the output token stream? Though that would mean the parser has to deal with whitespace as well, which might be a downside.

Overall, the current solution I have works, but I have a hunch there is a better method/pattern to achieve this with this library; I just haven't come across it yet.

@Zij-IT
Contributor

Zij-IT commented Feb 4, 2025

In this case, allow a leading - whenever you parse a number. Then, whenever you do parse the leading -, try_map it to validate that the minus can be applied, and if not, emit an error.
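A sketch of that validation step (my reading of the suggestion, with a plain function standing in for the try_map closure; not chumsky code):

```rust
// Validate a signed lexeme against the i32 range. Parsing through i64 lets
// both boundary cases be checked exactly: "-2147483648" is i32::MIN and is
// accepted, while "2147483648" exceeds i32::MAX and is rejected.
fn validate_int_literal(lexeme: &str) -> Result<i32, String> {
    let value: i64 = lexeme
        .parse()
        .map_err(|_| format!("`{lexeme}` is not an integer"))?;
    i32::try_from(value).map_err(|_| format!("`{lexeme}` is out of i32 range"))
}

fn main() {
    assert_eq!(validate_int_literal("-2147483648"), Ok(i32::MIN));
    assert!(validate_int_literal("2147483648").is_err());
    assert_eq!(validate_int_literal("42"), Ok(42));
}
```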

@AndreiCravtov
Author

I'm a little confused about which stage I should parse a leading - in. If it's the lexing stage, then that is the topic of this question: I have a working solution, but I think it is somewhat convoluted.

However, if it's in the parsing stage, e.g. [Minus, Int(2)] => IntLiteral(-2), then the problem arises that I don't know whether there was whitespace between those two tokens. If there was whitespace, it should parse to Unary(Minus, Int(2)); if there wasn't, it should parse to IntLiteral(-2). The only way for the parser to know is for the lexer to additionally emit whitespace tokens.
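For what it's worth, the whitespace-token idea in that last sentence might look like this (a toy sketch with hypothetical Tok/Expr types, not the actual parser):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Minus,
    Int(i32),
    Whitespace,
}

#[derive(Debug, PartialEq)]
enum Expr {
    IntLiteral(i32),
    // Simplified stand-in for Unary(Minus, Int(n)).
    NegateUnary(i32),
}

// With whitespace tokens kept in the stream, the parser can tell `-2`
// (adjacent: part of the literal) from `- 2` (separated: unary operator).
fn parse_prefix(toks: &[Tok]) -> Option<Expr> {
    match toks {
        [Tok::Minus, Tok::Int(n), ..] => Some(Expr::IntLiteral(-*n)),
        [Tok::Minus, Tok::Whitespace, Tok::Int(n), ..] => Some(Expr::NegateUnary(*n)),
        [Tok::Int(n), ..] => Some(Expr::IntLiteral(*n)),
        _ => None,
    }
}

fn main() {
    assert_eq!(
        parse_prefix(&[Tok::Minus, Tok::Int(2)]),
        Some(Expr::IntLiteral(-2))
    );
    assert_eq!(
        parse_prefix(&[Tok::Minus, Tok::Whitespace, Tok::Int(2)]),
        Some(Expr::NegateUnary(2))
    );
}
```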
