
How to tokenize correctly when missing whitespace? #726

Open
AndreiCravtov opened this issue Feb 4, 2025 · 4 comments

Comments

@AndreiCravtov

Problem

If I have a grammar for a language like this:

...
<int-literal> ::= ( '+' | '-' )? ('0'-'9')+
<unary>       ::= '-' | '!'
<binary>      ::= '+' | '-'
...

then it's clear that lexing the characters '+' and '-' is ambiguous. So I adopted the maximal-munch resolution strategy: on encountering a plus/minus, try to parse an integer literal first, and otherwise parse the symbol itself.

So for instance if my token type is:

enum Tok {
    IntLiteral(i32),
    Plus,
    Minus,
    ...
}

then 3 + -32 gets tokenized to [IntLiteral(3), Plus, IntLiteral(-32)], which is fine. However, this breaks when whitespace is missing. The grammar of the language is such that this:

3+2
7       # indefinite amount of comments, newlines, and whitespace
      -8

should tokenize to [IntLiteral(3), Plus, IntLiteral(2), IntLiteral(7), Minus, IntLiteral(8)]; however, the maximal-munch strategy produces the (incorrect) token stream [IntLiteral(3), IntLiteral(2), IntLiteral(7), IntLiteral(-8)], and I'm not really sure how to fix it elegantly.
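To make the failure concrete, here is a minimal hand-rolled sketch of the maximal-munch rule (plain Rust, not chumsky; the Tok variants mirror the ones above, and all other tokens are elided):

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    IntLiteral(i32),
    Plus,
    Minus,
}

fn lex(src: &str) -> Vec<Tok> {
    let bytes = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let c = bytes[i] as char;
        if c.is_whitespace() {
            i += 1;
        } else if c == '+' || c == '-' {
            // Maximal munch: a sign immediately followed by digits is one literal.
            if i + 1 < bytes.len() && bytes[i + 1].is_ascii_digit() {
                let start = i;
                i += 1;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                out.push(Tok::IntLiteral(src[start..i].parse().unwrap()));
            } else {
                out.push(if c == '+' { Tok::Plus } else { Tok::Minus });
                i += 1;
            }
        } else if c.is_ascii_digit() {
            let start = i;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            out.push(Tok::IntLiteral(src[start..i].parse().unwrap()));
        } else {
            i += 1; // other tokens elided for brevity
        }
    }
    out
}

fn main() {
    // With whitespace, maximal munch does the right thing:
    assert_eq!(
        lex("3 + -32"),
        vec![Tok::IntLiteral(3), Tok::Plus, Tok::IntLiteral(-32)]
    );
    // Without whitespace, the Plus is swallowed into the second literal:
    assert_eq!(lex("3+2"), vec![Tok::IntLiteral(3), Tok::IntLiteral(2)]);
}
```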

My current solution

Currently my strategy is to define an intermediate token type, along with an extra lexing "pass":

enum UnflattenedToken {
    Token(Token),
    ToFlatten(Vec<Token>)
}

fn unflattened() -> impl Parser<..., Vec<UnflattenedToken>,...> { ... }
fn lexer() -> impl Parser<..., Vec<Token>,...> { ... }

where I create a disambiguation parser inside unflattened that works as follows:

  1. parse an integer literal, then any amount of whitespace and comments, then plus/minus, then integer literal again
  2. this means we have begun the "exception pattern", so map those three to a Vec<Token>
  3. repeatedly parse: any amount of whitespace and comments, then plus/minus, then integer literal again, folding each repetition by pushing the tokens that match to the Vec<Token> accumulator
  4. at the end, map it to UnflattenedToken::ToFlatten

Then the final token parser is a choice parser with the following precedence:

  1. the disambiguation parser as described above
  2. integer literal parser, mapped to UnflattenedToken::Token
  3. plus/minus parser, mapped to UnflattenedToken::Token
  4. all other parsers for other tokens, mapped to UnflattenedToken::Token

and then that choice is padded and repeated to produce a Vec<UnflattenedToken>.

Then the lexer internally calls unflattened to get a Vec<UnflattenedToken> and maps it to Vec<Token> by building up an output vector:

  1. if UnflattenedToken::Token then just push to output vector directly
  2. if UnflattenedToken::ToFlatten then append the whole nested token stream to output vector.

This works, but now I have two lexing passes, and I'm not sure how this all interacts with error recovery or error reporting.
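For reference, the second (flattening) pass boils down to something like this; a sketch with a stand-in Token type, not my actual code:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Int(i32),
    Plus,
    Minus,
}

enum UnflattenedToken {
    Token(Token),
    ToFlatten(Vec<Token>),
}

// Second pass: push single tokens directly, splice nested streams in whole.
fn flatten(input: Vec<UnflattenedToken>) -> Vec<Token> {
    let mut out = Vec::new();
    for t in input {
        match t {
            UnflattenedToken::Token(tok) => out.push(tok),
            UnflattenedToken::ToFlatten(toks) => out.extend(toks),
        }
    }
    out
}

fn main() {
    let input = vec![
        UnflattenedToken::ToFlatten(vec![Token::Int(3), Token::Plus, Token::Int(2)]),
        UnflattenedToken::Token(Token::Minus),
    ];
    assert_eq!(
        flatten(input),
        vec![Token::Int(3), Token::Plus, Token::Int(2), Token::Minus]
    );
}
```

Equivalently, this is `input.into_iter().flat_map(...)` over one-element and many-element batches.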

What would an ideal solution look like?

The most ideal solution would be if I could just parse ahead and decide, based on that, how many output tokens to produce. If the parser comes across the "exception pattern", it can keep consuming and emitting tokens for as long as that pattern is not broken; and if it doesn't come across the pattern, it can just emit one integer literal token as normal.
Then at the end, when the padded token parser is repeated, the final iter-parser could somehow flatten any nested iter-parser output:

Tok1, // token produced by a single-output parser
Tok2, Tok3, Tok4, // nested token stream produced by iter-parser
Tok5,
Tok6,
Tok7,
Tok8, Tok9, Tok10

which gets "flattened" to [Tok1, Tok2, Tok3, Tok4, Tok5, Tok6, Tok7, Tok8, Tok9, Tok10] with some kind of .flatten()? Or perhaps there could be an iter_choice((parser1, parser2, ...)) combinator which accepts both normal parsers and iter-parsers, and produces an iter-parser that flattens the output? I'm not sure. Either way, I could not find anything like this in the current library, and I imagine the underlying implementations of iter-parsers and parsers are different enough that a combinator like this would be hard to implement. So the "ideal" solution sounds like a bit of a no-go for now.
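In plain iterator terms (outside of any parser library), the behaviour I'm hoping for is just this: each "parser invocation" yields a batch of tokens, where a single-output parser yields a one-element batch and the exception pattern yields a longer one, and a single flatten produces the final stream. A toy sketch:

```rust
fn main() {
    // One batch per parser invocation; batch length varies.
    let batches: Vec<Vec<&str>> = vec![
        vec!["Tok1"],                 // single-output parser
        vec!["Tok2", "Tok3", "Tok4"], // exception pattern (nested stream)
        vec!["Tok5"],
        vec!["Tok6"],
        vec!["Tok7"],
        vec!["Tok8", "Tok9", "Tok10"],
    ];
    // A single flatten produces the final token stream.
    let flat: Vec<&str> = batches.into_iter().flatten().collect();
    assert_eq!(flat.len(), 10);
    assert_eq!(flat[1], "Tok2");
}
```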

I also thought about using contexts, because there is clearly some kind of (left-)sensitive context here: whether '+'/'-' begins a literal depends on whether an integer literal was (or wasn't) parsed right before it (modulo whitespace/comments). However, I can't seem to wrap my head around how the current parser-context API works well enough to use it for this. There's got to be some combination of .rewind, .and_is, .configure and .with_ctx that could be cobbled together into a context-sensitive '+'/'-' parser; I just don't know how I would propagate that context from invocation to invocation, since I'm .repeated()-ing a choice parser, and one of the choices is sensitive to what came before it.

Speculations

Or maybe there's a completely different approach to this altogether, such that this is not really an issue? Or maybe my current working approach is the best there is for now (at least for the foreseeable future)?

Any replies would be greatly appreciated.

@zesterer
Owner

zesterer commented Feb 4, 2025

> should tokenize to [IntLiteral(3), Plus, IntLiteral(2), IntLiteral(7), Minus, IntLiteral(8)]

Hmm, should it? This seems like an unintuitive thing for the lexer to do.

Is there a particular reason that you want to sometimes lex -42 as [Int(-42)] instead of [Minus, Int(42)]? Differentiating between prefix and infix operators is already unambiguous when you come to actually perform parsing, and a post-parse pass can always simplify to the former if required.

@AndreiCravtov
Author

Yeah, I agree this is a bit unintuitive. This particular language spec isn't one I control; rather, it's a standard academic toy-language spec designed to test whoever is implementing it with weird parsing edge cases.

The spec requires, as a syntactic constraint, that integer literals fit within a signed (two's complement) 32-bit range. So, for example, -2147483648 should be parsed as [Int(-2147483648)]; parsing it as [Minus, Int(2147483648)] makes the literal appear to exceed the allowed range even though it doesn't.
Conversely, - 2147483648 (with whitespace) should be parsed as [Minus, Int(2147483648)] and a compiler error should be emitted; merging that token stream into [Int(-2147483648)] is incorrect according to the grammar, and the error would be silently suppressed.

So basically, the lexer removes whitespace, making it difficult to determine whether an integer literal conforms to the language spec. I guess the lexer could include whitespace/comment tokens in the output token stream? Though that would mean the parser has to deal with whitespace as well, which might be a downside.

Overall, the current solution I have works, but I have a hunch there is a better method/pattern to achieve this with this library; I just haven't come across it yet.

@Zij-IT
Contributor

Zij-IT commented Feb 4, 2025

In this case, allow a leading - whenever you parse a number. Then, whenever you do parse the leading -, try_map it to validate that the minus can be applied, and if not, emit an error.
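A sketch of that validation step (my reading of the suggestion, with a plain function standing in for the try_map closure; not chumsky code):

```rust
// Validate a signed lexeme against the i32 range. Parsing through i64 lets
// both boundary cases be checked exactly: "-2147483648" is i32::MIN and is
// accepted, while "2147483648" exceeds i32::MAX and is rejected.
fn validate_int_literal(lexeme: &str) -> Result<i32, String> {
    let value: i64 = lexeme
        .parse()
        .map_err(|_| format!("`{lexeme}` is not an integer"))?;
    i32::try_from(value).map_err(|_| format!("`{lexeme}` is out of i32 range"))
}

fn main() {
    assert_eq!(validate_int_literal("-2147483648"), Ok(i32::MIN));
    assert!(validate_int_literal("2147483648").is_err());
    assert_eq!(validate_int_literal("42"), Ok(42));
}
```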

@AndreiCravtov
Author

I'm a little confused about which stage I should parse a leading - in. If it's the lexing stage, then that is the topic of this question: I have a working solution, but I think it is somewhat convoluted.

However, if it's in the parsing stage, e.g. [Minus, Int(2)] => IntLiteral(-2), then the problem arises that I don't know whether there was whitespace between those two tokens. If there was whitespace, it should parse to Unary(Minus, Int(2)); if there wasn't, it should parse to IntLiteral(-2). The only way for the parser to know is for the lexer to additionally emit whitespace tokens.
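For what it's worth, the whitespace-token idea in that last sentence might look like this (a toy sketch with hypothetical Tok/Expr types, not the actual parser):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Minus,
    Int(i32),
    Whitespace,
}

#[derive(Debug, PartialEq)]
enum Expr {
    IntLiteral(i32),
    // Simplified stand-in for Unary(Minus, Int(n)).
    NegateUnary(i32),
}

// With whitespace tokens kept in the stream, the parser can tell `-2`
// (adjacent: part of the literal) from `- 2` (separated: unary operator).
fn parse_prefix(toks: &[Tok]) -> Option<Expr> {
    match toks {
        [Tok::Minus, Tok::Int(n), ..] => Some(Expr::IntLiteral(-*n)),
        [Tok::Minus, Tok::Whitespace, Tok::Int(n), ..] => Some(Expr::NegateUnary(*n)),
        [Tok::Int(n), ..] => Some(Expr::IntLiteral(*n)),
        _ => None,
    }
}

fn main() {
    assert_eq!(
        parse_prefix(&[Tok::Minus, Tok::Int(2)]),
        Some(Expr::IntLiteral(-2))
    );
    assert_eq!(
        parse_prefix(&[Tok::Minus, Tok::Whitespace, Tok::Int(2)]),
        Some(Expr::NegateUnary(2))
    );
}
```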
