-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow parsing stream (sequence) of arbitrary tokens, not a string #10
Comments
When I started this project, I specifically wanted to make it easy to create a lexer-less parser. To achieve that goal, it was essential to allow regexes for the terminals. Regexes in Java only work on text, so I decided to focus on parsing strings, rather than general sequences of tokens. The underlying algorithm could in principle be adapted to work on sequences of tokens, it just wouldn't be possible to use regular expressions with such a sequence. I don't see a realistic way to mix and match the lexer approach with the regexes-as-terminals approach -- it seems you'd have to pick one or the other to specify your tokens. I suppose I could just say "No regular expressions are allowed if you're operating on something other than strings." One thing that worries me is that it complicates the specification of the grammar since the token definitions and token format have to be defined elsewhere. Overall, this seems like a worthwhile extension (and your Python example is a great use case), but I need to think through some of these issues before diving in. Thanks for the suggestion. |
In the meantime, there's a workaround you could potentially employ. You could map tokens like INDENT and DEDENT to unused characters and then rebuild it as a string, then run instaparse on that. I believe ASCII characters 0-8 and 11-31 are unused and could serve as tokens. |
Just a thought: could you allow mixed streams, with some pre-parsed terminals and some strings? Another case which is hard to parse without an external lexer: mustache.with set delimiters. |
I'm continuing to think about this, but it's definitely a non-trivial Here's one question: If instaparse provides for streams of tokens, is instaparse also obligated Right now, if I needed to parse mustache, first, I'd use a regular |
I should say that 1) I'm not convinced this is the right thing for instaparse, but want to explore it 2) the workaround is neat but is pretty bad for preserving line information in error messages or anything that should preserve formatting 3) if you did decide to do it, and I'm not saying you should, I don't think you should feel the need to deliver a kick-ass lexer as well. Leave that for someone else :) |
I'd say that if Instaparse is given a stream of tokens, then by definition it is out of lexing business for this particular scenario. Now it's possible (but not required) that this token stream was generated by another pass of Instaparse (lexer), then massaged by some custom code, and then passed to Instaparse (parser). |
What about a mechanism for injecting a partial function that would be mapped across the sequence of terminals generated by a given production? I'm not sure how you'd implement this, or whether it would even be possible, but something akin to the positive and negative lookahead features of Instaparse.
Perhaps the parser could be given an extra :mappings argument, to map symbols to partials. This would allow for injecting a function that could inject INDENT and DEDENTS appropriately. In a sense, this is using Instaparse as both the lexer and the grammar processor at the same time, and having a syntax for stitching both parts together in the same syntax. You could do this in code, as Mark points out, but the convenience of declaring it in the grammar would make a big difference. |
This paper discusses an approach to the indentation problem http://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf |
Tweak cljs test runner
Fixes [223][223] by allowing users to customize the starting line and column. Facilitates chunked parsing, as described in Engelberg#10 of the [performance tips][performance-tips]. [223]: Engelberg#223 [performance tips]: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips
Fixes [223][223] by allowing users to customize the starting line and column. Facilitates chunked parsing, as described in Engelberg#10 of the [performance tips][performance-tips]. [223]: Engelberg#223 [performance-tips]: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips
This may be a duplicate of #9, but maybe not.
If I already have a lexer that generates a seq of tokens I'd like to be able to parse it with instaparse.
The text was updated successfully, but these errors were encountered: