
Allow parsing stream (sequence) of arbitrary tokens, not a string #10

Open
Dimagog opened this issue Apr 14, 2013 · 8 comments
@Dimagog

Dimagog commented Apr 14, 2013

This may be a duplicate of #9, but maybe not.
If I already have a lexer that generates a seq of tokens I'd like to be able to parse it with instaparse.

@Engelberg
Owner

When I started this project, I specifically wanted to make it easy to create a lexer-less parser. To achieve that goal, it was essential to allow regexes for the terminals. Regexes in Java only work on text, so I decided to focus on parsing strings, rather than general sequences of tokens.

The underlying algorithm could in principle be adapted to work on sequences of tokens; it just wouldn't be possible to use regular expressions with such a sequence. I don't see a realistic way to mix and match the lexer approach with the regexes-as-terminals approach -- it seems you'd have to pick one or the other to specify your tokens. I suppose I could just say "No regular expressions are allowed if you're operating on something other than strings."

One thing that worries me is that it complicates the specification of the grammar since the token definitions and token format have to be defined elsewhere.

Overall, this seems like a worthwhile extension (and your Python example is a great use case), but I need to think through some of these issues before diving in. Thanks for the suggestion.

@Engelberg
Owner

In the meantime, there's a workaround you could potentially employ: map tokens like INDENT and DEDENT to unused characters, rebuild the token sequence as a string, and then run instaparse on that. I believe ASCII characters 0-8 and 11-31 are unused and could serve as tokens.
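To make the workaround concrete, here is a minimal sketch (in Python for illustration; the same idea carries over directly to Clojure). The token names and the choice of control characters are assumptions for the example, not anything instaparse prescribes:

```python
# Sketch of the workaround: map non-character tokens (e.g. INDENT/DEDENT
# from a Python-style lexer) onto unused control characters, so the token
# stream can be rebuilt as a plain string and fed to a string-based parser.

# Hypothetical mapping: ASCII 14 and 15 are unlikely to appear in source text.
TOKEN_TO_CHAR = {"INDENT": "\x0e", "DEDENT": "\x0f"}

def tokens_to_string(tokens):
    """Rebuild a token sequence as a single string.

    Plain-text tokens pass through unchanged; structural tokens are
    replaced by their reserved control character."""
    return "".join(TOKEN_TO_CHAR.get(tok, tok) for tok in tokens)

# Example: a token stream a lexer might emit for an indented block.
tokens = ["if x:", "\n", "INDENT", "y = 1", "\n", "DEDENT"]
encoded = tokens_to_string(tokens)
assert encoded == "if x:\n\x0ey = 1\n\x0f"
```

The grammar would then treat those reserved characters as ordinary terminals (e.g. `INDENT = '\u000E'` in instaparse's string-literal syntax).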

@JulianBirch

Just a thought: could you allow mixed streams, with some pre-parsed terminals and some strings?

Another case which is hard to parse without an external lexer: mustache, with set delimiters.

@Engelberg
Owner

I'm continuing to think about this, but it's definitely a non-trivial change that requires careful consideration.

Here's one question:

If instaparse provides for streams of tokens, is instaparse also obligated to provide an easy-to-use lexer? Or are there already good tools for that? Or is the stream-of-tokens scenario mostly relevant for things like Python whitespace and mustache's set delimiters, which would probably require a very customized lexer anyway?

Right now, if I needed to parse mustache, first I'd use a regular expression to scan the string for spots where the delimiters change. Then I'd preprocess the sections, converting the delimiters to unused ASCII characters, say, 30 and 31. Then I'd put it all back together as a single string to use with instaparse.

@JulianBirch

I should say that:

1) I'm not convinced this is the right thing for instaparse, but I want to explore it.
2) The workaround is neat, but it's pretty bad for preserving line information in error messages, or anything else that should preserve formatting.
3) If you did decide to do it (and I'm not saying you should), I don't think you should feel the need to deliver a kick-ass lexer as well. Leave that for someone else :)

@Dimagog
Author

Dimagog commented May 21, 2013

I'd say that if Instaparse is given a stream of tokens, then by definition it is out of the lexing business for that particular scenario. Now it's possible (but not required) that the token stream was generated by another pass of Instaparse (the lexer), then massaged by some custom code, and then passed to Instaparse (the parser).

@malcolmsparks

What about a mechanism for injecting a partial function that would be mapped across the sequence of terminals generated by a given production? I'm not sure how you'd implement this, or whether it would even be possible, but something akin to the positive and negative lookahead features of Instaparse.

tokens ::= #(maptok %) token*

word-batches ::= #(partition 10 %) words
words ::= words+

Perhaps the parser could be given an extra :mappings argument, to map symbols to partials. This would allow for injecting a function that could insert INDENT and DEDENT tokens appropriately.

In a sense, this is using Instaparse as both the lexer and the grammar processor at the same time, and having a syntax for stitching both parts together in the same syntax. You could do this in code, as Mark points out, but the convenience of declaring it in the grammar would make a big difference.
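The post-processing half of this proposal can be sketched independently of any grammar syntax (in Python for illustration; the tree shape, function names, and the `:mappings` table are all hypothetical, not instaparse API):

```python
# Sketch of the proposed :mappings idea: after parsing, walk the tree and,
# for each node whose tag has a registered function, apply that function
# to the node's children before returning the rebuilt node.

def apply_mappings(tree, mappings):
    """tree is a nested [tag, child, child, ...] structure."""
    if not isinstance(tree, list):
        return tree  # a leaf (terminal), left unchanged
    tag, *children = tree
    children = [apply_mappings(c, mappings) for c in children]
    if tag in mappings:
        children = mappings[tag](children)
    return [tag, *children]

# Example: batch the children of a :words node into groups of two,
# mimicking the "word-batches ::= #(partition 10 %) words" sketch above.
def partition2(xs):
    return [xs[i:i + 2] for i in range(0, len(xs), 2)]

tree = [":words", "a", "b", "c", "d"]
batched = apply_mappings(tree, {":words": partition2})
assert batched == [":words", ["a", "b"], ["c", "d"]]
```

In Clojure the same shape falls out naturally from `clojure.walk/postwalk` over hiccup-style trees, which is roughly what declaring the functions in the grammar would automate.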

@malcolmsparks

This paper discusses an approach to the indentation problem:

http://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf

aengelberg pushed a commit to aengelberg/instaparse that referenced this issue Jun 17, 2015
@Engelberg Engelberg self-assigned this May 27, 2017
mainej added a commit to mainej/instaparse that referenced this issue May 9, 2023
Fixes [223][223] by allowing users to customize the starting line and column.
Facilitates chunked parsing, as described in Engelberg#10 of the [performance
tips][performance-tips].

[223]: Engelberg#223
[performance-tips]: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips