
Allow parsing stream (sequence) of arbitrary tokens, not a string #10

Open
Dimagog opened this issue Apr 14, 2013 · 8 comments
@Dimagog

Dimagog commented Apr 14, 2013

This may be a duplicate of #9, but maybe not.
If I already have a lexer that generates a seq of tokens I'd like to be able to parse it with instaparse.

@Engelberg
Owner

When I started this project, I specifically wanted to make it easy to create a lexer-less parser. To achieve that goal, it was essential to allow regexes for the terminals. Regexes in Java only work on text, so I decided to focus on parsing strings, rather than general sequences of tokens.

The underlying algorithm could in principle be adapted to work on sequences of tokens; it just wouldn't be possible to use regular expressions with such a sequence. I don't see a realistic way to mix and match the lexer approach with the regexes-as-terminals approach -- it seems you'd have to pick one or the other to specify your tokens. I suppose I could just say "No regular expressions are allowed if you're operating on something other than strings."

One thing that worries me is that it complicates the specification of the grammar since the token definitions and token format have to be defined elsewhere.

Overall, this seems like a worthwhile extension (and your Python example is a great use case), but I need to think through some of these issues before diving in. Thanks for the suggestion.

@Engelberg
Owner

In the meantime, there's a workaround you could potentially employ: map tokens like INDENT and DEDENT to unused characters, rebuild the token sequence as a string, and then run instaparse on that. I believe ASCII characters 0-8 and 11-31 are unused and could serve as tokens.
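To make the workaround concrete, here is a minimal sketch (in Python for illustration; the same idea carries over directly to Clojure). The token names and the choice of control characters are assumptions for the example, not anything instaparse prescribes:

```python
# Sketch of the workaround: map non-character tokens (e.g. INDENT/DEDENT
# from a Python-style lexer) onto unused control characters, so the token
# stream can be rebuilt as a plain string and fed to a string-based parser.

# Hypothetical mapping: ASCII 14 and 15 are unlikely to appear in source text.
TOKEN_TO_CHAR = {"INDENT": "\x0e", "DEDENT": "\x0f"}

def tokens_to_string(tokens):
    """Rebuild a token sequence as a single string.

    Plain-text tokens pass through unchanged; structural tokens are
    replaced by their reserved control character."""
    return "".join(TOKEN_TO_CHAR.get(tok, tok) for tok in tokens)

# Example: a token stream a lexer might emit for an indented block.
tokens = ["if x:", "\n", "INDENT", "y = 1", "\n", "DEDENT"]
encoded = tokens_to_string(tokens)
assert encoded == "if x:\n\x0ey = 1\n\x0f"
```

The grammar would then treat those reserved characters as ordinary terminals (e.g. `INDENT = '\u000E'` in instaparse's string-literal syntax).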

@JulianBirch

Just a thought: could you allow mixed streams, with some pre-parsed terminals and some strings?

Another case which is hard to parse without an external lexer: mustache, with set delimiters.

@Engelberg
Owner

I'm continuing to think about this, but it's definitely a non-trivial change that requires careful consideration.

Here's one question:

If instaparse provides for streams of tokens, is instaparse also obligated to provide an easy-to-use lexer? Or are there already good tools for that? Or is the stream-of-tokens scenario mostly relevant for things like Python whitespace and mustache's set delimiters, which would probably require a very customized lexer anyway?

Right now, if I needed to parse mustache, first I'd use a regular expression to scan the string for spots where the delimiters change. Then I'd preprocess the sections, converting the delimiters to unused ASCII characters, say, 30 and 31. Then I'd put it all back together as a single string to use with instaparse.

@JulianBirch

I should say that:

1) I'm not convinced this is the right thing for instaparse, but I want to explore it.
2) The workaround is neat, but it's pretty bad for preserving line information in error messages, or anything else that should preserve formatting.
3) If you did decide to do it (and I'm not saying you should), I don't think you should feel the need to deliver a kick-ass lexer as well. Leave that for someone else :)

@Dimagog
Author

Dimagog commented May 21, 2013

I'd say that if Instaparse is given a stream of tokens, then by definition it is out of the lexing business for that particular scenario. Now it's possible (but not required) that the token stream was generated by another pass of Instaparse (the lexer), then massaged by some custom code, and then passed to Instaparse (the parser).

@malcolmsparks

What about a mechanism for injecting a partial function that would be mapped across the sequence of terminals generated by a given production? I'm not sure how you'd implement this, or whether it would even be possible, but something akin to the positive and negative lookahead features of Instaparse.

tokens ::= #(maptok %) token*

word-batches ::= #(partition 10 %) words
words ::= words+

Perhaps the parser could be given an extra :mappings argument, to map symbols to partials. This would allow for injecting a function that could insert INDENT and DEDENT tokens appropriately.

In a sense, this is using Instaparse as both the lexer and the grammar processor at the same time, and having a syntax for stitching both parts together in the same syntax. You could do this in code, as Mark points out, but the convenience of declaring it in the grammar would make a big difference.
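The post-processing half of this proposal can be sketched independently of any grammar syntax (in Python for illustration; the tree shape, function names, and the `:mappings` table are all hypothetical, not instaparse API):

```python
# Sketch of the proposed :mappings idea: after parsing, walk the tree and,
# for each node whose tag has a registered function, apply that function
# to the node's children before returning the rebuilt node.

def apply_mappings(tree, mappings):
    """tree is a nested [tag, child, child, ...] structure."""
    if not isinstance(tree, list):
        return tree  # a leaf (terminal), left unchanged
    tag, *children = tree
    children = [apply_mappings(c, mappings) for c in children]
    if tag in mappings:
        children = mappings[tag](children)
    return [tag, *children]

# Example: batch the children of a :words node into groups of two,
# mimicking the "word-batches ::= #(partition 10 %) words" sketch above.
def partition2(xs):
    return [xs[i:i + 2] for i in range(0, len(xs), 2)]

tree = [":words", "a", "b", "c", "d"]
batched = apply_mappings(tree, {":words": partition2})
assert batched == [":words", ["a", "b"], ["c", "d"]]
```

In Clojure the same shape falls out naturally from `clojure.walk/postwalk` over hiccup-style trees, which is roughly what declaring the functions in the grammar would automate.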

@malcolmsparks

This paper discusses an approach to the indentation problem:

http://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf

aengelberg pushed a commit to aengelberg/instaparse that referenced this issue Jun 17, 2015
@Engelberg Engelberg self-assigned this May 27, 2017
mainej added a commit to mainej/instaparse that referenced this issue May 9, 2023
Fixes [223][223] by allowing users to customize the starting line and column.
Facilitates chunked parsing, as described in Engelberg#10 of the [performance
tips][performance-tips].

[223]: Engelberg#223
[performance-tips]: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips