Streaming parser #37

daveyarwood · 2017-05-10T12:40:07Z

This is a rewrite of the Alda parser from the ground-up, replacing the current parser-generator (Instaparse) implementation with an asynchronous pipeline using core.async.

The motivation for the rewrite is described in detail in #20. Parsing is currently the bottleneck when parsing large scores -- it can often take seconds, depending on the size of the score. This rewrite speeds up parsing significantly for larger scores, while remaining about the same (just slightly faster) for scores under about 100 lines, like most of our example scores -- see the benchmarks below.

The idea is that we can start to construct a score in another thread as soon as the first token is parsed. There are actually several threads going, each performing a separate stage of the parsing process in parallel. The result is much faster parsing times (depending on the size of the score) and a foundation for doing more interesting things with streaming in the future.

Benchmarks (parsing the example scores):

The metric here is the time it takes to parse an input file as events and then use the events to build a score. The "before" numbers are on the master branch, and the "after" numbers are on this branch.

across_the_sea.alda
  before: 42 ms
  after: 43 ms

awobmolg.alda
  before: 296 ms
  after: 180 ms

bach_cello_suite_no_1.alda
  before: 593 ms
  after: 368 ms

clapping_music.alda
  before: 186 ms
  after: 191 ms

debussy_quartet.alda
  before: 316 ms
  after: 267 ms

entropy.alda
  before: 105 ms
  after: 87 ms

gau.alda
  before: 66 ms
  after: 90 ms

hello_world.alda
  before: 4 ms
  after: 9 ms

key_signature.alda
  before: 37 ms
  after: 21 ms

multi-poly.alda
  before: 21 ms
  after: 12 ms

nesting.alda
  before: 29 ms
  after: 34 ms

overriding-a-global-attribute.alda
  before: 31 ms
  after: 20 ms

panning.alda
  before: 42 ms
  after: 49 ms

percussion.alda
  before: 42 ms
  after: 134 ms

phase.alda
  before: 34 ms
  after: 42 ms

poly.alda
  before: 291 ms
  after: 192 ms

printing.alda
  before: 21 ms
  after: 34 ms

variables.alda
  before: 96 ms
  after: 78 ms

variables-2.alda
  before: 12 ms
  after: 21 ms

As a final benchmark to test a very long score, I edited bach_cello_suite_no_1.alda by copying the entire file and pasting it to the end of the file about 15 times, the result being a score that is 15x the length of that example (just under 1000 lines of code):

Before: 8746 ms
After: 4709 ms

Note that this is only implements the "parse -> eval" piece of a fully streaming parse/play pipeline. I found that just by rewriting the parser by hand instead of using a parser generator library, parsing is significantly faster, but we still must wait for the score to be fully evaluated before we start to play it.

There is room for more improvement in the future -- we could start to play notes as soon as they are generated during the score-building phase. In the case of the 1000 line example above, this would take the 4709 ms wait before you start to hear any notes down to almost 0 ms! The score would still take almost 5 seconds to fully parse, but that would be happening in the background as you're listening to the score. I'm really excited about that potential, and I think this PR paves the way for us to get there in the near future.

Simplifying this a little bit: I don't think we really need to be able to "parse with context"; we only need to be able to parse input as a sequence of events, instead of a score map. Ramifications: - We are no longer in the business of generating alda.lisp code. We could add this back in the future, but I think it's extraneous since that's no longer an intermediate step in the parsing pipeline. (It actually hasn't been a step in the parsing process for some time now, and I don't consider it an essential feature of Alda.) - Not generating alda.lisp code makes testing the parser a little trickier, since not all events have good equality semantics out-of-the-box. For example, when a note event is generated by the parser, it includes an anonymous "pitch-fn" which is not considered equal (by Clojure standards) to the pitch-fn of another note, even if the note has the same pitch. I am taking this as an opportunity to re-implement certain events in ways that will give them proper equality semantics. Then testing the parser is just a matter of parsing input and comparing the actual sequence of events vs. an expected sequence of events that can be obtained by writing alda.lisp code in the test namespaces. This is nice because the tests still read just as well, but we're testing the equality of events instead of the equality of generated Clojure code which can always change arbitrarily. - The Alda REPL will need to be tweaked to use parse-input instead of parse-to-*-with-context, and also we won't know the context anymore. I think this is OK, because the only context we've ever needed so far in the REPL has been the current instruments, and those are easily obtained from the score map. My original idea (and the inspiration behind parse-to-*-with-context) was for the parser to give a lot of detail about where exactly we are in a score, e.g. are we in the middle of a voice? inside of an event sequence? in an instrument part? etc. But in practice, we haven't needed to use any of that context at all so far, and interestingly, we started adding more context to the score map itself, so I think even if we DID start to need that context in the REPL, we could just get it from the score map. So, this is a nice opportunity to clean up the code by removing complicated WIP features that we aren't using.

Conceptually simpler, and allows for notes to be equal in Clojure's eyes.

- dissoc :chord? from notes so that they can be equal to notes created via alda.lisp - don't flush the buffer on attribute changes

The parser now emits "empty" parts and voices, which have the same effect as before because they "set" the part/voice to use for the subsequent events.

daveyarwood · 2017-05-10T12:43:05Z

An added benefit of rewriting the parser from the ground up is that it fixes issue #12. Line and column numbers are propagated through correctly this way, and error messages are clearer and more precise.

daveyarwood · 2017-05-12T11:51:23Z

Trying this locally. Unfortunately, I'm hearing a lot of issues where audio playback will intermittently hang onto a note for a couple seconds or so before moving on. All the notes are in sync, it's just like the score pauses for a second before playback continues.

It seems like this is a weird interaction between JSyn and core.async.

Otherwise (e.g. if one of the events is an error), I think the channel hangs around in memory forever waiting to continue being consumed.

daveyarwood · 2017-05-12T12:41:15Z

(Note: still a problem after that last commit.)

daveyarwood · 2017-05-27T21:52:48Z

Good news! The problem only seems to happen when running in development mode via Boot. I think Boot imposes a good deal of overhead, so I think adding core.async to the mix might have been the straw that broke the camel's back.

I did a build of Alda using this branch, though, and ran it as an executable jar file, and the problem doesn't exist. Also, parsing is crazy fast. Around 100ms to parse the Bach cello suite example!

I feel good about including this in 1.0.0.

daveyarwood added 20 commits January 14, 2017 14:48

streaming parser: tokenizer

7e46451

WIP: scaffolding for event parser

03dd247

simplify, don't put final parser state on the stream

ac4cd46

include line & column number when emitting tokens

9114a53

WIP: event parser

5907521

use clojure.tools.namespace to refresh code in dev

0d610de

event parser

19a427a

add examples for untested features of alda

81e3ab2

allow names to include dots (group access operator)

a802c30

replace pitch-fn with a map containing letter and accidentals

f668ce5

Conceptually simpler, and allows for notes to be equal in Clojure's eyes.

require/refer alda.lisp on demand

dfaa9ae

aggregate_events fixes

c1b87f0

- dissoc :chord? from notes so that they can be equal to notes created via alda.lisp - don't flush the buffer on attribute changes

fixes for parsing notes, rests & chords

2bf346b

adjust expected parse results

d785bd8

The parser now emits "empty" parts and voices, which have the same effect as before because they "set" the part/voice to use for the subsequent events.

get all tests passing

f7f4761

require alda.lisp earlier in the pipeline

e409590

consistency: always pass errors through as exception objects

b5da1ed

upgrade dependencies

1da749f

documentation updates

853c94e

daveyarwood added the in progress label May 10, 2017

daveyarwood mentioned this pull request May 10, 2017

Error message line numbers / better parse error reporting #12

Closed

daveyarwood self-assigned this May 10, 2017

details for changelog

eb22ba7

daveyarwood mentioned this pull request May 10, 2017

Experiment: streaming parse and eval #20

Closed

ensure events channel is totally consumed

6182cfc

Otherwise (e.g. if one of the events is an error), I think the channel hangs around in memory forever waiting to continue being consumed.

daveyarwood added this to the 2.0.0 milestone May 14, 2017

daveyarwood added ready and removed in progress labels May 15, 2017

daveyarwood mentioned this pull request May 15, 2017

(volume 0) optimization #22

Closed

daveyarwood merged commit 80269a9 into master May 28, 2017

daveyarwood deleted the streaming-parser branch May 28, 2017 01:57

daveyarwood removed the ready label May 28, 2017

daveyarwood removed this from the 2.0.0 milestone May 31, 2017

semi-automatique mentioned this pull request Jan 25, 2018

Improve parsing time #18

Closed

daveyarwood mentioned this pull request Jan 1, 2021

Allow multiple parts inside of an event sequence / variable definition? alda-lang/alda#323

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Streaming parser #37

Streaming parser #37

daveyarwood commented May 10, 2017 •

edited

Loading

daveyarwood commented May 10, 2017

daveyarwood commented May 12, 2017

daveyarwood commented May 12, 2017

daveyarwood commented May 27, 2017

Streaming parser #37

Streaming parser #37

Conversation

daveyarwood commented May 10, 2017 • edited Loading

Benchmarks (parsing the example scores):

daveyarwood commented May 10, 2017

daveyarwood commented May 12, 2017

daveyarwood commented May 12, 2017

daveyarwood commented May 27, 2017

daveyarwood commented May 10, 2017 •

edited

Loading