New parser model #31

Merged
octogonz merged 16 commits into master from pgonzal/new-parser-model
Jul 1, 2018

octogonz (Collaborator) commented Jul 1, 2018

This PR introduces a rethought approach to parsing, with the following main ideas:

  • The supported CommonMark subset will now be much more constrained than before, based on the ideas in issue #29 (RFC: TSDoc-flavored-Markdown (TSFM) instead of CommonMark). This simplifies the job of the parser considerably.
  • We're making the tokenizer more primitive; tokens are now just simple punctuation characters or spans of text.
  • The token objects are now internal implementation rather than part of the public API. (The reason is that the parser rolls them up into larger nodes such as DocDelimiter based on context, so the original tokens aren't very useful. To improve performance they will eventually be converted to very lightweight objects.)
  • The tokenizer now returns a full array of all tokens, rather than emitting an interactive stream of tokens. This was based on the realization that CommonMark's grammar is not context-free and frequently requires unbounded lookahead, which makes streaming pointless.
  • The AST nodes are now divided into DocNodeLeaf subclasses (which represent text) and DocNodeContainer subclasses (which represent pure structure).
  • The NodeParser now has a formalized model for backtracking.
  • When backtracking occurs, the resulting DocError node distinguishes the discarded text from the text that was the "location" of the problem. (For example, in the expression "<img src=<tag />", the DocError node corresponds to the first "<", since it got downgraded to plain text, whereas the reported "location" of the error is the second "<", which will also be reinterpreted as its own node.)
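
To make the "primitive tokenizer" idea concrete, here is a minimal sketch of a tokenizer that returns the complete token array up front. Every name in it (`TokenKind`, `Token`, `PUNCTUATION`, `tokenize`) is an assumption made for illustration, and the punctuation set is arbitrary; it is not the tokenizer from this PR.

```typescript
// Hypothetical sketch: none of these names come from the TSDoc code base.
enum TokenKind {
  Text = 'Text',               // a run of non-punctuation characters
  Newline = 'Newline',         // a single "\n"
  Punctuation = 'Punctuation'  // exactly one punctuation character
}

interface Token {
  kind: TokenKind;
  text: string;
  start: number; // offset of the token within the input string
}

// Assumed punctuation set, chosen only for this example.
const PUNCTUATION: string = '\\<>=/"{}@';

// Returns every token at once instead of streaming them, so a parser
// can look ahead or back up to any index in the array.
function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i: number = 0;
  while (i < input.length) {
    const c: string = input[i];
    if (c === '\n') {
      tokens.push({ kind: TokenKind.Newline, text: c, start: i });
      i++;
    } else if (PUNCTUATION.indexOf(c) >= 0) {
      tokens.push({ kind: TokenKind.Punctuation, text: c, start: i });
      i++;
    } else {
      // Accumulate a span of ordinary text up to the next special character.
      const start: number = i;
      while (i < input.length && input[i] !== '\n' && PUNCTUATION.indexOf(input[i]) < 0) {
        i++;
      }
      tokens.push({ kind: TokenKind.Text, text: input.substring(start, i), start });
    }
  }
  return tokens;
}
```

Because the tokens are this simple, concatenating their `text` fields reproduces the input exactly, which lets a parser downgrade any token back to plain text without losing characters.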

Only a few TSDoc constructs are implemented in this PR:

  • backslash escapes
  • newlines
  • HTML start tags
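
As a rough illustration of the backtracking model described above, the sketch below parses HTML start tags over a token array and, on failure, rewinds to a marker, emits an error node for the discarded "<", and records where the problem was detected. All names (`TokenReader`, `ParseFailure`, `parse`, the node shapes, the regular expression used to split tokens) are hypothetical and simplified; this is not the NodeParser from this PR.

```typescript
// Hypothetical sketch of marker-based backtracking; not the real NodeParser.
type DocNode =
  | { kind: 'PlainText'; text: string }
  | { kind: 'HtmlStartTag'; name: string; attributes: Record<string, string>; selfClosing: boolean }
  | { kind: 'Error'; text: string; message: string; locationIndex: number };

class ParseFailure {
  public constructor(public readonly message: string, public readonly tokenIndex: number) {}
}

class TokenReader {
  public index: number = 0;
  public constructor(public readonly tokens: string[]) {}
  public peek(): string | undefined { return this.tokens[this.index]; }
  public read(): void { if (this.index < this.tokens.length) { this.index++; } }
  public mark(): number { return this.index; }
  public backtrackTo(marker: number): void { this.index = marker; }
}

// Assumed token shapes: one punctuation character, a whitespace run, or a text run.
const TOKEN_REGEXP: RegExp = /[\\<>=\/"]|\s+|[^\\<>=\/"\s]+/g;

function isName(t: string | undefined): t is string {
  return t !== undefined && /^[A-Za-z][A-Za-z0-9-]*$/.test(t);
}

function skipSpaces(r: TokenReader): void {
  let t: string | undefined = r.peek();
  while (t !== undefined && /^\s+$/.test(t)) { r.read(); t = r.peek(); }
}

function parseHtmlStartTag(r: TokenReader): DocNode | ParseFailure {
  r.read(); // consume "<"
  const name: string | undefined = r.peek();
  if (!isName(name)) { return new ParseFailure('Expecting an element name', r.index); }
  r.read();
  const attributes: Record<string, string> = {};
  for (;;) {
    skipSpaces(r);
    const t: string | undefined = r.peek();
    if (t === '>') { r.read(); return { kind: 'HtmlStartTag', name, attributes, selfClosing: false }; }
    if (t === '/') {
      r.read();
      if (r.peek() !== '>') { return new ParseFailure('Expecting ">"', r.index); }
      r.read();
      return { kind: 'HtmlStartTag', name, attributes, selfClosing: true };
    }
    if (!isName(t)) { return new ParseFailure('Expecting an attribute name', r.index); }
    r.read();
    skipSpaces(r);
    if (r.peek() !== '=') { return new ParseFailure('Expecting "="', r.index); }
    r.read();
    skipSpaces(r);
    if (r.peek() !== '"') { return new ParseFailure('Expecting a quoted value', r.index); }
    r.read();
    let value: string = '';
    let v: string | undefined = r.peek();
    while (v !== undefined && v !== '"') { value += v; r.read(); v = r.peek(); }
    if (v === undefined) { return new ParseFailure('Unterminated attribute value', r.index); }
    r.read(); // closing quote
    attributes[t] = value;
  }
}

function parse(input: string): DocNode[] {
  const tokens: string[] = input.match(TOKEN_REGEXP) || [];
  const r: TokenReader = new TokenReader(tokens);
  const nodes: DocNode[] = [];
  let t: string | undefined = r.peek();
  while (t !== undefined) {
    if (t === '<') {
      const marker: number = r.mark();
      const result: DocNode | ParseFailure = parseHtmlStartTag(r);
      if (result instanceof ParseFailure) {
        // Rewind, discard only the "<", and remember where the failure was seen.
        r.backtrackTo(marker);
        r.read();
        nodes.push({ kind: 'Error', text: '<', message: result.message, locationIndex: result.tokenIndex });
      } else {
        nodes.push(result);
      }
    } else {
      r.read();
      nodes.push({ kind: 'PlainText', text: t }); // one node per token, for brevity
    }
    t = r.peek();
  }
  return nodes;
}
```

For the input `<img src=<tag />`, this sketch yields an error node whose text is the first "<" and whose `locationIndex` points at the token holding the second "<"; parsing then resumes after the discarded character, so the second "<" begins a tag node of its own. Backslash escapes and newlines are left out to keep the example short.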

The next PR will start to fill out the list of other constructs.

Code review: We realized that the prototype code has been getting churned a lot as we sort out the design. It's wasteful to review code that ends up getting deleted. In the interest of time, we're going to suspend formal code reviews for a few PRs until the core feature set is more complete. Then we'll go back and review the code in depth. We did have a series of design discussions about the high-level algorithm.

iclanton (Member) commented Jul 1, 2018

Approved

Approved with PullApprove

octogonz merged commit 9202eb5 into master on Jul 1, 2018
octogonz deleted the pgonzal/new-parser-model branch on July 1, 2018 22:57
3 participants