New parser model #31

octogonz · 2018-07-01T22:40:35Z

This PR introduces a rethought approach to parsing, with the following main ideas:

The supported CommonMark subset will now be much more constrained than before, based on the ideas in issue #29, "RFC: TSDoc-flavored-Markdown (TSFM) instead of CommonMark". This simplifies the job of the parser considerably.
The tokenizer is now more primitive: tokens are just single punctuation characters or spans of text.
The token objects are now internal implementation rather than part of the public API. (The reason is that the parser rolls them up into larger nodes such as DocDelimiter based on context, so the original tokens aren't very useful. To improve performance they will eventually be converted to very lightweight objects.)
The tokenizer now returns a full array of all tokens, rather than emitting an interactive stream of tokens. This was based on the realization that CommonMark is not a context-free grammar and frequently requires infinite lookahead, which makes streaming pointless.
The AST nodes are now divided into DocNodeLeaf subclasses (which represent text) and DocNodeContainer subclasses (which represent pure structure).
The NodeParser now has a formalized model for backtracking.
When backtracking occurs, the resulting DocError node distinguishes the discarded text from the text that was the "location" of the problem. (For example, in the expression "<img src=<tag />", the DocError node corresponds to the first "<", since it got downgraded to plain text, whereas the reported "location" of the error is the second "<", which will then be reinterpreted as its own node.)
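
To make the tokenizer and backtracking ideas above concrete, here is a minimal sketch. It is not the code from this PR: every name in it (`tokenize`, `TokenReader`, `tryParseHtmlStartTag`, `ParsedNode`, `errorLocation`, and so on) is hypothetical, and the HTML start tag grammar is deliberately simplified. It assumes that backtracking can be modeled as saving and restoring an index into the full token array, and that a failed construct downgrades only its first token to an error node.

```typescript
// Hypothetical sketch: none of these names are taken from the TSDoc source.
type TokenKind = 'Text' | 'Spacing' | 'Newline' | 'Punctuation' | 'EndOfInput';
interface Token { kind: TokenKind; text: string; line: number; }

// Tokenize everything up front so the parser can look ahead arbitrarily far.
function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let line: number = 1;
  // Alternatives in order: newline, spacing, run of letters/digits, any other single character.
  const pattern: RegExp = /(\n)|([ \t]+)|([A-Za-z0-9]+)|([\s\S])/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(input)) !== null) {
    const kind: TokenKind = m[1] !== undefined ? 'Newline'
      : m[2] !== undefined ? 'Spacing' : m[3] !== undefined ? 'Text' : 'Punctuation';
    tokens.push({ kind, text: m[0], line }); // each token remembers its line for diagnostics
    if (kind === 'Newline') line++;
  }
  tokens.push({ kind: 'EndOfInput', text: '', line });
  return tokens;
}

class TokenReader {
  public position: number = 0; // saving and restoring this index is the backtracking mechanism
  private readonly _tokens: Token[];
  public constructor(tokens: Token[]) { this._tokens = tokens; }
  public peek(): Token { return this._tokens[this.position]; }
  public read(): Token {
    const token: Token = this.peek();
    if (token.kind !== 'EndOfInput') this.position++;
    return token;
  }
}

type ParsedNode =
  | { kind: 'PlainText'; text: string }
  | { kind: 'HtmlStartTag'; name: string }
  | { kind: 'Error'; text: string; message: string; errorLocation: Token };

interface Failure { message: string; failedAt: number; } // index of the token where parsing gave up

// Simplified grammar (an assumption): "<" name (attr "=" quoted-value)* "/"? ">"
function tryParseHtmlStartTag(r: TokenReader): ParsedNode | Failure {
  const fail = (message: string): Failure => ({ message, failedAt: r.position });
  const isPunct = (text: string): boolean => r.peek().kind === 'Punctuation' && r.peek().text === text;
  const skipSpacing = (): void => { while (r.peek().kind === 'Spacing') r.read(); };

  r.read(); // the caller has already checked that this token is "<"
  if (r.peek().kind !== 'Text') return fail('Expecting an HTML element name');
  const name: string = r.read().text;
  skipSpacing();
  while (r.peek().kind === 'Text') {
    r.read(); // attribute name
    if (!isPunct('=')) return fail('Expecting "=" after the attribute name');
    r.read();
    if (!isPunct('"')) return fail('Expecting a quoted attribute value');
    r.read();
    while (!isPunct('"')) {
      if (r.peek().kind === 'EndOfInput' || r.peek().kind === 'Newline') return fail('Unterminated attribute value');
      r.read();
    }
    r.read(); // closing quote
    skipSpacing();
  }
  if (isPunct('/')) r.read();
  if (!isPunct('>')) return fail('Expecting ">" to close the tag');
  r.read();
  return { kind: 'HtmlStartTag', name };
}

function parse(input: string): ParsedNode[] {
  const tokens: Token[] = tokenize(input);
  const r: TokenReader = new TokenReader(tokens);
  const nodes: ParsedNode[] = [];
  while (r.peek().kind !== 'EndOfInput') {
    if (r.peek().kind === 'Punctuation' && r.peek().text === '<') {
      const marker: number = r.position;
      const result: ParsedNode | Failure = tryParseHtmlStartTag(r);
      if ('failedAt' in result) {
        // Backtrack, downgrade only the "<" to an error node, and let the
        // tokens after it be parsed again from scratch.
        r.position = marker;
        nodes.push({ kind: 'Error', text: r.read().text, message: result.message,
          errorLocation: tokens[result.failedAt] });
      } else {
        nodes.push(result);
      }
    } else {
      nodes.push({ kind: 'PlainText', text: r.read().text });
    }
  }
  return nodes;
}
```

Under these assumptions, `parse('<img src=<tag />')` produces an `Error` node whose `text` is the first "<" and whose `errorLocation` is the second "<" token, followed by plain-text nodes for `img`, the space, `src` and `=`, and finally an `HtmlStartTag` node for `tag`. Because the whole token array exists before parsing starts, restoring `position` is all that backtracking requires.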

Only a few TSDoc constructs are implemented in this PR:

backslash escapes
newlines
HTML start tags
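
As a rough illustration of how the three constructs above could sit inside the leaf/container split described earlier, here is a small sketch. Only the names `DocNodeLeaf` and `DocNodeContainer` come from this PR description. The base class, the concrete subclasses, and `collectText` are hypothetical, and treating an HTML start tag as a single leaf is an assumption made to keep the example short.

```typescript
// Hypothetical sketch: only DocNodeLeaf and DocNodeContainer are named in the PR description.
abstract class DocNode {}

// A leaf owns a span of the original text.
abstract class DocNodeLeaf extends DocNode {
  public readonly text: string;
  public constructor(text: string) {
    super();
    this.text = text;
  }
}

// A container is pure structure: it owns child nodes but no text of its own.
abstract class DocNodeContainer extends DocNode {
  public readonly children: DocNode[] = [];
  public appendChild(node: DocNode): void {
    this.children.push(node);
  }
}

// Hypothetical concrete nodes for the constructs listed above.
class PlainTextLeaf extends DocNodeLeaf {}
class BackslashEscapeLeaf extends DocNodeLeaf {}
class NewlineLeaf extends DocNodeLeaf {}
class HtmlStartTagLeaf extends DocNodeLeaf {} // assumption: the whole tag is one leaf
class ParagraphContainer extends DocNodeContainer {}

// Because only leaves carry text, the original input can be rebuilt by
// concatenating leaf text in document order.
function collectText(node: DocNode): string {
  if (node instanceof DocNodeLeaf) {
    return node.text;
  }
  if (node instanceof DocNodeContainer) {
    return node.children.map(collectText).join('');
  }
  return '';
}

const paragraph: ParagraphContainer = new ParagraphContainer();
paragraph.appendChild(new PlainTextLeaf('a'));
paragraph.appendChild(new BackslashEscapeLeaf('\\*'));
paragraph.appendChild(new NewlineLeaf('\n'));
paragraph.appendChild(new HtmlStartTagLeaf('<b>'));
```

One consequence of this split is that every character of the input belongs to exactly one leaf, so a traversal such as `collectText(paragraph)` can reproduce the source text without consulting the containers.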

The next PR will start to fill out the list of other constructs.

Code review: We realized that the prototype code has been getting churned a lot as we sort out the design. It's wasteful to review code that ends up getting deleted. In the interest of time, we're going to suspend formal code reviews for a few PRs until the core feature set is more complete. Then we'll go back and review the code in depth. We did have a series of design discussions about the high-level algorithm.

iclanton · 2018-07-01T22:52:02Z

Approved

pgonzal added 16 commits July 1, 2018 15:15

For diagnostic purposes, tokens now remember which line they came from

cf48983

Redesigned the Tokenizer Jest snapshot representation to be more readable

54b182f

Implement TokenKind.BackslashEscapedCharacter

a93ca18

Fix issue where Travis wasn't running unit tests

09524d3

Redesign the Tokenizer to be very simple and support infinite lookahead

d6553ba

Extract TSDocParser._parseLines() into a separate LineExtractor file

6cb6850

Rename ParseLines.test.ts --> LineExtractor.test.ts

3ae0d63

Initial draft of new NodeParser

f3e06dc

Rename DocHtmlTag --> DocHtmlElement to be more correct

b5557ad

Reorganize source files

c521169

Improve Tokenizer.test.ts

d99f043

Extract Token into its own source file

8ed12d2

Redesigned AST class hierarchy

80da064

Initial sketch of redesigned parsing strategy

2a00111

All unit tests are now passing

abf550c

Updated documentation

334cdc0

octogonz merged commit 9202eb5 into master Jul 1, 2018

octogonz deleted the pgonzal/new-parser-model branch July 1, 2018 22:57
