Markdown support and whitespace handling #404

jeromesimeon · 2019-08-09T16:58:04Z

There are a number of issues with whitespace occurring when trying to parse grammar and templates through commonmark.

The commonmark parser applies (arguably arcane) whitespace rules when parsing the source text containing markdown.

For instance, the whitespace at the end of a paragraph is stripped. This can be seen on the CommonMark "Dingus"

From the snippet (note the trailing whitespace after here. ):

## Try CommonMark

You can try CommonMark here.

the corresponding commonmark AST will remove the whitespace:

<document xmlns="http://commonmark.org/xml/1.0">
  <heading level="2">
    <text>Try CommonMark</text>
  </heading>
  <paragraph>
    <text>You can try CommonMark here.</text>
  </paragraph>
</document>

This can be problematic when trying to apply the same rules inside a template grammar. An example of that can be found with the copyright-notice which has a nested clause with some trailing whitespace:

3. {{#paymentClause}}Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of {{amountText}} ({{amount}}) upon execution of this Agreement, payable as follows: {{paymentProcedure}}. {{/paymentClause}}

Will be parsed by commonmark as follows:

      <paragraph>
        <text>{{#paymentClause}}Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of {{amountText}} ({{amount}}) upon execution of this Agreement, payable as follows: {{paymentProcedure}}. {{/paymentClause}}</text>
      </paragraph>

The relevant part is the whitespace just before the end of the paymentClause:

{{paymentProcedure}}. {{/paymentClause}}
                     ^

Which is not stripped, while the corresponding sample will have that whitespace stripped.

3. Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of one hundred US Dollars (100 USD) upon execution of this Agreement, payable as follows: bank transfer.

returns the AST:

      <paragraph>
        <text>Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of one hundred US Dollars (100 USD) upon execution of this Agreement, payable as follows: bank transfer.</text>
      </paragraph>

Notice the bank transfer.</text> which does not have whitespace after the .

The text was updated successfully, but these errors were encountered:

jeromesimeon · 2019-08-09T17:00:06Z

This means that applying a process where we parse both the grammar and sample text with commonmark before handing the text over to the current Cicero parser will fail in some cases. this can be seen on the current branch for markdown support at: https://github.com/accordproject/cicero/tree/ds-markdown-support

jeromesimeon · 2019-08-09T17:03:01Z

Some possible workaround:

Make the Cicero parser less whitespace sensitive (with what rules may be somewhat tricky to decide, as those rules will have to be consistent with markdown parsing, but it could be a good feature anyway)
Normalize whitespace in the grammar and sample before handing it to the common mark parser (again rules would have to be markdown-aware)
It is possible that those issues become easier to handle once we switch to the new template syntax, so this could be handled then.

mttrbrts · 2019-08-09T17:03:04Z

This sounds like an generalised instance of this issue, #197

It should be straightforward to relax the Cicero parser to be tolerant of leading and trailing whitespace, right?

jeromesimeon · 2019-08-09T17:06:02Z

It may depend on the definition of straightforward... In the example above, the "trailing" whitespace in the grammar is somewhere in between {{paymentProcedure}}. and {{/paymentClause}}

mttrbrts · 2019-08-09T17:12:31Z

Could we not just run String.trim() on the contents of the inline clause definition before parsing?

jeromesimeon · 2019-08-09T17:42:20Z

It's possible, my concern is how do those interact with whitespace rules from commonmark. Hypothetically, the inline clause could have a paragraph creation, something like:

3. {{#paymentClause}}Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of {{amountText}} ({{amount}}) upon execution of this Agreement, payable as follows: {{paymentProcedure}}.

{{/paymentClause}}More text here...

With a corresponding text that looks as follows:

3. Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of one hundred US Dollars (100 USD) upon execution of this Agreement, payable as follows: bank transfer.

More text here...

In which case, we may not want to simply trim the end of the inline clause. Although of course we could argue this is not a well formed example.

In other words, it would be nice if whatever rules we chose would play nice with markdown rules.

jeromesimeon · 2019-08-09T17:43:16Z

Also: we will have to apply a symmetric .trim() to the sample.

jeromesimeon · 2019-08-09T17:46:23Z

A couple of other idea we could try to improve the situation:

enforce that inline clauses have to be paragraphs (they cannot be in the middle of a paragraph) and apply commonmark paragraph rules to those
perform whitespace normalisation inside the Common mark AST resulting from parsing the grammar.

Note: I believe we use code blocks in the rich text editor to identify nested clause text, so we might want to apply a whitespace normalization strategy which is consistent with that.

dselman · 2019-08-12T22:51:39Z

I am going to experiment with some new code that generates a grammar direct from the commonmark AST. I think this will give us much more control over how we define the grammar to align whitespace handling.

It will also allow us to handle commonmark lists properly, so that the parser is aware of them.

dselman · 2019-08-14T02:43:27Z

I now have code that can parse Commonmark to a JSON and validate the JSON using a CTO model (CommonMark Model):
https://github.com/accordproject/markdown-transform/blob/master/src/CommonMarkParser.js

The pipeline is:

Markdown text -> CommonMark AST -> CommonMark XML -> 
   SAX Parser -> Instance of Commonmark CTO Model

The CommonMark CTO model is also published here:
https://models.accordproject.org/commonmark/markdown.html

Note that the detour via CommonMark XML is tactical and could be removed, but it greatly simplifies the code.

Using this CTO-based intermediate model will make the core markdown-transform and cicero projects independent of the Slate DOM (even the JSON serialisation).

The next step is to visit the JSON and generate markdown as well as a Nearley parser - bypassing the existing parser generator (which is based on the Nearley grammar for .tem files). Generating the parser from an instance of the Commonmark Model should allow us to handle lists and whitespace much more cleanly, rather than purely dealing with chunks of plain text.

In an upstream project we can then create a visitor that converts an instance of the CommonMark Model to a Slate DOM.

dselman · 2019-08-14T23:44:22Z

I've merged the changes into the master branch for markdown-transform to parse .md to a Concerto AST, as well as convert an instance of the Concerto AST to md text. I also added a lot of unit tests.

Next I will start work on a new Nearley parser generator in cicero-core.

dselman · 2019-08-19T11:43:35Z

The Nearley parser generation has raised some interesting questions for lists.

E..g If we have a static list in the tem file:

1. This
1. Is
1. an ordered list

vs. a list that references an array type:

Based on following reference data
1. Rate is between {{minRate}} and {{maxRate}}

When minRate and maxRate are fields on a concept and the template model contains an array of these concepts.

In the second example the generated parser could make the list optional and allow any number of entries, binding each entry to the array.

jeromesimeon · 2019-10-15T13:22:11Z

This might be an instance where the kind of "whitespace control" offered by Handlebars help: https://handlebars-draft.knappi.org/guide/expressions.html#whitespace-control

I'm still unsure whether there isn't a simpler way to provide for the right kind of whitespace behaviour without adding that level of complexity.

dselman · 2020-02-29T09:16:59Z

Can we close this now @jeromesimeon ?

jeromesimeon · 2020-07-05T00:24:57Z

I think we can consider this resolved in the new parser, which follows strict commonmark whitespace parsing rules.

jeromesimeon assigned dselman Aug 9, 2019

dselman mentioned this issue Aug 21, 2019

(feat) support for markdown templates #405

Merged

jeromesimeon added 0.20 Difficulty: Challenging Type: Feature Request 🛍️ New feature or request labels Sep 27, 2019

jeromesimeon self-assigned this Oct 15, 2019

jeromesimeon removed the 0.20 label Oct 30, 2019

jeromesimeon added the 1.0 label Dec 11, 2019

jeromesimeon mentioned this issue Jul 5, 2020

Revised Cicero parser #547

Closed

jeromesimeon closed this as completed Jul 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Markdown support and whitespace handling #404

Markdown support and whitespace handling #404

jeromesimeon commented Aug 9, 2019 •

edited

Loading

jeromesimeon commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019

mttrbrts commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019 •

edited

Loading

mttrbrts commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019 •

edited

Loading

jeromesimeon commented Aug 9, 2019 •

edited

Loading

dselman commented Aug 12, 2019

dselman commented Aug 14, 2019 •

edited

Loading

dselman commented Aug 14, 2019 •

edited

Loading

dselman commented Aug 19, 2019

jeromesimeon commented Oct 15, 2019

dselman commented Feb 29, 2020

jeromesimeon commented Jul 5, 2020 •

edited

Loading

Markdown support and whitespace handling #404

Markdown support and whitespace handling #404

Comments

jeromesimeon commented Aug 9, 2019 • edited Loading

jeromesimeon commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019

mttrbrts commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019 • edited Loading

mttrbrts commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019

jeromesimeon commented Aug 9, 2019 • edited Loading

jeromesimeon commented Aug 9, 2019 • edited Loading

dselman commented Aug 12, 2019

dselman commented Aug 14, 2019 • edited Loading

dselman commented Aug 14, 2019 • edited Loading

dselman commented Aug 19, 2019

jeromesimeon commented Oct 15, 2019

dselman commented Feb 29, 2020

jeromesimeon commented Jul 5, 2020 • edited Loading

jeromesimeon commented Aug 9, 2019 •

edited

Loading

jeromesimeon commented Aug 9, 2019 •

edited

Loading

jeromesimeon commented Aug 9, 2019 •

edited

Loading

jeromesimeon commented Aug 9, 2019 •

edited

Loading

dselman commented Aug 14, 2019 •

edited

Loading

dselman commented Aug 14, 2019 •

edited

Loading

jeromesimeon commented Jul 5, 2020 •

edited

Loading