Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdown support and whitespace handling #404

Closed
jeromesimeon opened this issue Aug 9, 2019 · 15 comments
Closed

Markdown support and whitespace handling #404

jeromesimeon opened this issue Aug 9, 2019 · 15 comments

Comments

@jeromesimeon
Copy link
Member

jeromesimeon commented Aug 9, 2019

There are a number of issues with whitespace occurring when trying to parse grammar and templates through commonmark.

The commonmark parser applies (arguably arcane) whitespace rules when parsing the source text containing markdown.

For instance, the whitespace at the end of a paragraph is stripped. This can be seen on the CommonMark "Dingus"

From the snippet (note the trailing whitespace after here. ):

## Try CommonMark

You can try CommonMark here.  

the corresponding commonmark AST will remove the whitespace:

<document xmlns="http://commonmark.org/xml/1.0">
  <heading level="2">
    <text>Try CommonMark</text>
  </heading>
  <paragraph>
    <text>You can try CommonMark here.</text>
  </paragraph>
</document>

This can be problematic when trying to apply the same rules inside a template grammar. An example of that can be found with the copyright-notice which has a nested clause with some trailing whitespace:

3. {{#paymentClause}}Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of {{amountText}} ({{amount}}) upon execution of this Agreement, payable as follows: {{paymentProcedure}}. {{/paymentClause}}

Will be parsed by commonmark as follows:

      <paragraph>
        <text>{{#paymentClause}}Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of {{amountText}} ({{amount}}) upon execution of this Agreement, payable as follows: {{paymentProcedure}}. {{/paymentClause}}</text>
      </paragraph>

The relevant part is the whitespace just before the end of the paymentClause:

{{paymentProcedure}}. {{/paymentClause}}
                     ^    

Which is not stripped, while the corresponding sample will have that whitespace stripped.

3. Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of one hundred US Dollars (100 USD) upon execution of this Agreement, payable as follows: bank transfer.

returns the AST:

      <paragraph>
        <text>Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of one hundred US Dollars (100 USD) upon execution of this Agreement, payable as follows: bank transfer.</text>
      </paragraph>

Notice the bank transfer.</text> which does not have whitespace after the .

@jeromesimeon
Copy link
Member Author

This means that applying a process where we parse both the grammar and sample text with commonmark before handing the text over to the current Cicero parser will fail in some cases. this can be seen on the current branch for markdown support at: https://github.com/accordproject/cicero/tree/ds-markdown-support

@jeromesimeon
Copy link
Member Author

Some possible workaround:

  • Make the Cicero parser less whitespace sensitive (with what rules may be somewhat tricky to decide, as those rules will have to be consistent with markdown parsing, but it could be a good feature anyway)
  • Normalize whitespace in the grammar and sample before handing it to the common mark parser (again rules would have to be markdown-aware)
  • It is possible that those issues become easier to handle once we switch to the new template syntax, so this could be handled then.

@mttrbrts
Copy link
Member

mttrbrts commented Aug 9, 2019

This sounds like an generalised instance of this issue, #197

It should be straightforward to relax the Cicero parser to be tolerant of leading and trailing whitespace, right?

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Aug 9, 2019

It may depend on the definition of straightforward... In the example above, the "trailing" whitespace in the grammar is somewhere in between {{paymentProcedure}}. and {{/paymentClause}}

@mttrbrts
Copy link
Member

mttrbrts commented Aug 9, 2019

Could we not just run String.trim() on the contents of the inline clause definition before parsing?

@jeromesimeon
Copy link
Member Author

It's possible, my concern is how do those interact with whitespace rules from commonmark. Hypothetically, the inline clause could have a paragraph creation, something like:

3. {{#paymentClause}}Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of {{amountText}} ({{amount}}) upon execution of this Agreement, payable as follows: {{paymentProcedure}}.

{{/paymentClause}}More text here...

With a corresponding text that looks as follows:

3. Payment. As consideration in full for the rights granted herein, Licensee shall pay Licensor a one-time fee in the amount of one hundred US Dollars (100 USD) upon execution of this Agreement, payable as follows: bank transfer.

More text here...

In which case, we may not want to simply trim the end of the inline clause. Although of course we could argue this is not a well formed example.

In other words, it would be nice if whatever rules we chose would play nice with markdown rules.

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Aug 9, 2019

Also: we will have to apply a symmetric .trim() to the sample.

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Aug 9, 2019

A couple of other idea we could try to improve the situation:

  • enforce that inline clauses have to be paragraphs (they cannot be in the middle of a paragraph) and apply commonmark paragraph rules to those
  • perform whitespace normalisation inside the Common mark AST resulting from parsing the grammar.

Note: I believe we use code blocks in the rich text editor to identify nested clause text, so we might want to apply a whitespace normalization strategy which is consistent with that.

@dselman
Copy link
Contributor

dselman commented Aug 12, 2019

I am going to experiment with some new code that generates a grammar direct from the commonmark AST. I think this will give us much more control over how we define the grammar to align whitespace handling.

It will also allow us to handle commonmark lists properly, so that the parser is aware of them.

@dselman
Copy link
Contributor

dselman commented Aug 14, 2019

I now have code that can parse Commonmark to a JSON and validate the JSON using a CTO model (CommonMark Model):
https://github.com/accordproject/markdown-transform/blob/master/src/CommonMarkParser.js

The pipeline is:

Markdown text -> CommonMark AST -> CommonMark XML -> 
   SAX Parser -> Instance of Commonmark CTO Model

The CommonMark CTO model is also published here:
https://models.accordproject.org/commonmark/markdown.html

Note that the detour via CommonMark XML is tactical and could be removed, but it greatly simplifies the code.

Using this CTO-based intermediate model will make the core markdown-transform and cicero projects independent of the Slate DOM (even the JSON serialisation).

The next step is to visit the JSON and generate markdown as well as a Nearley parser - bypassing the existing parser generator (which is based on the Nearley grammar for .tem files). Generating the parser from an instance of the Commonmark Model should allow us to handle lists and whitespace much more cleanly, rather than purely dealing with chunks of plain text.

In an upstream project we can then create a visitor that converts an instance of the CommonMark Model to a Slate DOM.

@dselman
Copy link
Contributor

dselman commented Aug 14, 2019

I've merged the changes into the master branch for markdown-transform to parse .md to a Concerto AST, as well as convert an instance of the Concerto AST to md text. I also added a lot of unit tests.

Next I will start work on a new Nearley parser generator in cicero-core.

@dselman
Copy link
Contributor

dselman commented Aug 19, 2019

The Nearley parser generation has raised some interesting questions for lists.

E..g If we have a static list in the tem file:

1. This
1. Is
1. an ordered list

vs. a list that references an array type:

Based on following reference data
1. Rate is between {{minRate}} and {{maxRate}}

When minRate and maxRate are fields on a concept and the template model contains an array of these concepts.

In the second example the generated parser could make the list optional and allow any number of entries, binding each entry to the array.

@jeromesimeon
Copy link
Member Author

This might be an instance where the kind of "whitespace control" offered by Handlebars help: https://handlebars-draft.knappi.org/guide/expressions.html#whitespace-control

I'm still unsure whether there isn't a simpler way to provide for the right kind of whitespace behaviour without adding that level of complexity.

@dselman
Copy link
Contributor

dselman commented Feb 29, 2020

Can we close this now @jeromesimeon ?

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Jul 5, 2020

I think we can consider this resolved in the new parser, which follows strict commonmark whitespace parsing rules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants