Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revised Cicero parser #547

Closed
jeromesimeon opened this issue Apr 23, 2020 · 10 comments
Closed

Revised Cicero parser #547

jeromesimeon opened this issue Apr 23, 2020 · 10 comments

Comments

@jeromesimeon
Copy link
Member

jeromesimeon commented Apr 23, 2020

Is your feature request related to a problem? Please describe.
The current Cicero parser is fairly hard to maintain, leads to performance problems (#523), and is difficult to make extensible. It also doesn't know about markdown (notably is not aware of whitespace rules) which can lead to confusing errors for users.

Describe the solution you'd like
A full revision, possibly including switching to a separate parsing technology, which would have the following properties:

  • Expressive enough for our use cases
  • Lightweight (small library and memory footprint during parsing)
  • Composable to facilitate extensions
  • Can be used in the browser

Describe alternatives you've considered
There are lots of parsing technologies available, with different trade-offs. Some of the options:

  • Stick to generate a parser with Nearley
  • Switch to another parser generator (e.g., pegjs or antlr)
  • Switch to a different kind of parsing technology (e.g., regex!? or maybe more realistically a parser combinator library like https://github.com/jneen/parsimmon)

Additional context
Add any other context or screenshots about the feature request here.

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Apr 23, 2020

Checklist

Parser foundations

  • Input Parsing Expressivity
    • regular expressions
    • sequences
    • choice
    • repetition
    • text
    • quoted text
    • formats
  • Output AST construction
  • Error reporting with line handling

Note: any ambiguity handling has to be written within the parser itself through combinators

Functionality

  • Template features
    • Text blocks
    • Variables
      • Double
      • String
      • Integer
      • DateTime
      • Boolean
      • Relationships
      • Complex Types
      • "Current" variable (NEW)
    • Ergo formulas
    • Blocks
      • Conditional blocks
      • With blocks
      • Clause blocks
      • List blocks
      • Join blocks
      • Optional blocks (NEW)
      • Clause references (NEW)
    • Formatting
      • DateTime format
      • Amount format
      • Monetary amount format
  • Error handling
    • Report errors with a message which includes part of the input text which did not parse
    • Raise proper AP exception
    • Investigate quality/accuracy of error reporting
  • Markdown handling
  • Current time handling for DateTime variables

Integration

  • Validate parsing result against Concerto model for the template
  • Decide on CiceroMark extension for templates (Also see Template Grammar to Slate DOM with Wrapped Variables markdown-transform#151)
  • Decide how we create CiceroMark for templates
  • Changes to Markdown transform
    • parse functionality to go from CiceroMark templates + CiceroMark Instances to JSON variables
    • Adjust markdown template parsing to use CiceroMark for templates
    • Deprecate markdown with wrapped variables? (aka CiceroEdit)
    • Decide how to integrate template parsing in the CLI (Currently added as a transform to data)
  • Changes in Ergo for drafting directly to CiceroMark
  • Changes in Cicero
    • Generate CiceroMark for templates
    • Switch parse to use new parser
    • Switch draft to use new parser
    • Remove text-based roundtrips
    • Remove old parser infrastructure
    • Register compiled formula for drafting
    • Code cleanup

Performances

System

  • Browser support
  • Windows support

Libraries & Tooling

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Apr 23, 2020

Added a test with a "large" document (it's not that large, only about 300k, but that's about 1600 lines and essentially based on the "huge" document in the Slate demo).

Parsing time (including in the error case) looks pretty good so far:

bash-3.2$ du -k test/data/large.md 
284	test/data/large.md
bash-3.2$ du -k test/data/largeErr.md 
284	test/data/largeErr.md
bash-3.2$ time ./bin/index.js parse --template ./test/data/templateLarge.json --input ./test/data/large.md 
4:25:09 PM - info:
{
  "$class": "org.test.MyContract",
  "clause3": {
    "$class": "org.test.MyClause",
    "seller": "Steve",
    "buyer": "Betty",
    "amount": 3131,
    "currency": "EUR",
    "forceMajeure": false
  },
  "penalty": 10.99
}
./bin/index.js parse --template ./test/data/templateLarge.json --input   0.22s user 0.05s system 104% cpu 0.260 total
bash-3.2$ time ./bin/index.js parse --template ./test/data/templateLarge.json --input ./test/data/largeErr.md 
4:25:51 PM - error: {"status":false,"index":{"offset":145271,"line":803,"column":40},"expected":["/\"[^\"]*\"/"]}
./bin/index.js parse --template ./test/data/templateLarge.json --input   0.21s user 0.05s system 104% cpu 0.249 total
bash-3.2$ 

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Apr 23, 2020

Warning: this is a test done in isolation for the parser alone outside of Cicero

Pretty good news on the memory usage front (and I believe excellent news on the runtime performances).

For a small template: memory usage shot at 3M (most of which is code)
Screenshot 2020-04-23 at 5 30 29 PM

For a 300k template: memory usage shoots up to 10M (the extra memory seems to be mostly static strings). It's a fairly rough/quick estimate.
Screenshot 2020-04-23 at 5 28 02 PM
Screenshot 2020-04-23 at 5 28 18 PM

Runtime performance seems to be well under 10ms even on the large template (done with 100 consecutive parses, but the time even includes loading the file from disc).

bash-3.2$ time ./bin/index.js parse --template ./test/data/templateLarge.json --input ./test/data/large.md 
<plate ./test/data/templateLarge.json --input ./test/data/large.md 
5:19:33 PM - info:
{
  "$class": "org.test.MyContract",
  "clause3": {
    "$class": "org.test.MyClause",
    "seller": "Steve",
    "buyer": "Betty",
    "amount": 3131,
    "currency": "EUR",
    "forceMajeure": false
  },
  "penalty": 10.99
}
Average time:  14.012748956680298
5:19:34 PM - info:
{
  "$class": "org.test.MyContract",
  "clause3": {
    "$class": "org.test.MyClause",
    "seller": "Steve",
    "buyer": "Betty",
    "amount": 3131,
    "currency": "EUR",
    "forceMajeure": false
  },
  "penalty": 10.99
}
Average time:  8.532179951667786
...
5:20:07 PM - info:
{
  "$class": "org.test.MyContract",
  "clause3": {
    "$class": "org.test.MyClause",
    "seller": "Steve",
    "buyer": "Betty",
    "amount": 3131,
    "currency": "EUR",
    "forceMajeure": false
  },
  "penalty": 10.99
}
Average time:  3.1058295691013336

@jeromesimeon
Copy link
Member Author

Warning: this is a test done in isolation for the parser alone outside of Cicero

Pretty good news on the memory usage front (and I believe excellent news on the runtime performances).

Instrumented version of ./bin/index.js used for those tests.
index.txt

@jeromesimeon
Copy link
Member Author

jeromesimeon commented May 5, 2020

Some possible additional questions or work item (edited from feedback during review with @dselman @irmerk @DianaLease )

  • Move the text generation closer to the parsing?
  • Parse templates themselves using the same parser technology Parse templates as a markdown extension
  • Can we generate a BNF? Or some lang. independent description of the grammar?
  • Can we make the parsing table extensible by users?
  • Decide whether this should live in markdown transform or not
  • Can we also redo the Concerto parser using the same parser technology?

@jeromesimeon
Copy link
Member Author

Another useful benchmarking point, courtesy of @mttrbrts

https://sap.github.io/chevrotain/performance/

Screenshot 2020-05-06 at 5 33 03 PM

@jeromesimeon
Copy link
Member Author

jeromesimeon commented May 21, 2020

List of changes (⚠️ update: 2020-07-01)

Changes to Cicero template grammar

  • Template Gammars are now a proper extension to commonmark
  • variables, conditional template blocks, join template blocks, with template blocks and formulas are all treated as commonmark inlines (i.e., elements inside a leaf block like a paragraph or heading)
  • clause template blocks, and list template blocks are treated as commonmark blocks (i.e., similar to blockquotes in commonmark) and may contain arbitrary markdown like paragraphs, other lists, etc. For those block markers must start on a new line and can be followed by whitespace but not text e.g.,:
{{#clause payment}}
### Pricing

This is text indicating payment of {{price}}
{{/clause}}

{{#ulist items}
This is an item with

multiple paragraphs with an item {{price}}
{{/ulist}}

Changes to CiceroMark DOM

New CiceroMark model: https://js-new-ciceromark--accordproject-models.netlify.app/markdown/[email protected]

  • Simplified hierarchy
  • New terminology distinguishes: variables, blocks and formulas
  • id and clauseid are now name
  • Distinguishes EnumVariable and FormattedVariable from other Variable nodes
  • ComputedVariable is now Formula
  • Additional information is available in the DOM: type of variables and blocks, enumValues for enumerated variables, code and dependencies information in formulas
  • ListVariable is now ListBlock and does not inherit from commonmark List node
  • ListBlock now have a name field
  • ConditionalVariable is now Conditional and the whenTrue and whenFalse branches can contain markdown (nodes rather than a string)

New TemplateMark DOM

https://js-new-ciceromark--accordproject-models.netlify.app/markdown/templatemark.html

  • Aligned with new CiceroMark and represents template grammar rather than sample.
  • Includes nodes for: variables (e.g., {{price}}), blocks (e.g., {{#if}}force majeure{{/if}}) and formulas (e.g., {{% 1+1 %}})

Changes to Slate DOM

  • id has been changed to name in variables nodes
  • clauseid has been changed to name in clause nodes
  • computed has been changed to formula in node types
  • the data part of the node has some additional information in some cases (e.g., enumValues) for enumerated variables
  • conditional node now can contain arbitrary content not just strings, including in the whenTrue and whenFalse data annotations
  • variable and clause nodes come with additional elementType information to indicate the type of the variable/clause.
  • variables corresponding to relationship also have an identifiedBy information to indicate the name of the class identifier for the relationship
  • formula nodes now have additional information code for the source code of the formula and dependencies for the list of variables the formula depends on

Changes to HTML serialization

  • id and clauseid have been changed to name
  • format and enumValues is available on corresponding variables

Changes to markdown-transform CLI

CiceroEdit Changes

  • There is a new ciceroedit format which corresponds to the (now deprecated) commonmark + wrapped variables and clauses. There is a migration path from ciceroedit to the format understood by the new parser.

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Jun 5, 2020

Template checklist:

  • acceptance-of-delivery
  • car-rental-tr (FIX: variables in code blocks no longer supported + clause block written as proper markdown block + sample had inconsistent variables)
  • certificate-of-incorporation
  • copyright-license (FIX: clause block written as proper markdown block)
  • demandforecast
  • docusign-connect
  • docusign-po-failure
  • eat-apples
  • empty
  • empty-contract
  • fixed-interests
  • fixed-interests-static
  • fragile-goods
  • full-payment-upon-demand
  • full-payment-upon-signature
  • hellomodule
  • helloworld
  • helloworldstate
  • installment-sale
  • interest-rate-swap
  • ip-payment
  • latedeliveryandpenalty
  • latedeliveryandpenalty-else
  • lateinvoicewithpayment
  • minilatedeliveryandpenalty
  • minilatedeliveryandpenalty-capped
  • minilatedeliveryandpenalty-payment
  • one-time-payment-tr (FIX: variables in code blocks no longer supported + sample had inconsistent variables)
  • online-payment-contract-tr (FIX: variables in code blocks no longer supported)
  • payment-upon-delivery
  • payment-upon-iot
  • payment-upon-signature
  • perishable-goods
  • promissory-note
  • promissory-note-md
  • rental-deposit
  • rental-deposit-with
  • saft
  • safte
  • sales-contract-ru (FIX: variables in code blocks no longer supported)
  • servicelevelagreement
  • simplelatedeliveryandpenalty
  • supplyagreement
  • supplyagreement-perishable-goods (FIX: variables in code blocks no longer supported)
  • volumediscount
  • volumediscountolist (FIX: list block written as proper markdown block)
  • volumediscountulist (FIX: list block written as proper markdown block)

@jeromesimeon
Copy link
Member Author

jeromesimeon commented Jun 26, 2020

Parsing issues

In Markdown Transform

In Ergo

In Cicero

In Web Components

Misc related issues

@jeromesimeon
Copy link
Member Author

This issue is resolved with the release of Cicero 0.21. We can open new issues if needed for features that didn't make the cut, or for any bug or new requirements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant