New pathological parsing case (quadratic) #178

jgm · 2017-01-04T06:37:55Z

'a**b' + 'c* ' * n

That is,

a**bc* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* c* ...

The text was updated successfully, but these errors were encountered:

jgm · 2017-01-04T06:45:04Z

Note, we don't get a similar problem with

a***bc* c* c* c* c* ...

because in this case the opening delimiter span quickly gets used up for matches. In the above case, it does not, because of the "multiple of 3" rule. So we have to keep coming back to it.

jgm · 2017-01-04T07:12:19Z

The problem is that when a match fails because of the "multiple of 3" rule, we can't set openers_bottom to rule out future matches to that opener.

jgm · 2017-01-04T07:13:37Z

One possible solution would be to make openers_bottom a 2D array, with the first dimension the delim char and the second the original delim run length % 3.

The new "multiple of 3" rule defeats one of our optimizations.

mity · 2017-01-04T10:50:33Z

@jgm Sorry to comment something what's closed but... where is the quadratic complexity coming from?

There is only one potential opener, the initial **. All the other marks are only potential closers so they do not need to be remembered in the openers list. So the list of potential openers should always have just a single member.

raphlinus · 2017-01-04T16:12:58Z

@mity Yeah, my current code is also not quadratic on this, it doesn't push runs that are can_close but not can_open. But you can make my code quadratic too with a similar test case, say `'***a' * n.

I haven't looked at the patch code or played with the implementation, but is it enough to parametrize on just the match char and the length mod 3? It seems to me there are more cases, either the potential match is both can_open and can_close, or not. You'll get a different match depending on whether the closer is both.

jgm · 2017-01-04T16:36:30Z

+++ Martin Mitáš [Jan 04 17 02:50 ]:

***@***.*** Sorry to comment something what's closed but... where is the quadratic complexity coming from? There is only one potential opener, the initial **. All the other marks are only potential closers so they do not need to be remembered in the openers list. So the list of potential openers should always have just a single member.

In cmark we just have one list of potential openers and closers, which we traverse backwards looking for an opener for each successive closer. Of course we immediately skip over elements in the list that are not marked can_open, but we still traverse them. I avoided unnecessary traversals before by keeping track of a lower bound for an opener. So, if we failed to match an opener for a * delimiter at position 6, then when we're matching an opener at position 10, we don't need to go back before position 6. But the new "multiple of 3" rule messed that up. When we failed to match an opener because of that rule, we couldn't revise the lower bound for * openers, since the same opener might match a delimiter run with a different length. My solution was to parameterize not just on the character type but on the length mod 3. How do you handle this in md4c? I noted that you didn't have the quadratic complexity in this case.

mity · 2017-01-04T16:36:37Z

@raphlinus Well, I don't know internals of Cmark. I'm just interested because of my own iplementation where I do not need any complex handling beyond the algorithm described in appendix of CommonMark spec, and I just want to be sure I haven't overlooked something...

My way of thinking is as follows:

Lets ignore _ and consider only * as the rule of three applies only to that.
To make a quadratic complexity, you have to saturate the list of unresolved openers with arbitrary count of marks which are then never actually resolved but all subsequent potential closer marks have still to walk the list and check them.
Openers not inside a word cannot saturate the list because they may be resolved on first occasion with any closer with the same character.
Openers inside a word are always also a potential closers. Given the rule of three and its modulo-3 arithmetics, there cannot be more then three such unresolved marks in the list of potential openers at the same time. And that makes something like O(3 * n), not O(n^2).

Therefore I am wondering whether I miss something important or whether the fix is actually a workaround of some another issue inside Cmark which means the list is saturated with something what should not be in the list on the 1st place.

mity · 2017-01-04T16:38:16Z

@jgm That explains it. Ignore mu latest post then, I wrote that before seeing your comment.

jgm · 2017-01-04T16:51:17Z

+++ Raph Levien [Jan 04 17 08:12 ]:

I haven't looked at the patch code or played with the implementation, but is it enough to parametrize on just the match char and the length mod 3? It seems to me there are more cases, either the potential match is both can_open and can_close, or not. You'll get a different match depending on whether the closer is both.

Let's see. The multiple-of-3 rule only kicks in when either the opening or closing delimiter run is BOTH can-open and can-close. So the condition for a match is: 1. delim char is the same, AND 2. either (a) the opening delimiter run is ONLY can-open and the closing run is ONLY can-close, or (b) the sum of their lengths mod 3 is not 0. We're storing our knowledge about the last delimiter run that meets conditions 1 and 2(b). It would be better to take into account 2(a) as well. Perhaps we could make the second array parameter something like (opening delimiter run is only can-open and closing is only can-close) ? 3 : (sum of lengths mod 3)

raphlinus · 2017-01-04T18:06:13Z

@jgm I think that's enough information. When touching a closing run that is can_close only, you look at both the mod-3 and the can_open case, and take the later of the two? I was thinking of an array of 6 rather than an array of 4, where that logic was done at write time, but I think yours is better. Thanks!

commonmark/commonmark.js@903f35e Author: John MacFarlane <[email protected]> Date: Fri Aug 30 16:24:48 2019 -0700 Fix pathological case commonmark/cmark#178. commonmark/commonmark.js@34b8f65 Author: John MacFarlane <[email protected]> Date: Fri Aug 30 16:28:09 2019 -0700

jgm added a commit that referenced this issue Jan 4, 2017

Added pathological test for #178 (quadratic).

b5c4a22

The new "multiple of 3" rule defeats one of our optimizations.

jgm closed this as completed in a8d5bd4 Jan 4, 2017

kivikakk mentioned this issue Jul 11, 2017

Quadratic blowup on pathological input. kivikakk/comrak#24

Closed

mity mentioned this issue Sep 27, 2017

Pathological inputs commonmark/commonmark.js#129

Open

jgm mentioned this issue Aug 30, 2019

Parsing emphasis takes quadratic time commonmark/commonmark.js#172

Closed

jgm added a commit to commonmark/commonmark.js that referenced this issue Aug 30, 2019

Fix pathological case commonmark/cmark#178.

34b8f65

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New pathological parsing case (quadratic) #178

New pathological parsing case (quadratic) #178

jgm commented Jan 4, 2017

jgm commented Jan 4, 2017

jgm commented Jan 4, 2017

jgm commented Jan 4, 2017

mity commented Jan 4, 2017

raphlinus commented Jan 4, 2017

jgm commented Jan 4, 2017 via email

mity commented Jan 4, 2017

mity commented Jan 4, 2017

jgm commented Jan 4, 2017 via email

raphlinus commented Jan 4, 2017

New pathological parsing case (quadratic) #178

New pathological parsing case (quadratic) #178

Comments

jgm commented Jan 4, 2017

jgm commented Jan 4, 2017

jgm commented Jan 4, 2017

jgm commented Jan 4, 2017

mity commented Jan 4, 2017

raphlinus commented Jan 4, 2017

jgm commented Jan 4, 2017 via email

mity commented Jan 4, 2017

mity commented Jan 4, 2017

jgm commented Jan 4, 2017 via email

raphlinus commented Jan 4, 2017