Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

manulari · 2024-10-28T10:36:13Z

Describe the bug

remarkShortcodes.js uses the blockTokenizers extension mechanism of remark-parse. The documentation for the last version of remark-parse which supported this feature can be found here. Quoting from there (emphasis mine):

Tokenizers test whether a document starts with a certain syntactic entity.

I understand this to mean that any registered block tokenizer (like the one registered by remarkShortcodes.js) is tried at the beginning of each "block" in a markdown document. (Probably roughly corresponding to blocks of lines separated by two consecutive newlines, but I couldn't find clear documentation on that.)

The current version of remarkShortcodes.js is a bit complex, but I think essentially it looks for matches to the shortcode patterns at all paragraphs in the remaining document passed in by remark and then chooses the earliest of those matches.
This means that the bevaviour is both incorrect, and runtime is quadratic in the length of the source document (or at least the number of paragraphs). A previous version of remarkShortcodes.js had code that was more correct.

This currently shows up as the "Sent invalid data to remark"-warning message, which has been reported a couple of times. At least 1, 2, 3 are related, probably more.

This warning message is emitted by a try-catch block in remarkShortcodes.js that actually papers over the logic bug in remarkShortcodes.js. Specifically, this warning message will be emitted whenever there is a shortcode that matches somewhere in the document, but not at the current parsing position. The eat(match[0]) then causes remark to throw an error.

To Reproduce

Look at the source code for remarkShortcodes.js and the documentation for remark block tokenizers.

Expected behavior

I believe a more correct behaviour would be achieved in the following way:

Revert remarkShortcodes.js to this version.
Clearly document and/or enforce that patterns for all components registered with registerEditorComponent have to start with ^, to only match at the beginning of the block currently being parsed by remark.
Clearly document and/or enforce that these patterns should not use RegExp multiline mode. Document that they will be run only at block boundaries. (Why should they never use multiline mode?: This would defeat the ^ at the beginning of the pattern, as the pattern will now again match at every line of the document.)
Perhaps check in remarkShortcodes.js that match.index == 0, and in case this is violated, error out early on with a descriptive error message that says that a registered editor component is using an incorrect pattern. (Any such use of a pattern not starting with ^ will mess up the whole shortcode parsing code.)

Applicable Versions:

main branch

The text was updated successfully, but these errors were encountered:

manulari added the type: bug code to address defects in shipped code label Oct 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

manulari commented Oct 28, 2024

Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

Comments

manulari commented Oct 28, 2024