Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

Open
manulari opened this issue Oct 28, 2024 · 0 comments
Labels
type: bug code to address defects in shipped code

Comments

@manulari
Copy link

Describe the bug

remarkShortcodes.js uses the blockTokenizers extension mechanism of remark-parse. The documentation for the last version of remark-parse which supported this feature can be found here. Quoting from there (emphasis mine):

Tokenizers test whether a document starts with a certain syntactic entity.

I understand this to mean that any registered block tokenizer (like the one registered by remarkShortcodes.js) is tried at the beginning of each "block" in a markdown document. (Probably roughly corresponding to blocks of lines separated by two consecutive newlines, but I couldn't find clear documentation on that.)

The current version of remarkShortcodes.js is a bit complex, but I think essentially it looks for matches to the shortcode patterns at all paragraphs in the remaining document passed in by remark and then chooses the earliest of those matches.
This means that the bevaviour is both incorrect, and runtime is quadratic in the length of the source document (or at least the number of paragraphs). A previous version of remarkShortcodes.js had code that was more correct.

This currently shows up as the "Sent invalid data to remark"-warning message, which has been reported a couple of times. At least 1, 2, 3 are related, probably more.

This warning message is emitted by a try-catch block in remarkShortcodes.js that actually papers over the logic bug in remarkShortcodes.js. Specifically, this warning message will be emitted whenever there is a shortcode that matches somewhere in the document, but not at the current parsing position. The eat(match[0]) then causes remark to throw an error.

To Reproduce

Look at the source code for remarkShortcodes.js and the documentation for remark block tokenizers.

Expected behavior

I believe a more correct behaviour would be achieved in the following way:

  • Revert remarkShortcodes.js to this version.
  • Clearly document and/or enforce that patterns for all components registered with registerEditorComponent have to start with ^, to only match at the beginning of the block currently being parsed by remark.
  • Clearly document and/or enforce that these patterns should not use RegExp multiline mode. Document that they will be run only at block boundaries. (Why should they never use multiline mode?: This would defeat the ^ at the beginning of the pattern, as the pattern will now again match at every line of the document.)
  • Perhaps check in remarkShortcodes.js that match.index == 0, and in case this is violated, error out early on with a descriptive error message that says that a registered editor component is using an incorrect pattern. (Any such use of a pattern not starting with ^ will mess up the whole shortcode parsing code.)

Applicable Versions:

main branch

@manulari manulari added the type: bug code to address defects in shipped code label Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug code to address defects in shipped code
Projects
None yet
Development

No branches or pull requests

2 participants
@manulari and others