Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777
IMO commonmark.js / cmark are wrong here. The spec is super clear about this.
I agree with you.
Does your opinion come from the definition of space and tab? To arrive at our interpretation of the spec, I think the specification must be interpreted strictly, like a robot, without unnecessary distractions. Leading [space]s or [tab]s are skipped:

[space]: #space
[tab]: #tab

I wish additional test cases were added showing that Unicode whitespace should be preserved. The current examples and test cases use only ASCII space.
It was not clear to me what this issue was about. Are you saying one example of Unicode whitespace, at the beginning and end of a line, would solve this issue for you?
I don’t see it. The spec is very clear about characters: https://spec.commonmark.org/0.31.2/#characters-and-lines. Specs are always a bit technical. They must be interpreted strictly.
I want an example like the following to be available as an officially-provided example or test case:
The syntax itself is so. It might be due to the word choices ("space" and "tab", both are very common words) that library authors have incorrectly implemented the non-Unicode whitespace handling. (or due to the lack of understanding of
Agreed, there should be a test case with, say, NBSP at the beginning and end of a paragraph, showing that it is not stripped.
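A minimal sketch of what such a test case could check, assuming the strict reading in which only U+0020 and U+0009 are stripped from a paragraph line. `render_paragraph` is a hypothetical stand-in written for illustration, not the API of cmark, commonmark.js, or micromark, and the expected HTML is an assumption based on that reading rather than an official spec example.

```python
import html

NBSP = "\u00a0"


def render_paragraph(line: str) -> str:
    """Hypothetical single-line paragraph renderer.

    Strips only ASCII space (U+0020) and tab (U+0009) from both ends,
    then wraps the escaped content in <p>...</p>.
    """
    content = line.strip(" \t")
    return f"<p>{html.escape(content)}</p>\n"


# Proposed test case: NBSP at the beginning and end of a paragraph survives.
markdown = f"{NBSP}foo{NBSP}"
expected_html = f"<p>{NBSP}foo{NBSP}</p>\n"
assert render_paragraph(markdown) == expected_html

# Contrast: ASCII space and tab at the edges are removed.
assert render_paragraph("  foo\t") == "<p>foo</p>\n"
```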
commonmark/commonmark.js#261
commonmark/commonmark.js#289
https://spec.commonmark.org/0.31.2/#paragraphs
The specification doesn't mention non-ASCII Unicode whitespace, no-break space, form feed, or vertical tab, but some implementations treat them like ASCII space and tab and remove trailing or leading ones in paragraphs.
cmark preserves U+00A0 NBSP too (along with U+3000 and U+2003). However, cmark removes trailing Form Feeds and Vertical Tabs.
commonmark.js removes all characters trimmed by JavaScript's `trim()`: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar
micromark doesn't touch other than ASCII space (U+0020) or tab (U+0009).
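To make the difference between the two trimming policies concrete, here is a rough Python model. It assumes the set of characters that ECMAScript's `String.prototype.trim()` removes (the WhiteSpace and LineTerminator productions) and compares it with trimming only U+0020 and U+0009. The names `JS_TRIM_CHARS`, `trim_like_js`, and `trim_space_tab` are made up for this sketch; it does not reproduce the actual code of commonmark.js or micromark.

```python
# Characters removed by ECMAScript's String.prototype.trim():
# the WhiteSpace and LineTerminator productions of the lexical grammar.
JS_TRIM_CHARS = (
    "\u0009\u000b\u000c\ufeff"                        # TAB, VT, FF, ZWNBSP
    "\u0020\u00a0\u1680\u202f\u205f\u3000"            # Unicode category Zs ...
    + "".join(chr(c) for c in range(0x2000, 0x200B))  # ... including U+2000..U+200A
    + "\u000a\u000d\u2028\u2029"                      # LF, CR, LS, PS
)


def trim_like_js(text: str) -> str:
    """Model of a JS-style trim(): strips every character in JS_TRIM_CHARS."""
    return text.strip(JS_TRIM_CHARS)


def trim_space_tab(text: str) -> str:
    """Strict reading of the spec: strip only ASCII space and tab."""
    return text.strip(" \t")


samples = {
    "Vertical Tab (U+000B)": "\u000b",
    "Form Feed (U+000C)": "\u000c",
    "NBSP (U+00A0)": "\u00a0",
    "EM Space (U+2003)": "\u2003",
    "Ideographic Space (U+3000)": "\u3000",
}

for name, ch in samples.items():
    text = f"{ch}x{ch}"
    removed_by_js_trim = trim_like_js(text) == "x"
    kept_by_strict_trim = trim_space_tab(text) == text
    print(f"{name}: removed by JS-style trim={removed_by_js_trim}, "
          f"kept by space/tab trim={kept_by_strict_trim}")
```

Under this model, every sample character is stripped by the JS-style trim and left untouched by the space/tab-only trim, which is the discrepancy this issue describes.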
Other implementations:
https://babelmark.github.io/?text=%0BVertical+Tab+(U%2B000B)%0B%0A%0A%0CLine+Separator+(U%2B000C)%0C%0A%0A+Space+(U%2B0020)+%0A%0A%09Tab+(U%2B0009)%09%0A%0A%C2%A0NBSP+(U%2B00A0)%C2%A0%0A%0A%E2%80%83EM+Space+(U%2B2003)%E2%80%83%0A%0A%E2%80%A8Line+Separator+(U%2B2028)%E2%80%A8%0A%0A%E2%80%A9Paragraph+Separator+(U%2B2029)%E2%80%A9%0A%0A%E3%80%80Chinese%2FJapanese+Space+(U%2B3000)%E3%80%80%0A