Leave callback not called for HTML block comment #202

step- · 2024-01-08T17:33:49Z

For the specific case of a one-line block comment (HTML block type 2) not followed by a blank line[1], MD4C fails to call the leave callback. Consequently, if another HTML comment follows the first comment, both end up being combined into a single block instead of staying as two separate blocks, each with its own callback.

[1] In my comments to issue #200 I remarked that the spec doesn't require a blank line to close a
type 2 HTML block.

DETAILS

I can't think of a simple way to demonstrate this issue using md2html alone, precisely because the lack of something can't be shown. Feeding md2html the following markdown:

<!-- C1 -->
<!-- C2 -->

outputs the input text so one would be inclined to think that everything is correct. However, what can't be seen is that there is no callback between lines C1 and C2, while there should be one. Instead, a callback after line C2 wrongly combines the two single-line blocks into a two-line block. Trace the code or add printf statements as needed to see the issue at work.

The text was updated successfully, but these errors were encountered:

mity · 2024-01-08T19:13:50Z

Thanks for the report. If I understand correctly, the issue is that the implementation sort of merges two HTML blocks together if they follow one after another, and calls the enter_block just once before the 1st one, and leave_block after the last one.

However I wonder what your expectations really are and whether we need to fix this at all. Consider especially that the type 7 of HTML blocks may contain not one but whole set of HTML tags; and it may even be an invalid HTML chunk after the initial tag condition. All what matters is whether it satisfies the opening and closing conditions.

I.e. what the enter_block(MD_BLOCK_HTML) tells the renderer is: Whatever you get in the text_callback should be treated as a raw HTML, until you see leave_block(MD_BLOCK_HTML). However it does not generally break the HTML into individual tags and does not even attempt any HTML validation of it beyond the Markdown specification, and it wouldn't (at least for the type 7) even after "fixing" this.

step- · 2024-01-08T19:36:47Z

You understood correctly but enlarged the scope to all HTML block types. I'm only concerned about type 2 HTML blocks. Not interested in type 7. Indeed PR #203 specifically uses the value of html_block_type (2) to restrict the scope of my change.

step- · 2024-01-08T19:53:10Z

My expectation is that it gets fixed for type 2 HTML blocks because cmark renders every HTML block comment as a separate block, and the application I'm developing on top of MD4C relies on the same behavior.

## Version 0.5.1 Changes: * LaTeX math extension (`MD_FLAG_LATEXMATHSPANS`) now requires that opener mark is not immediately preceded with alpha-numeric character and similarly that closer mark is not immediately followed with alpha-numeric character. So for example `foo$ x + y = z $` is not recognized as LaTeX equation anymore because there is no space between `foo` and the opening `$`. * Table extension (`MD_FLAG_TABLES`) now recognizes only tables with no more than 128 columns. This limit has been imposed to prevent a pathological case of quadratic output size explosion which could be used as DoS attack vector. * We are now more strict with `MD_FLAG_PERMISSIVExxxAUTOLINKS` family of extensions with respect to non-alphanumeric characters, with the aim to mitigate false positive detections. Only relatively few selected non-alphanumeric are now allowed in permissive e-mail auto-links (`MD_FLAG_PERMISSIVEEMAILAUTOLINKS`): - `.`, `-`, `_`, `+` in user name part of e-mail address; and - `.`, `-`, `_` in host part of the e-mail address. Similarly for URL and e-mail auto-links (`MD_FLAG_PERMISSIVEURLAUTOLINKS` and `MD_FLAG_PERMISSIVEWWWAUTOLINKS`): - `.`, `-`, `_` in host part of the URL; - `/`, `.`, `-`, `_` in path part of the URL; - `&`, `.`, `-`, `+`, `_`, `=`, `(`, `)` in the query part of the URL (additionally, if present, `(` and `)` must form balanced pairs); and - `.`, `-`, `+`, `_` in the fragment part of the URL. Furthermore these characters (with some exceptions like where they serve as delimiter characters, e.g. `/` for paths) are generally accepted only when an alphanumeric character both precedes and follows them (i.e. these cannot be "stacked" together). Fixes: * Fix several bugs where we haven't properly respected already resolved spans of higher precedence level in handling of permissive auto-links extensions (family of `MD_FLAG_PERMISSIVExxxAUTOLINKS` flags), LaTeX math extension (`MD_FLAG_LATEXMATHSPANS`) and wiki-links extension (`MD_FLAG_WIKILINKS`) of the form `[[label|text]]` (with pipe `|`). In some complex cases this could lead to invalid internal parser state and memory corruption. Identified with [OSS-Fuzz](https://github.com/google/oss-fuzz). * [#222](mity/md4c#222): Fix strike-through extension (`MD_FLAG_STRIKETHROUGH`) which did not respect same rules for pairing opener and closer marks as other emphasis spans. * [#223](mity/md4c#223): Fix incorrect handling of new-line character just at the beginning and/or end of a code span where we were not following CommonMark specification requirements correctly. ## Version 0.5.0 Changes: * Changes mandated by CommonMark specification 0.30. Actually there are only very minor changes to recognition of HTML blocks: - The tag `<textarea>` now triggers HTML block (of type 1 as per the specification). - HTML declaration (HTML block type 4) is not required to begin with an upper-case ASCII character after the `<!`. Any ASCII character is now allowed. Also it now doesn't require a whitespace before the closing `>`. Other than that, the newest specification mainly improves test coverage and clarifies its wording in some cases, without affecting the implementation. Refer to [CommonMark 0.30 notes](https://github.com/commonmark/commonmark-spec/releases/tag/0.30) for more info. * Make Unicode-specific code compliant to Unicode 15.1. * Update list of entities known to the HTML renderer from https://html.spec.whatwg.org/entities.json. New Features: * Add extension allowing to treat all soft break as hard ones. It has to be explicitly enabled with `MD_FLAG_HARD_SOFT_BREAKS`. Contributed by [l-m](https://github.com/l1mey112). * Structure `MD_SPAN_A_DETAIL` now has a new member `is_autolink`. Contributed by [Jens Alfke](https://github.com/snej). * `md2html` utility now supports command line options `--html-title` and `--html-css`. Contributed by [Andreas Baumann](https://github.com/andreasbaumann). Fixes: * [#163](mity/md4c#163): Make HTML renderer to emit `'\n'` after the root tag when in the XHTML mode. * [#165](mity/md4c#165): Make HTML renderer not to percent-encode `'~'` in URLs. Although it does work, it's not needed, and it can actually be confusing with URLs such as `http://www.example.com/~johndoe/`. * [#167](mity/md4c#167), [#168](mity/md4c#168): Fix multiple instances of various buffer overflow bugs, found mostly using a fuzz testing. Contributed by [dtldarek](https://github.com/dtldarek) and [Thierry Coppey](https://github.com/TCKnet). * [#169](mity/md4c#169): Table underline now does not require 3 characters per table column anymore. One dash (optionally with a leading or tailing `:` appended or prepended) is now sufficient. This improves compatibility with the GFM. * [#172](mity/md4c#172): Fix quadratic time behavior caused by unnecessary lookup for link reference definition even if the potential label contains nested brackets. * [#173](mity/md4c#173), [#174](mity/md4c#174), [#212](mity/md4c#212), [#213](mity/md4c#213): Multiple bugs identified with [OSS-Fuzz](https://github.com/google/oss-fuzz) were fixed. * [#190](mity/md4c#190), [#200](mity/md4c#200), [#201](mity/md4c#201): Multiple fixes of incorrect interactions of indented code block with a preceding block. * [#202](mity/md4c#202): We were not correctly calling `enter_block()` and `leave_block()` callbacks if multiple HTML blocks followed one after another; instead previously such blocks were merged into one. (This may likely impact only applications interested in Markdown's AST, and not just converting Markdown to other formats like HTML.) * [#210](mity/md4c#210): The `md2html` utility now handles nested images with optional titles correctly. * [#214](mity/md4c#214): Tags `<h2>` ... `<h6>` incorrectly did not trigger HTML block. * [#215](mity/md4c#215): The parser incorrectly did not accept optional tabs after setext header underline. * [#217](mity/md4c#217): The parser incorrectly resolved emphasis in some situations, if the emphasis marks were enclosed by punctuation characters.

step- mentioned this issue Jan 8, 2024

Fix leave callback not called for HTML block comment. #203

Closed

mity closed this as completed in 319631f Jan 8, 2024

mity added the bug label Jan 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leave callback not called for HTML block comment #202

Leave callback not called for HTML block comment #202

step- commented Jan 8, 2024 •

edited

Loading

mity commented Jan 8, 2024

step- commented Jan 8, 2024 •

edited

Loading

step- commented Jan 8, 2024

Leave callback not called for HTML block comment #202

Leave callback not called for HTML block comment #202

Comments

step- commented Jan 8, 2024 • edited Loading

mity commented Jan 8, 2024

step- commented Jan 8, 2024 • edited Loading

step- commented Jan 8, 2024

step- commented Jan 8, 2024 •

edited

Loading

step- commented Jan 8, 2024 •

edited

Loading