Nested HTML inside block_html is escaped when escape=False, parse_block_html=True #81

tdivis · 2015-12-04T08:40:55Z

Normally, with escape=False, a nested HTML block is correctly left unescaped:

>>> print markdown('<div id="special-part"><div class="subsection">text</div></div>', escape=False)
<div id="special-part"><div class="subsection">text</div></div>

But when I add parse_block_html=True, only the outermost element is left unescaped and everything nested inside it is escaped:

>>> print markdown('<div id="special-part"><div class="subsection">text</div></div>', escape=False, parse_block_html=True)
<div id="special-part">&lt;div class="subsection"&gt;text&lt;/div&gt;</div>
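
A regression check for this symptom can be sketched with the standard library alone. The helper name `escaped_tags` is hypothetical and not part of mistune; it only scans rendered output for tag-like text that has been entity-escaped, using the two outputs reported above as sample data.

```python
import re

# Matches entity-escaped opening or closing tags such as "&lt;div" or "&lt;/div".
# Hypothetical helper for illustration; not part of mistune.
_ESCAPED_TAG = re.compile(r'&lt;(/?)([A-Za-z][A-Za-z0-9]*)')


def escaped_tags(rendered):
    """Return the tag names that appear entity-escaped in rendered HTML."""
    return [('/' if slash else '') + name
            for slash, name in _ESCAPED_TAG.findall(rendered)]


# The two outputs reported above.
good = '<div id="special-part"><div class="subsection">text</div></div>'
bad = '<div id="special-part">&lt;div class="subsection"&gt;text&lt;/div&gt;</div>'

print(escaped_tags(good))  # []
print(escaped_tags(bad))   # ['div', '/div']
```

An empty list means no nested tag was escaped; any entry points at a tag that was turned into text.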


lepture · 2016-05-22T15:46:22Z

Actually, parse_block_html=True and escape=False is a wrong combination.

tdivis · 2016-10-19T16:58:45Z

Could you explain why that is a wrong combination?

That is exactly what I need: markdown inside an HTML block, possibly a nested one. At the moment it only works for HTML blocks that are not nested.

Another example: the markdown inside the table cell is interpreted correctly (rendered as <strong>), but the nested tags are incorrectly escaped:

>>> print markdown('<table class="special"><tr><td>**text**</td></tr></table>', escape=False, parse_block_html=True)
<table class="special">&lt;tr&gt;&lt;td&gt;<strong>text</strong>&lt;/td&gt;&lt;/tr&gt;</table>

lepture · 2016-10-20T02:20:44Z

@tdivis try the latest code.

tdivis · 2016-11-01T17:17:07Z

My examples work fine now, thanks :-). But I'm afraid this change broke some escaping elsewhere, as these tests now fail:

ERROR: test_cases.test_normal('/home/glin/bin/mistune/tests/fixtures/normal', 'markdown_documentation_syntax')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/glin/bin/mistune/tests/test_cases.py", line 30, in render
    raise ValueError(msg)
ValueError:

ters.Ifyouwanttowriteabout'AT&T',youneedtowrite'<code>AT&amp
------Not Equal(4958)------
ters.Ifyouwanttowriteabout'AT&amp;T',youneedtowrite'<code>AT

======================================================================
ERROR: test_cases.test_normal('/home/glin/bin/mistune/tests/fixtures/normal', 'amps_and_angles_encoding')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/glin/bin/mistune/tests/test_cases.py", line 30, in render
    raise ValueError(msg)
ValueError:

<p>AT&Thasanampersandintheirname.</p
------Not Equal(6)------
<p>AT&amp;Thasanampersandintheirname
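
Both failures concern bare ampersands: reading the first line of each pair as the actual output and the second as the fixture, `AT&T` in ordinary text is expected to come out as `AT&amp;T`. That expectation can be sketched as follows. `escape_bare_amp` is a hypothetical stand-in, not mistune's own function, and the entity pattern is an assumption made for illustration.

```python
import re

# A bare ampersand is one that does not start an entity such as "&amp;" or "&#169;".
# Assumed pattern, for illustration only.
_BARE_AMP = re.compile(r'&(?!#?\w+;)')


def escape_bare_amp(text):
    """Escape ampersands that are not already part of an entity."""
    return _BARE_AMP.sub('&amp;', text)


print(escape_bare_amp('AT&T has an ampersand in their name.'))
# AT&amp;T has an ampersand in their name.
print(escape_bare_amp('AT&amp;T is already escaped; &copy; stays too.'))
# AT&amp;T is already escaped; &copy; stays too.
```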

teoguso · 2016-11-27T13:10:43Z

@lepture, could you elaborate on why that combination is wrong? I believe this issue is related to a problem in Jupyter/nbconvert, where html tags inside markdown cells are not properly parsed.

Here's the issue: jupyter/nbconvert#328

lepture · 2016-11-28T00:51:47Z

@teoguso my mistake. It should be fixed on the master branch.

teoguso · 2016-11-28T07:42:17Z

@lepture thanks for the quick reply. I tested the latest master and the nbconvert issue is still there. My temporary solution is to downgrade to version 0.7.2. I'm sorry I can't be more helpful.

lepture · 2016-11-28T08:24:30Z

@teoguso could you add a minimized test case for your issue?

teoguso · 2016-11-28T08:53:07Z

@lepture Let me know if this helps: https://github.com/teoguso/nbconvert-mistune-test

deroulers · 2017-01-13T23:03:42Z

@teoguso I think your issue is due to commit 2a33458, which changed the regexp mistune uses to parse HTML attributes. Namely, the new regexp in mistune 0.7.3 no longer matches unquoted HTML attributes, whereas the old regexp did. Therefore, any HTML tag with an unquoted attribute value (no single or double quotes) is recognized as text instead of inline_html and gets (inappropriately) escaped.

For instance: <a href=foo> is treated by mistune 0.7.3 as text, and < is escaped as &lt;. But <a href="foo"> is correctly handled. In your minimized test case, the workaround is to put quotes around 700.

Workaround: systematically put quotes around HTML attribute values, even though the W3C standards do not require them.

@lepture You might want to change line 38 of mistune.py to, e.g.:
_valid_attr = r'''\s*[a-zA-Z\-](?:\=(?:"[^"]*"|'[^']*'|[^\s'">]+))*'''
This will still parse a subset of valid HTML, though.
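
The proposed pattern can be tried in isolation with `re`. In the sketch below, the wrapping tag pattern `_open_tag` and the helper `is_tag` are simplified assumptions for illustration, not the expressions mistune actually uses.

```python
import re

# Pattern proposed above, copied verbatim.
_valid_attr = r'''\s*[a-zA-Z\-](?:\=(?:"[^"]*"|'[^']*'|[^\s'">]+))*'''

# Simplified, hypothetical opening-tag pattern built around it.
_open_tag = re.compile(r'^<\w+(?:%s)*\s*/?>$' % _valid_attr)


def is_tag(text):
    """Report whether text is matched as a single opening tag."""
    return _open_tag.match(text) is not None


for sample in ['<a href="foo">', "<a href='foo'>", '<a href=foo>', '<img width=700>']:
    print(sample, is_tag(sample))
# <a href="foo"> True
# <a href='foo'> True
# <a href=foo> True
# <img width=700> True
```

Note that `[a-zA-Z\-]` matches a single character, so a name such as `href` is consumed one letter at a time by the repetition around `_valid_attr`.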

lepture · 2017-01-14T01:56:06Z

@deroulers Thanks. Here is the fix: f7b5239

teoguso · 2017-02-10T12:23:32Z

@lepture @deroulers thanks! I confirm that the latest master now works without having to use quotes. Do we have to wait for 0.7.4 to get it on pip/conda by default? Cheers!

danzimmerman · 2017-03-20T20:51:36Z

I am having an ongoing issue, similar to @teoguso's, with tags in jupyter/nbconvert, even with mistune 0.7.4. If there are spaces around the equals sign of an HTML attribute, the tag is not parsed.

I assume this is related to the quotation mark issue.

teoguso · 2017-06-07T13:49:40Z

@danzimmerman I think standard HTML does not support spaces around the equals sign, so I don't believe this is a bug.

mpacer · 2017-06-30T19:42:18Z

@teoguso @lepture It actually does, even if it's a bad practice. So, right now it is not following the HTML5 spec:

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.

source (emphasis mine)

The same restrictions & allowances (more or less) also apply to single quoted and double quoted values.

NB: There is this SO post on why it might be bad practice.
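
A pattern that tolerates the optional whitespace described in the quoted passage could look like the sketch below. `ATTR` and `OPEN_TAG` are hypothetical names, this is not the fix that shipped in mistune, and the pattern covers only a subset of the HTML5 attribute syntax.

```python
import re

# Attribute name, then optionally: spaces, "=", spaces, and a
# double-quoted, single-quoted, or unquoted value.
ATTR = r'''\s+[a-zA-Z][\w\-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?'''

# Hypothetical opening-tag pattern built from ATTR.
OPEN_TAG = re.compile(r'^<\w+(?:%s)*\s*/?>$' % ATTR)

samples = ['<img width=700>', '<img width = 700>',
           '<img width = "700">', '<img width = >']
for tag in samples:
    print(tag, bool(OPEN_TAG.match(tag)))
# <img width=700> True
# <img width = 700> True
# <img width = "700"> True
# <img width = > False
```

The last sample is rejected because an unquoted value must not be the empty string.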

teoguso · 2017-07-03T06:54:54Z

@mpacer I see. Thanks, that's good to know! I guess the HTML landscape can be a bit messy...

lepture · 2018-09-15T03:19:30Z

Fixed in 0.8.4

tdivis mentioned this issue Dec 4, 2015

Content of block_html should be parsed by BlockLexer #82

Closed

teoguso mentioned this issue Nov 27, 2016

Convert notebook to Static HTML -- Markdown cells with html image references not viewable jupyter/nbconvert#328

Closed

deroulers referenced this issue Jan 13, 2017

Fix html attribute regex. Close #99

2a33458

jvanasco mentioned this issue Sep 6, 2017

nested html links are broken with parse_block_html #137

Closed

lepture closed this as completed Sep 15, 2018
