Nested HTML inside block_html is escaped when escape=False, parse_block_html=True #81

Closed
tdivis opened this issue Dec 4, 2015 · 17 comments

@tdivis

tdivis commented Dec 4, 2015

Normally, with escape=False, a nested HTML block is correctly left unescaped:

>>> print markdown('<div id="special-part"><div class="subsection">text</div></div>', escape=False)
<div id="special-part"><div class="subsection">text</div></div>

But when I add parse_block_html=True, only the outermost element is left unescaped and everything nested inside it is escaped:

>>> print markdown('<div id="special-part"><div class="subsection">text</div></div>', escape=False, parse_block_html=True)
<div id="special-part">&lt;div class="subsection"&gt;text&lt;/div&gt;</div>
@lepture
Owner

lepture commented May 22, 2016

Actually, parse_block_html=True and escape=False is a wrong combination.

@tdivis
Author

tdivis commented Oct 19, 2016

Could you explain in more detail why that is a wrong combination?

I need exactly that: markdown inside an HTML block, possibly nested. Currently it only works for HTML blocks that are not nested.

Another example: the markdown inside the table cell is interpreted correctly (as <strong>), but the nested tags are incorrectly escaped:

>>> print markdown('<table class="special"><tr><td>**text**</td></tr></table>', escape=False, parse_block_html=True)
<table class="special">&lt;tr&gt;&lt;td&gt;<strong>text</strong>&lt;/td&gt;&lt;/tr&gt;</table>

@lepture
Owner

lepture commented Oct 20, 2016

@tdivis try the latest code.

@tdivis
Author

tdivis commented Nov 1, 2016

My examples work fine now, thanks :-). But I'm afraid this change broke some escaping elsewhere, as these tests now fail:

ERROR: test_cases.test_normal('/home/glin/bin/mistune/tests/fixtures/normal', 'markdown_documentation_syntax')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/glin/bin/mistune/tests/test_cases.py", line 30, in render
    raise ValueError(msg)
ValueError:

ters.Ifyouwanttowriteabout'AT&T',youneedtowrite'<code>AT&amp
------Not Equal(4958)------
ters.Ifyouwanttowriteabout'AT&amp;T',youneedtowrite'<code>AT

======================================================================
ERROR: test_cases.test_normal('/home/glin/bin/mistune/tests/fixtures/normal', 'amps_and_angles_encoding')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/glin/bin/mistune/tests/test_cases.py", line 30, in render
    raise ValueError(msg)
ValueError:

<p>AT&Thasanampersandintheirname.</p
------Not Equal(6)------
<p>AT&amp;Thasanampersandintheirname

@teoguso

teoguso commented Nov 27, 2016

@lepture, could you elaborate on why that combination is wrong? I believe this issue is related to a problem in Jupyter/nbconvert, where html tags inside markdown cells are not properly parsed.

Here's the issue: jupyter/nbconvert#328

@lepture
Owner

lepture commented Nov 28, 2016

@teoguso my mistake. It should be fixed on the master branch.

@teoguso

teoguso commented Nov 28, 2016

@lepture thanks for the quick reply. I tested the latest master and the nbconvert issue is still there. My temporary solution is to downgrade to version 0.7.2. I'm sorry I can't be more helpful.

@lepture
Owner

lepture commented Nov 28, 2016

@teoguso could you add a minimized test case for your issue?

@teoguso

teoguso commented Nov 28, 2016

@lepture Let me know if this helps: https://github.com/teoguso/nbconvert-mistune-test

@deroulers

@teoguso I think your issue is due to commit 2a33458, which changed the regexp that mistune uses to parse HTML attributes. Namely, the new regexp in mistune 0.7.3 no longer matches unquoted HTML attribute values, whereas the old regexp did. Therefore, any HTML tag with an attribute value that is not wrapped in single or double quotes gets recognized as text instead of inline_html and gets (inappropriately) escaped.

For instance: <a href=foo> is treated by mistune 0.7.3 as text and < is escaped as &lt;. But <a href="foo"> is correctly handled. In your minimized test case, the workaround is to put quotes around 700.

Workaround: systematically put quotes around HTML attribute values, even though the W3C standards do not require it.
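
If the HTML comes from somewhere you cannot edit by hand, the workaround could in principle be applied as a preprocessing step before the text reaches mistune. The following is only a rough sketch using Python's `re` module: `quote_attributes` and its helper patterns are hypothetical names, not part of mistune or nbconvert, and the tag pattern assumes no attribute value contains `<` or `>`.

```python
import re

# Hypothetical helper for illustration; not part of mistune or nbconvert.
# Matches an opening tag, assuming no '<' or '>' inside attribute values.
_TAG_RE = re.compile(r'<[a-zA-Z][^<>]*>')

# Matches name=value where value is double-quoted, single-quoted, or unquoted.
# Quoted values are matched whole so that text inside them is never rewritten.
_ATTR_RE = re.compile(r'''(\s[a-zA-Z\-]+)=("[^"]*"|'[^']*'|[^\s"'=<>`]+)''')


def _quote_value(match):
    """Wrap an unquoted attribute value in double quotes; keep quoted ones."""
    name, value = match.group(1), match.group(2)
    if value[0] in '"\'':
        return match.group(0)
    return '%s="%s"' % (name, value)


def quote_attributes(html):
    """Quote every unquoted attribute value found inside opening tags."""
    return _TAG_RE.sub(lambda m: _ATTR_RE.sub(_quote_value, m.group(0)), html)


print(quote_attributes('<img src=a.png width=700>'))
print(quote_attributes('<a href="foo">text</a>'))
```

Text outside of tags is left alone, so something like `x=1` in running prose is not touched. A regexp cannot cover every valid HTML document, so this is a stopgap rather than a replacement for fixing the parser.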

@lepture You might want to change line 38 of mistune.py to, e.g.:
_valid_attr = r'''\s*[a-zA-Z\-](?:\=(?:"[^"]*"|'[^']*'|[^\s'">]+))*'''
This will still parse a subset of valid HTML, though.
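
To make the difference concrete, here is a minimal, self-contained sketch that compares a quoted-only attribute pattern with one that also accepts unquoted values. The two patterns and the `is_open_tag` helper are simplified, hypothetical stand-ins written for this illustration; they are not copied from mistune's source, and the example tags are made up.

```python
import re

# Hypothetical patterns for illustration; they are NOT mistune's own regexps.
# Quoted-only: an attribute value must be wrapped in single or double quotes.
QUOTED_ONLY_ATTR = r'''\s+[a-zA-Z\-]+(?:=(?:"[^"]*"|'[^']*'))?'''
# Permissive: additionally accepts an unquoted value such as href=foo.
PERMISSIVE_ATTR = r'''\s+[a-zA-Z\-]+(?:=(?:"[^"]*"|'[^']*'|[^\s'">]+))?'''


def is_open_tag(text, attr_pattern):
    """Return True if text is exactly one opening tag under attr_pattern."""
    tag_re = re.compile(r'<[a-zA-Z][a-zA-Z0-9]*(?:%s)*\s*/?>' % attr_pattern)
    return tag_re.fullmatch(text) is not None


for tag in ('<a href="foo">', '<a href=foo>', '<img width=700>'):
    print(tag,
          is_open_tag(tag, QUOTED_ONLY_ATTR),
          is_open_tag(tag, PERMISSIVE_ATTR))
```

With the quoted-only pattern, `<a href=foo>` is not recognized as a tag at all, which is the situation in which a parser would fall back to treating it as text and escaping it.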

@lepture
Owner

lepture commented Jan 14, 2017

@deroulers Thanks. Here is the fix: f7b5239

@teoguso

teoguso commented Feb 10, 2017

@lepture @deroulers thanks! I confirm that the latest master now works without having to use quotes. Do we have to wait for 0.7.4 to get it on pip/conda by default? Cheers!

@danzimmerman

I am having an ongoing issue, similar to @teoguso's, with tags in jupyter/nbconvert even with mistune 0.7.4. If there are spaces around the equals sign of an HTML attribute, the tag is not parsed.

I assume this is related to the quotation mark issue.

@teoguso

teoguso commented Jun 7, 2017

@danzimmerman I think standard HTML does not support spaces around the equal sign. This is not a bug I believe.

@mpacer
Contributor

mpacer commented Jun 30, 2017

@teoguso @lepture It actually does, even if it's a bad practice. So, right now it is not following the HTML5 spec:

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.

source (emphasis mine)

The same restrictions & allowances (more or less) also apply to single quoted and double quoted values.

NB: There is this SO post on why it might be bad practice.
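
As a rough illustration of the syntax quoted above, the sketch below compares an attribute pattern that requires `=` to sit directly between name and value with one that allows whitespace on either side. The unquoted-value character class follows the exclusions listed in the quote. Both patterns and the `tag_matches` helper are hypothetical and simplified; they are not mistune's actual regexps.

```python
import re

# Hypothetical attribute patterns for illustration; not mistune's source.
# Unquoted values exclude whitespace and the characters " ' = < > `
# STRICT: no whitespace allowed around the equals sign.
STRICT_ATTR = r'''\s+[a-zA-Z\-]+(?:=(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?'''
# SPACED: zero or more whitespace characters on each side of the equals sign.
SPACED_ATTR = r'''\s+[a-zA-Z\-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?'''


def tag_matches(text, attr_pattern):
    """Return True if text is exactly one opening tag under attr_pattern."""
    tag_re = re.compile(r'<[a-zA-Z][a-zA-Z0-9]*(?:%s)*\s*/?>' % attr_pattern)
    return tag_re.fullmatch(text) is not None


for tag in ('<img width=700>', '<img width = 700>', '<img width = "700">'):
    print(tag, tag_matches(tag, STRICT_ATTR), tag_matches(tag, SPACED_ATTR))
```

Only the second pattern accepts `<img width = 700>`, which matches the behavior described in this thread: a tag with spaces around the equals sign is valid HTML, but a parser using the stricter form does not recognize it.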

@teoguso

teoguso commented Jul 3, 2017

@mpacer I see. Thanks, that's good to know! I guess the HTML landscape can be a bit messy...

@lepture
Owner

lepture commented Sep 15, 2018

Fixed in 0.8.4

lepture closed this as completed Sep 15, 2018