Documentation should cover access to raw AST #215

MostAwesomeDude · 2013-05-09T14:13:21Z

Hi,

Some templating systems do not allow interpolation of textual HTML tags, for security reasons. It should be trivial for those systems to still use Python-Markdown, by taking the raw AST and turning it into HTML on their own. This would be a useful stepping stone for supporting non-HTML formats as well.

MostAwesomeDude · 2013-05-09T14:23:19Z

I see that the AST in this case is an XML string. Whatever; it would still be of immense use to the outside world. Let's fix that so that the XML AST can be accessed as easily as the final HTML result.

frontierventures · 2013-05-09T15:14:29Z

Yes, fix that please!

waylan · 2013-05-10T02:58:20Z

Yes, you're right, this should be documented. In fact there are a few ways it could be done. You could create a subclass of the Markdown class and reimplement the set_output_format method. Or an even easier approach would be to create an extension which monkeypatches the class and assigns a new serializer (the Extension API gives you access to the entire class instance so you can do stuff like this).

Actually the AST is not an XML string, but an ElementTree object. The serializers (Markdown ships with two) convert that to a string. The issue is that postprocessors do additional processing after the serializer runs. Of course, those postprocessors assume HTML. And until a better way of parsing raw HTML is implemented, the postprocessors are necessary - but then again, raw HTML would need to be handled differently if the output format wasn't HTML. I guess that is why I suggest using an extension to monkeypatch the class - the same extension could replace the postprocessors to match the new output format.
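To make the serializer idea concrete, here is a minimal sketch of a function that takes an ElementTree element and returns a string in a non-HTML format (plain text). It uses only the standard library and a hand-built tree, so it runs without Python-Markdown. The name `to_plain_text` and the `BLOCK_TAGS` set are hypothetical, and wiring such a function in as the serializer (and replacing the HTML-specific postprocessors) is not shown.

```python
import xml.etree.ElementTree as etree

# Hypothetical choice of tags to render as separate paragraphs.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "blockquote", "pre"}


def to_plain_text(element):
    """Serialize an ElementTree element to plain text instead of HTML."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        parts.append(to_plain_text(child))
        # Text that follows a child element is stored on the child's tail.
        if child.tail:
            parts.append(child.tail)
    text = "".join(parts)
    # Block-level elements end with a blank line; inline ones pass through.
    if element.tag in BLOCK_TAGS:
        text = text.strip() + "\n\n"
    return text


# Hand-built tree standing in for the parser's output.
root = etree.Element("div")
heading = etree.SubElement(root, "h1")
heading.text = "Title"
para = etree.SubElement(root, "p")
para.text = "Some "
em = etree.SubElement(para, "em")
em.text = "emphasised"
em.tail = " text."

plain = to_plain_text(root)
print(plain)
```

The same recursive walk could emit any other target format; only the per-tag handling would change.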

And given the above complications, that is why I haven't really documented it yet. It's not just a matter of passing in a new serializer. That said, I've just updated the docs to at least list the relevant class attributes so anyone interested at least knows what part of the code to start looking at. Of course, that's not enough, I need to document everything in that list. I'll leave this issue open until that's done. Of course, patches (pull requests) are welcome.

waylan · 2013-05-10T03:27:09Z

@MostAwesomeDude, I just reread your request. I seem to have missed the part about other templating systems accessing the AST directly. While I suppose a serializer could do this, the way things are implemented, it wouldn't be ideal. In fact, because the postprocessors (which work on the serialized string) are so tightly coupled with the rest of the parser, it's not very practical - or at least you potentially lose access to some significant features of Markdown's syntax. Actually, my overhaul of the serializer code involved quite a bit of decoupling of the various parts of the parser already. Prior to that, there was absolutely no way to access anything but the final string.

To address your issue directly, in my experience, most HTML templating systems offer a way to mark a string as "safe" so that it can be inserted in the document unescaped (see Django for an example). If your system doesn't allow this, I'd suggest that that is an issue with your templating system. Of course, safety is important, and if you are accepting Markdown text from untrusted sources you should be scrubbing it anyway. I recommend a tool like Bleach for such a case. Pass the output of Markdown into Bleach and pass the output of Bleach as a "safe" string to your template.
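A minimal sketch of that pipeline, assuming the `markdown` and `bleach` packages are installed. The allow-lists and the function name `render_untrusted` are hypothetical choices for illustration and should be adapted to what your site actually permits.

```python
import bleach
import markdown

# Hypothetical allow-lists; anything not listed is escaped by bleach.clean().
ALLOWED_TAGS = {
    "p", "em", "strong", "a", "ul", "ol", "li",
    "code", "pre", "blockquote", "h1", "h2", "h3",
}
ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}


def render_untrusted(text):
    """Convert untrusted Markdown to HTML, then scrub the result."""
    html = markdown.markdown(text)
    return bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES)


cleaned = render_untrusted("Hello <script>alert(1)</script> *world*")
print(cleaned)
# The cleaned string is what would be marked "safe" for the template,
# for example with django.utils.safestring.mark_safe in Django.
```

The order matters: scrubbing happens after conversion, so the allow-list is applied to the HTML that will actually reach the page.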

MostAwesomeDude · 2013-05-10T21:48:18Z

Well, I would not say that any of Chloe (http://docs.factorcode.org/content/article-html.templates.chloe.html), Hamlet (http://hackage.haskell.org/package/hamlet), or twisted.web.template (http://twistedmatrix.com/documents/current/web/howto/twisted-templates.html) are broken simply because they don't allow arbitrary safe strings.

Thanks for your time. I had apparently not dug deeply enough into the API documentation to realize what is offered in terms of extensibility. I'd say that the documentation I wanted to see has already been written!

toabi · 2014-08-26T10:18:40Z

Just for the record, because I needed this and found a fast way: What I tried first:

output_formats = { 'etree': lambda el: el }

But then issues arise because the end of convert() expects strings. So this is what I did instead:

I subclassed markdown.Markdown and overrode the convert method. I copy-pasted the original method, deleted everything after the tree-processors run, and returned root. That gets the job done.
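A sketch of what such a subclass might look like. It assumes Python-Markdown 3.x, where `preprocessors` and `treeprocessors` can be iterated directly (older 2.x releases iterate them differently); the class name `TreeMarkdown` is hypothetical.

```python
import markdown


class TreeMarkdown(markdown.Markdown):
    """Markdown subclass whose convert() returns the ElementTree root."""

    def convert(self, source):
        # Same first steps as Markdown.convert() in Python-Markdown 3.x.
        self.lines = str(source).split("\n")
        for prep in self.preprocessors:
            self.lines = prep.run(self.lines)

        # Parse the block-level structure into an ElementTree.
        root = self.parser.parseDocument(self.lines).getroot()

        # Run the tree processors (inline patterns, prettify, ...).
        for treeprocessor in self.treeprocessors:
            new_root = treeprocessor.run(root)
            if new_root is not None:
                root = new_root

        # Stop here: no serializer and no postprocessors are run.
        return root


root = TreeMarkdown().convert("# Hello\n\nSome *text*.")
tags = [child.tag for child in root]
print(root.tag, tags)
```

Because the postprocessors never run, anything they normally restore (such as stashed raw HTML) stays in the tree as placeholder text, which is the loss of features mentioned earlier in the thread.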

MostAwesomeDude closed this as completed May 10, 2013

mthuurne mentioned this issue Jun 17, 2019

Abandon or Modify ElementTree? #420

Closed
