Documentation should cover access to raw AST #215

Closed
MostAwesomeDude opened this issue May 9, 2013 · 6 comments

Comments

@MostAwesomeDude

Hi,

Some templating systems do not allow interpolation of textual HTML tags, for security reasons. It should be trivial for those systems to still use Python-Markdown, by taking the raw AST and turning it into HTML on their own. This would be a useful stepping stone for supporting non-HTML formats as well.

@MostAwesomeDude
Author

I see that the AST in this case is an XML string. Whatever; it would still be of immense use to the outside world. Let's fix that so that the XML AST can be accessed as easily as the final HTML result.

@frontierventures

Yes, fix that please!

@waylan
Member

waylan commented May 10, 2013

Yes, you're right, this should be documented. In fact there are a few ways it could be done. You could create a subclass of the Markdown class and reimplement the set_output_format method. Or an even easier approach would be to create an extension which monkeypatches the class and assigns a new serializer (the Extension API gives you access to the entire class instance so you can do stuff like this).

Actually the AST is not an XML string, but an ElementTree object. The serializers (Markdown ships with two) convert that to a string. The issue is that postprocessors do additional processing after the serializer runs. Of course, those postprocessors assume HTML. And until a better way of parsing raw HTML is implemented, the postprocessors are necessary, but then again, raw HTML would need to be handled differently if the output format wasn't HTML. I guess that is why I suggest using an extension to monkeypatch the class: the same extension could replace the postprocessors to match the new output format.

And given the above complications, that is why I haven't really documented it yet. It's not just a matter of passing in a new serializer. That said, I've just updated the docs to at least list the relevant class attributes so anyone interested at least knows what part of the code to start looking at. Of course, that's not enough, I need to document everything in that list. I'll leave this issue open until that's done. Of course, patches (pull requests) are welcome.
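
To illustrate the distinction drawn above between the ElementTree object and its serialized string, here is a minimal sketch using only the standard library. The tree is built by hand as a stand-in for what the parser might produce for "Hello *world*", `ET.tostring` stands in for a serializer (it is not one of the two serializers Markdown ships with), and `to_nodes` is a hypothetical helper showing how a templating system could consume the tree without ever handling an HTML string.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for the kind of tree the parser produces (assumption:
# a top-level <div> wrapping a <p> with an inline <em>).
root = ET.Element("div")
p = ET.SubElement(root, "p")
p.text = "Hello "
em = ET.SubElement(p, "em")
em.text = "world"

# What a serializer does: turn the tree into a string.
html = ET.tostring(root, encoding="unicode", method="html")


def to_nodes(el):
    """Hypothetical helper: convert an element into (tag, attributes, children).

    Text stays as plain strings, so the consumer can escape it however it likes.
    """
    children = []
    if el.text:
        children.append(el.text)
    for child in el:
        children.append(to_nodes(child))
        if child.tail:
            children.append(child.tail)
    return (el.tag, dict(el.attrib), children)


nodes = to_nodes(root)
print(html)
print(nodes)
```

The nested tuples carry the same structure as the string, but nothing in them needs to be marked as "safe" by the consumer.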

@waylan
Member

waylan commented May 10, 2013

@MostAwesomeDude, I just reread your request. I seem to have missed the part about other templating systems accessing the AST directly. While I suppose a serializer could do this, the way things are implemented, it wouldn't be ideal. In fact, because the postprocessors (which work on the serialized string) are so tightly coupled with the rest of the parser, it's not very practical, or at least you potentially lose access to some significant features of Markdown's syntax. Actually, my overhaul of the serializer code involved quite a bit of decoupling of the various parts of the parser already. Prior to that, there was absolutely no way to access anything but the final string.

To address your issue directly, in my experience, most HTML templating systems offer a way to mark a string as "safe" so that it will be allowed to be inserted in the document unescaped (see Django for an example). If your system doesn't allow this, I'd suggest that that is an issue with your templating system. Of course, safety is important, and if you are accepting markdown text from untrusted sources you should be scrubbing it anyway. I recommend a tool like Bleach for such a case. Pass the output of markdown into bleach and pass the output of bleach as a "safe" string to your template.
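
The pipeline described above could be sketched roughly as follows. This assumes both the `markdown` and `bleach` packages are installed. The `ALLOWED_TAGS` whitelist is an illustrative assumption, not a recommendation from this thread, and the final step of marking the string as "safe" depends on the templating system, so it is only indicated in a comment.

```python
import bleach
import markdown

# Illustrative whitelist (an assumption); tune it to what your site should allow.
ALLOWED_TAGS = {"p", "em", "strong", "a", "ul", "ol", "li", "code", "pre", "blockquote"}

untrusted = "Hello *world*\n\n<script>alert(1)</script>"

# Step 1: Markdown to HTML. Raw HTML in the source passes through at this point.
html = markdown.markdown(untrusted)

# Step 2: scrub the HTML. Tags outside the whitelist are escaped by default.
cleaned = bleach.clean(html, tags=ALLOWED_TAGS)

# Step 3 (template-specific, not shown): hand `cleaned` to the template as a
# "safe" string using whatever mechanism your templating system provides.
print(cleaned)
```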

@MostAwesomeDude
Author

Well, I would not say that Chloe (http://docs.factorcode.org/content/article-html.templates.chloe.html), Hamlet (http://hackage.haskell.org/package/hamlet), or twisted.web.template (http://twistedmatrix.com/documents/current/web/howto/twisted-templates.html) is broken simply because it doesn't allow arbitrary safe strings.

Thanks for your time. I had apparently not dug deeply enough into the API documentation to realize what is offered in terms of extensibility. I'd say that the documentation I wanted to see has already been written!

@toabi

toabi commented Aug 26, 2014

Just for the record, because I needed this and found a fast way: What I tried first:

output_formats = { 'etree': lambda el: el }

But then issues arise because the end of convert() expects strings. So I did this instead:

I subclassed markdown.Markdown and overrode the convert method. Copy-pasting the original method, but deleting everything after the tree processors run and returning root, gets the job done.
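
For readers who want to try the subclass approach just described, here is a rough sketch. It assumes Python-Markdown 3.x, where `preprocessors` and `treeprocessors` can be iterated directly; the class name `TreeMarkdown` is hypothetical, and the body is a simplified paraphrase of the early stages of `convert()`, not a verbatim copy of the library's method.

```python
import markdown


class TreeMarkdown(markdown.Markdown):
    """Hypothetical subclass whose convert() returns the ElementTree root."""

    def convert(self, source):
        # Early stages as in the stock method: preprocessors, then block parsing.
        self.lines = str(source).split("\n")
        for prep in self.preprocessors:
            self.lines = prep.run(self.lines)
        root = self.parser.parseDocument(self.lines).getroot()

        # Run the tree processors (inline patterns and so on); each one may
        # return a replacement root or None.
        for treeprocessor in self.treeprocessors:
            new_root = treeprocessor.run(root)
            if new_root is not None:
                root = new_root

        # Stop here: the serializer and the postprocessors never run.
        return root


root = TreeMarkdown().convert("Hello *world*")
print(root.tag, [child.tag for child in root])
```

Because the postprocessors are skipped, anything they normally restore, such as raw HTML from the source, stays in the tree as placeholders, which is the caveat raised earlier in this thread.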
