Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generic syntax for extensions/plugins #99

Open
SimonSapin opened this issue Aug 9, 2014 · 32 comments
Open

Add generic syntax for extensions/plugins #99

SimonSapin opened this issue Aug 9, 2014 · 32 comments

Comments

@SimonSapin
Copy link

Rust uses Hoedown for its documentation. We would like to add a bunch of new features (cross-references, math, …) that would need to be represented somehow in Markdown source. Rather than inventing new punctuation-based syntax for each of them, I believe (as discussed here) that we should only have one additional inline-level syntax and one block-level that each include the name of the "plugin". The built-in HTML renderer would not implement any plugin, it’s up to rustdoc and other users to do that in a custom renderer.

So, proposal. I’m not attached to the syntax or the naming, feel free to change them as you like. I just picked something similar to reStructuredText.

New inline (span) syntax for a "role": :role-name:content. Like in code spans, backticks can be repeated and whitespace is trimmed: `:foo: ` ``` (the content here is a single backtick.)

New block-level syntax for a "directive":

.. directive-name:: header
    indented content

    … that goes on until the end of the indentation, like unfenced code blocks.

Renderers would have two new additional callbacks.

struct hoedown_renderer {
    // ...
    void (*directive)(hoedown_buffer *ob,  const hoedown_buffer *name,  const hoedown_buffer *header, const hoedown_buffer *text,void *opaque);
    int (*role)(hoedown_buffer *ob, const hoedown_buffer *name, const hoedown_buffer *text, void *opaque);
}

The default behavior when the callback is NULL or returns NULL/0 would be as usual: directives are skipped and roles are printed verbatim. The built-in HTML renderer would not provide these callbacks.

Does this sound like a good idea?

@mildsunrise
Copy link
Member

I love how the content is specified inside codespans and codeblocks; so even if the syntax is not supported, you still get a nice codeblock as fallback:

See :ref:hoedown_document_free(hoedown_document *doc) for more info.

I'd like to suggest something: for the directive syntax, could make it use the fenced code block syntax? Something like:

~~~ directive-name:: header
Directive content goes here.
~~~

This would be trivial to parse, and again if the plugin is not supported you get a simple codeblock:

Directive content goes here.

Side note: It'd actually be a good idea to have math in its own extension, between $dollars$ for similarity to TeX. Mathematics StackExchange do it this way. Cross references are a more specific feature, though.

@SimonSapin
Copy link
Author

Yeah, fenced code blocks could be used for block-level directives, but kinda feels like an abuse. And maybe not as nice as the reStructuredText syntax for directives that are not as "content-oriented" as code blocks. See for example this source documentation using autodoc: https://raw.githubusercontent.com/mitsuhiko/flask/master/docs/api.rst

Yes, math might be worth it’s own syntax. But rendering math is non-obvious, and so maybe out of scope for Hoedown?

@devinus
Copy link
Member

devinus commented Aug 9, 2014

Well frequently requested extensions (as long as they remained maintained) are never out of scope for Hoedown.

@SimonSapin
Copy link
Author

To clarify the proposed directive syntax: both header and content are optional.

.. include:: README.md
.. table-of-contents::

@mildsunrise
Copy link
Member

But rendering math is non-obvious, and so maybe out of scope for Hoedown?

Just like fenced code blocks. The callback can highlight the code, or leave that to the client or postprocessor. The HTML renderer does the latter, but it can be overriden.

In the case of math, the most appropiate for HTML would be to output it without changes, so that MathJax can display it.

@mildsunrise
Copy link
Member

And maybe not as nice as the reStructuredText syntax for directives that are not as "content-oriented" as code blocks.

You're right. So, if we use this syntax, could we enforce a blank line between the directive header and the content? i.e. this would be invalid:

.. directive-name:: header
    indented content

And should be written as:

.. directive-name:: header

    indented content

That way, we get a nice codeblock if the extension's not present.

@SimonSapin
Copy link
Author

I don’t think it is necessary to have the syntax be compatible a Markdown parser that doesn’t know about directives. The Hoedown parser would always be the same, it’s only different renderers that would do different things with directives. The built-in one could render them the same as code blocks.

@mildsunrise
Copy link
Member

I think it'd make a good effect. For instance, a README.md file containing roles and directives would be displayed by Github in a neat way. IMO it's important for a generic syntax to have fallback.

Roles are already compatible, and directives only need a blank line before the content (which you always use on that file you've linked).

@SimonSapin
Copy link
Author

You can always use a blank line if you want, I don’t know if it has to be required. reStructuredText separates the indented text into parameters (before the first blank line) and content (after):

.. autoclass:: Request
   :members:

   .. attribute:: form

      A :class:`~werkzeug.datastructures.MultiDict` with the parsed form data from `POST`
      or `PUT` requests.  Please keep in mind that file uploads will not
      end up here,  but instead in the :attr:`files` attribute.

But I don’t think we need that at the Hoedown parser level. Specific can always parse the content with their own syntax.

@mildsunrise
Copy link
Member

So, rst requires a blank line to be present to separate the content from {directive, header, parameters}?

@SimonSapin
Copy link
Author

I don’t know, the general documentation for directive syntax doesn’t mention it, so it might be specific to some directives:

http://docutils.sourceforge.net/docs/ref/rst/directives.html

Directives have the following syntax:

+-------+-------------------------------+
| ".. " | directive type "::" directive |
+-------+ block                         |
        |                               |
        +-------------------------------+

@mildsunrise
Copy link
Member

Hmm, I understand...

@mildsunrise
Copy link
Member

But there's something I still don't understand: how should the parser treat the content inside directives?
Should it parse it as Markdown? Should it parse it looking for nesting directives?

@SimonSapin
Copy link
Author

It should pass the raw (dedented) source to the renderer callback. Additionally, the callback should have the ability to parse that content as Markdown (including nested directives), but it doesn’t have to.

@mildsunrise
Copy link
Member

Additionally, the callback should have the ability to parse that content as Markdown (including nested directives), but it doesn’t have to.

Well, that ability would require exposing the nuts and guts of the parser (aka parse_block and parse_inline) since calling hoedown_document_render when a render is in progress will result in undefined behaviour.

@mildsunrise
Copy link
Member

That, or the renderer should itself keep another hoedown_document to render content with.

@SimonSapin
Copy link
Author

There are apparently multiple possible approaches to parsing nested things for a renderer callback, but I don’t know enough about Hoedown’s internals to have an opinion on which is preferable.

@mildsunrise
Copy link
Member

?? AFAIK no nested parsing can / is currently done at the renderer.

@SimonSapin
Copy link
Author

Yes, that’s what I mean. Support for this would need to be added somehow.

@mildsunrise
Copy link
Member

And that would mean exposing internal parser methods publicly. I'm not sure about that.

@SimonSapin
Copy link
Author

I don’t know how this would be achieved.

@uranusjr
Copy link

If contents inside a directive block need to be parsed, we would need a parsing function to override, not a rendering one—and then you’ll have issues like #94, and the following would need to be answered:

  1. Should Hoedown make its parser extendable? If the answer is no, things end here.
  2. Are the current parser internals suitable to be made public as API? If yes, we’re good. But IMO the answer is definitely no.
  3. How much internals should be made public? parse_block and parse_inline would be enough to “parse the content as Markdown”, but many parsers expose quite a lot, including docutils, reST’s reference implementation.
  4. How should they be exposed? Simply make everything non-static is not a good idea, but we’ll need some way to abstract out at least some the parsing part to keep internal things internal. This is not impossible, as many other parsers do that, but much work would need to be done before we get to the custom directive part.

@mildsunrise
Copy link
Member

If contents inside a directive block need to be parsed, we would need a parsing function to override, not a rendering one—and then you’ll have issues like #94,

Yes, in the sense that you need access to Hoedown's internals.

Should Hoedown make its parser extendable? If the answer is no, things end here.

You're not really adding new syntaxes to the parser, you have a fixed, generic syntax and you're parsing the content inside that syntax -- be it Markdown or not. Pretty much like you'd parse a codeblock's text to highlight it.

So no, technically you don't need to extend the parser to make it parse roles, you need to override the renderer to parse your custom roles content. Not the best place to do it, but...

How much internals should be made public? parse_block and parse_inline would be enough to “parse the content as Markdown”

I think that would be enough. If closer integration is required, an extension is the right way to do it.

Are the current parser internals suitable to be made public as API? If yes, we’re good. But IMO the answer is definitely no.

They are internals. They aren't expected to be called by the user, directly. Doing that otherwise than in specific cases is error prone unless user knows how Hoedown works internally.

Yes, they are modular and IMO are ready to be exposed, but since they're internals, the user must know how to use them.

How should they be exposed? Simply make everything non-static is not a good idea, but we’ll need some way to abstract out at least some the parsing part to keep internal things internal. This is not impossible, as many other parsers do that, but much work would need to be done before we get to the custom directive part.

Really agree on that. And I'm afraid we'd be killing the simplicty of Hoedown's API in the process.

@mildsunrise
Copy link
Member

Specifically, calling a Hoedown internal is only acceptable if:

  1. We're in the context of a render, that is, inside a renderer callback.
  2. The data that we'll be passing to the internal also comes from Hoedown, i.e. from the callback parameters.

If these conditions apply, calling parse_block or parse_inline is safe and straighforward.

Calling a hoedown internal in any other case will result in undefined behaviour. (wrong output in the best case, crashing in the worst case)

@mildsunrise
Copy link
Member

I've found a problem while trying to implement this.
parse_block mutates the data as it parses it.

Therefore things like that:

void rndr_directive(hoedown_buffer *ob,  const hoedown_buffer *name,  const hoedown_buffer *header, const hoedown_buffer *text, void *opaque) {
    BUFPUTSL(ob, "<div class=\"directive-");
    hoedown_buffer_put(ob, name->data, name->size);
    BUFPUTSL(ob, "\">");

    parse_block(ob, document, text->data, text->size);

    BUFPUTSL(ob, "</div>");
}

Would fail since you cannot modify text, it's marked const. You'd have to create and maintain a temporal buffer and copy the data to it.

@uranusjr
Copy link

uranusjr commented Sep 5, 2014

Why does the parser need to mutate the input data? If there’s no good reason for it, I propose just change all parsing functions to take const uint8_t * instead. I just skimmed through document.c, and most usages of data that isn’t compatible with this are for setting the work buffer to be used in the renderer like this:

hoedown_buffer work = { NULL, 0, 0, 0, NULL, NULL, NULL };
work.data = data + some_offset;
work.size = some_size;
doc->md.some_rendering_callback(ob, &work, doc->md.opaque);

which can be easily refactored into

hoedown_buffer *work = hoedown_buffer_new(64);
hoedown_buffer_put(work, data + some_offset, some_size);
doc->md.some_rendering_callback(ob, work, doc->md.opaque);
hoedown_buffer_free(work);

And I think this makes the code more readable. I would work on this this weekend if there’re no issues. Shouldn’t be too much trouble.

@mildsunrise
Copy link
Member

Why does the parser need to mutate the input data? If there’s no good reason for it, I propose just change all parsing functions to take const uint8_t * instead.

I completely agree.

Just spent an hour changing all parsers, triggers and prefix checkers to accept const uint8_t * and luckily the only point where the data is actually mutated is in parse_blockquote, here.

It seems this was an optimization for performance (instead of copying, just move the data in place) but I think it's unacceptable. This will be changed.

Everything else is fine.

most usages of data that isn’t compatible with this are for setting the work buffer

This would be fixed with a cast. We know the callback won't modify the data, since it's passed a const hoedown_buffer *, so it should be safe.

Doing hoedown_buffer_new and putting the data there would decrease performance (allocation and copying takes time) and we don't really need it since the data isn't being modified.

@uranusjr
Copy link

uranusjr commented Sep 5, 2014

But wouldn’t you still be able to modify contents of uint8_t *data, even if it’s contained in a const hoedown_buffer *? For example:

void (*hrule)(hoedown_buffer *ob, void *opaque)
{
    ob->data[5] = '\0';
    // ...
}

I mean, obviously the bundled HTML renderer doesn’t do this, and custom renderers shouldn’t do it. But the point is that const hoedown_buffer * doesn’t make the data pointer inside it safe.

Another way to ensure data safety is to specifically declare an immutable buffer type:

struct hoedown_buffer {
    const uint8_t *data;
    size_t size;
    size_t asize;
    size_t unit;

    hoedown_realloc_callback data_realloc;
    hoedown_free_callback data_free;
    hoedown_free_callback buffer_free;
};

and use it instead of hoedown_buffer at the right places. But maybe I’m just overthinking this because it should not be modified, and the user should take the consequences if he/she does something like this.

@mildsunrise
Copy link
Member

Yes, I know you can modify the data pointed to by the buffer, but if you use any of hoedown_buffer_ functions on a const hoedown_buffer *, the compiler will warn about it.

Even if you don't pay attention to the warnings (or don't enable them), the assertion for buf->unit > 0 will fail and the program will crash. A buffer with unit = 0 is read-only.

@SimonSapin
Copy link
Author

@SimonSapin
Copy link
Author

The syntax is the origin message here was only a proposal. If the CommonMark folks settle on some syntax, it’d be preferable to align with that.

@mildsunrise
Copy link
Member

Hmm... I think we can leave this aside for a bit, while we center on finishing v3 and adopting CommonMark for v4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants