Writing Extensions
Extensions need to extend the parser, or the HTML renderer, or both. To use an extension, the builder objects can be configured with a list of extensions. Because extensions are optional, they live in separate artifacts, so additional dependencies need to be added as well.
The best way to create an extension is to start with a copy of an existing one and modify the source and tests to suit the new extension.
ℹ️ Thanks to Alex Karezin for setting up Flexmark Architecture and Dependencies Diagrams and https://sourcespy.com. You can now get an overview of module dependencies with ability to drill down to packages and classes, updated from the repository sources. 👍
Parsing proceeds in distinct steps, with the ability to add custom processing at each step.
-
Text in the source is broken up into Block nodes. Block parsers decide whether the current line begins a block they recognize as one they create. Any lines not claimed by some block parser are claimed by the core ParagraphParser. A block parser can claim lines currently accumulated by the paragraph parser, with the ability to replace it as the currently active parser.
-
Paragraph blocks are processed to remove leading lines that should not be processed as text. For example link references are processed and removed from the paragraphs. Only full lines should be processed at this step. Partial removal of text should be done at the next step.
The inline parser instance is available to blocks at this time to allow them to use the API to help process the text. Processors that require inline processing of their contents should be run after the core reference link paragraph processor, otherwise some link refs will not be recognized because they are not defined yet. A custom processor factory should return ReferencePreProcessorFactory.class from getAfterDependents(). Similarly, if a paragraph processor needs to run before another pre-processor then it should return that processor's factory class from getBeforeDependents().
Any circular dependencies will cause an IllegalStateException to be thrown during the builder preparation stage.
The paragraph pre-processing is divided into stages of paragraph pre-processors to allow for dependencies to be respected.
Paragraph pre-processors can also declare, via affectsGlobalScope(), that some document properties are affected by the result of their processing. For example, ReferencePreProcessorFactory does so because reference link definitions affect the REFERENCES node repository.
Since a global processor will be run on all the paragraphs in the document before one of its dependents is allowed to process any paragraphs, global processors will be the only processor in their respective pre-processing stage.
Non global processors within the same stage will be run sequentially on each paragraph block until no more changes to the paragraph are made. This means that non-global processors of the same stage are allowed to have intermixed content, while global ones will only be run once for each paragraph. Non-global processors dependent on one or more global processors will be run at the first available stage after all their global dependencies have completed processing.
The order of pre-processors within a stage will be based on dependencies between processors and where not so constrained on the registration order of their corresponding extensions.
⚠️ It is best to implement the desired customization by using block parsers rather than paragraph pre-processors. Use the latter only if proper interpretation is not possible without using the inline parser API. Using the inline parser API during block parsing is a serious performance issue.
-
Block pre processing is strictly for custom processors. At this point block nodes can be added, replaced or removed. Any other nodes can also be added to the AST, however no inline blocks have been created at this point.
Node creation and removal should be communicated to the ParserState instance via its implementation of the BlockParserTracker and BlockTracker interfaces. This is necessary to allow the internal parser optimization structures to be updated so that further block pre-processing will proceed correctly.
-
During inline processing each block is given the chance to process inline elements contained in its text node or nodes. There are two types of customizations available at this step: link ref processing and delimiter processing. Delimiters are runs of text that have a start and end character to determine their span. Delimiters may be nested and have a minimum and maximum level of nesting.
Link ref processors are responsible for processing custom elements that are recognized as possible link refs, i.e. they are delimited by ![ or [ and terminated by ]. Link ref processors determine whether brackets can be nested, whether the ! should be processed as text or as part of their node, and whether they accept the potential link ref text as their node. The full text, ![] or [], is given for approval to give maximum flexibility in handling contained white space.
The Footnote [^footnote] and Wiki link [[wiki link]] extensions make use of this extension mechanism.
-
The post processing step is for final AST modifications. Post processors come in two varieties: node and document post processors. Although the PostProcessor interface is used for both types, a post processor can only be one or the other.
Node post processors specify the classes of nodes they want to post process, with ancestor exclusion criteria. The process(NodeTracker, Node) function of the processor will be called for every AST node that matches the given criteria.
Any modifications to the AST must be communicated to the NodeTracker instance, which is responsible for updating the internal structures used to optimize node selection for processors.
Specifically, each new node added or moved in the AST hierarchy will need to have its ancestor list updated for further node post processing. These notification functions should be called after the particular hierarchy change is complete to eliminate unnecessary updates for intermediate AST changes.
Nodes that contain child nodes which are new or have been moved from their previous parents need to be reported via the nodeAddedWithChildren(Node) notification, rather than using the nodeAdded(Node) callback for each individual node. Similarly, greater depth changes should be communicated via the nodeAddedWithDescendants(Node) notification.
Complete node removals should be communicated via the nodeRemoved() function after all of the node's children that need to be moved elsewhere have been removed.
⚠️ All node removal functions will remove the node and all its descendants, since any child nodes of an unlinked node are removed from the AST.
Document post processors are invoked using the processDocument(Document) member function and the returned document will be used as the document for further processing.
Document processors are responsible for finding nodes of interest by recursively traversing the AST. For this reason, a document post processor should only be used when processing cannot be done on individual nodes.
Although traversing the AST for a single extension is faster than creating, maintaining and accessing the optimization structures in the parser, doing so with just two extensions on a large document is much slower.
The performance gain of node post processors is especially pronounced for extensions that exclude nodes based on their ancestor types in the AST. For node post processors this hierarchy is determined during a single traversal of the AST to build all the node tracking structures. If the extension determines inheritance by walking back through the getParent() function of a node, this becomes very inefficient on large documents.
-
The HTML rendering step renders the final AST. Extensions provide default renderers for their custom nodes. Rendering for any node can be customized by replacing the default renderer or through an attribute provider that will override HTML element attributes for default renderers. LinkResolvers are responsible for converting the link url text from the text in the markdown element to the rendered URL.
-
Include File Support allows extensions to copy their custom reference defining elements from included document to the including document so that any included custom elements requiring definition of these references will be resolved correctly before rendering. This is an optional step that should be performed by the application after parsing the document and before rendering it. See Include Markdown and HTML File Content
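The dependency rules described above for paragraph pre-processors amount to a dependency sort with cycle detection. The following is a minimal sketch of that idea only, not flexmark's implementation: `DependencySorter` and the processor names used with it are hypothetical, and stages and global processors are left out. Each entry lists what must run before it, unconstrained entries keep registration order, and a cycle raises `IllegalStateException`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for dependency resolution; not flexmark's implementation.
class DependencySorter {
    // name -> names that must run before it (the "after dependents" of the text)
    private final Map<String, List<String>> runAfter = new LinkedHashMap<>();

    void register(String name, String... after) {
        runAfter.put(name, Arrays.asList(after));
    }

    // Returns names in an order that respects the dependencies.
    // Throws IllegalStateException when the dependencies are circular.
    List<String> sorted() {
        List<String> result = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new LinkedHashSet<>();
        for (String name : runAfter.keySet()) {
            visit(name, done, inProgress, result);
        }
        return result;
    }

    private void visit(String name, Set<String> done, Set<String> inProgress, List<String> result) {
        if (done.contains(name)) return;
        if (!inProgress.add(name)) {
            // we came back to an entry that is still being resolved
            throw new IllegalStateException("circular dependency involving " + name);
        }
        for (String dependency : runAfter.getOrDefault(name, Collections.<String>emptyList())) {
            visit(dependency, done, inProgress, result);
        }
        inProgress.remove(name);
        done.add(name);
        result.add(name);
    }
}
```

Registering a hypothetical `custom` processor that runs after `references` yields the order `references`, `custom`, regardless of which was registered first.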
To track source location in the AST, all parsing is performed using the BasedSequence class, which extends CharSequence and wraps the original source character sequence of the document with start and end offsets to represent its own contents. subSequence() returns another BasedSequence instance with the original base sequence and new start/end offsets.
In this way the source location representing any string being parsed can be obtained using BasedSequence.getStartOffset() and BasedSequence.getEndOffset(). At the same time parsing is no more complicated than working with CharSequence. Any string stored in the AST has to be a subSequence() of the original source. This constraint makes sense since the AST represents the source.
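To make the offset-tracking idea concrete, here is a minimal sketch of a sequence that keeps its base text and offsets. `OffsetSequence` is a hypothetical, simplified stand-in written for illustration; it is not the flexmark BasedSequence class and omits everything except the offset bookkeeping.

```java
// Hypothetical, simplified stand-in for the idea behind BasedSequence.
final class OffsetSequence implements CharSequence {
    private final CharSequence base; // original source text, shared by all sub-sequences
    private final int start;         // inclusive offset into base
    private final int end;           // exclusive offset into base

    OffsetSequence(CharSequence base, int start, int end) {
        if (start < 0 || end > base.length() || start > end) {
            throw new IndexOutOfBoundsException("range " + start + ".." + end);
        }
        this.base = base;
        this.start = start;
        this.end = end;
    }

    static OffsetSequence of(CharSequence source) {
        return new OffsetSequence(source, 0, source.length());
    }

    int getStartOffset() { return start; }

    int getEndOffset() { return end; }

    @Override
    public int length() { return end - start; }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) throw new IndexOutOfBoundsException("index " + index);
        return base.charAt(start + index);
    }

    // The result shares the same base, so its offsets still refer to the original source.
    @Override
    public OffsetSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException("range " + from + ".." + to);
        }
        return new OffsetSequence(base, start + from, start + to);
    }

    @Override
    public String toString() { return base.subSequence(start, end).toString(); }
}
```

Taking a sub-sequence of a sub-sequence adds the offsets up, so a token cut out of a line still reports its position in the whole document.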
The fly in the ointment is that parsing unescaped text from the AST is a bit more involved, since it is the escaped original which must be added to the AST. For this, methods were added to the Escaping utility class that take a BasedSequence and a ReplacedTextMapper. The returned result is a modified sequence whose contents can be mapped to the original source using the methods of the ReplacedTextMapper object. This allows parsing of massaged text with the ability to extract the un-massaged counterpart for placement in the AST. See the implementation of AutolinkNodePostProcessor in flexmark-ext-autolink for an example of how this is achieved in a working extension.
Similarly, when using regex matching you cannot simply take the string returned by group() but must extract a subSequence from the input using the start/end offsets for the group. The best way to do this is to take a subSequence() from the original sequence that was used to create the matcher(); this eliminates errors in offset computation. Examples of this are abundant in the core parser implementation.
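A small sketch of the regex technique follows. It relies only on `java.util.regex`; the `Span` class is a hypothetical minimal offset-tracking sequence standing in for BasedSequence, and the bracket pattern is an illustrative assumption rather than a pattern taken from the parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical minimal offset-tracking sequence, standing in for BasedSequence.
final class Span implements CharSequence {
    final String source; // original text
    final int start;     // inclusive offset into source
    final int end;       // exclusive offset into source

    Span(String source, int start, int end) {
        this.source = source;
        this.start = start;
        this.end = end;
    }

    @Override public int length() { return end - start; }
    @Override public char charAt(int index) { return source.charAt(start + index); }
    @Override public Span subSequence(int from, int to) { return new Span(source, start + from, start + to); }
    @Override public String toString() { return source.substring(start, end); }
}

final class GroupExtractor {
    // Illustrative pattern: text between square brackets.
    private static final Pattern LABEL = Pattern.compile("\\[([^\\]]+)\\]");

    // Returns the first bracketed label as a Span, or null when nothing matches.
    static Span firstLabel(Span input) {
        Matcher matcher = LABEL.matcher(input);
        if (!matcher.find()) return null;
        // matcher.group(1) would be a detached String with no source position;
        // slicing the sequence the matcher was created from keeps the offsets.
        return input.subSequence(matcher.start(1), matcher.end(1));
    }
}
```

Because the slice is taken from the same sequence that was passed to `matcher()`, the group offsets can be used directly without any arithmetic.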
The overhead is a small price to pay for having complete source reference in the AST and ease of parsing without having to carry a separate state to represent source position or use dedicated grammar tools for the task.
Source tracking in the core was complicated by leading tab expansion and prefix removal from parsed lines, with later concatenation of these partial results for inline parsing, which must also track the original source position. This was addressed with additional BasedSequence implementation classes: PrefixedSubSequence for partially used tabs and SegmentedSequence for concatenated sequences. The result is an almost transparent propagation of source position throughout the parsing process.
ℹ️ The implementation of SegmentedSequence was changed in version 0.60.0 to use a binary search tree to find the segment containing a character index, along with per thread caching and optimizations that eliminate the need to search for a segment in almost all sequential charAt access, including forward and backward sequence scans from the beginning or end of the sequence. The small penalty for charAt is outweighed by the reduction of overhead for segmented sequences from 200% in the old implementation to an average of 5.7% in the new implementation. Additionally, when the segments represent full lines, as they do in the parser, there is no perceived or measurable penalty for charAt access. On a test with 1000+ files and a total parse time of 2.8 seconds for all files, the penalty for using binary tree segmented sequences is less than 50ms and is within the error margin of such simple timing measurements.
Construction of SegmentedSequence was also changed to use a BasedSequenceBuilder instance, which eliminates the concern of having the segments in the right order and having to use PrefixedSubSequence for all non based text. Just get a builder instance from any BasedSequence.getBuilder() and invoke its methods to append a char or CharSequence. It will generate an optimized segment list, converting any overlapping or out of order base sequence ranges to inserted text. When done, invoke BasedSequenceBuilder.toSequence() to get an optimized based sequence that is equivalent to all appended text.
The builder is efficient enough to be used as an Appendable for accumulating text for Formatter and HtmlRenderer with a performance penalty factor of less than 8. That may sound like a high price, but the benefit is being able to take the formatted or rendered HTML result and extract source position information for any text which came from the original file. If you need this information, a factor of 7.4 is acceptable, especially when generating it requires no extra effort beyond using BasedSequenceBuilder as the appendable.
A generic options API was added to allow easy configuration of the parser, renderer and
extensions. It consists of DataKey<T>
instances defined by various components. Each data key
defines the type of its value, a default value and optionally .
The values are accessed via the DataHolder
and MutableDataHolder
interfaces, with the former
being a read only container. Since the data key provides a unique identifier for the data there
is no collision for options.
The Parser.EXTENSIONS option holds a list of extensions to use for the Parser and HtmlRenderer. This allows configuring the parser and renderer with a single set of options.
To configure the parser or renderer, pass a data holder to the builder()
method.
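// Imports assume the flexmark 0.60+ package layout.
import com.vladsch.flexmark.ext.tables.TablesExtension;
import com.vladsch.flexmark.html.HtmlRenderer;
import com.vladsch.flexmark.parser.Parser;
import com.vladsch.flexmark.util.ast.KeepType;
import com.vladsch.flexmark.util.data.DataHolder;
import com.vladsch.flexmark.util.data.MutableDataSet;
import java.util.Arrays;

public class SomeClass {
    static final DataHolder OPTIONS = new MutableDataSet()
            .set(Parser.REFERENCES_KEEP, KeepType.LAST)
            .set(HtmlRenderer.INDENT_SIZE, 2)
            .set(HtmlRenderer.PERCENT_ENCODE_URLS, true)
            .set(Parser.EXTENSIONS, Arrays.asList(TablesExtension.create()))
            .toImmutable();
    static final Parser PARSER = Parser.builder(OPTIONS).build();
    static final HtmlRenderer RENDERER = HtmlRenderer.builder(OPTIONS).build();
}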
In the code sample above, Parser.REFERENCES_KEEP defines the behavior of references when duplicate references are defined in the source. In this case it is configured to keep the last value, whereas the default behavior is to keep the first value.
The HtmlRenderer.INDENT_SIZE and HtmlRenderer.PERCENT_ENCODE_URLS keys define options to use for rendering. Similarly, other extension options can be added at the same time. Any options not set will default to their respective defaults as defined by their data keys.
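The way a typed key can carry its own default and avoid collisions can be sketched in a few lines. This is a hypothetical miniature written for illustration: `OptionKey` and `OptionSet` are invented names, the default values in the usage are arbitrary, and flexmark's DataKey and DataHolder types are considerably richer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of the typed-key idea; not flexmark's DataKey.
final class OptionKey<T> {
    final String name;
    final T defaultValue;

    OptionKey(String name, T defaultValue) {
        this.name = name;
        this.defaultValue = defaultValue;
    }
    // equals/hashCode are deliberately not overridden: the key object itself,
    // not its name, identifies the option, so two components cannot collide.
}

// Hypothetical miniature of a mutable option container.
final class OptionSet {
    private final Map<OptionKey<?>, Object> values = new HashMap<>();

    <T> OptionSet set(OptionKey<T> key, T value) {
        values.put(key, value);
        return this; // allow chaining, as in the sample above
    }

    @SuppressWarnings("unchecked")
    <T> T get(OptionKey<T> key) {
        // unset options fall back to the default carried by the key
        return values.containsKey(key) ? (T) values.get(key) : key.defaultValue;
    }
}
```

With this shape, the type parameter on the key gives compile time checking of both `set` and `get`, and an option that was never set still has a well defined value.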
All markdown element reference types should be stored using a subclass of NodeRepository<T>, as is the case for references, abbreviations and footnotes. This provides a consistent mechanism for overriding the default behavior of these references for duplicates from keep first to keep last.
By convention, data keys are defined in the extension class and, in the case of the core, in the Parser or HtmlRenderer.
As of version 0.60, the DataHolder argument passed to the DataValueFactory::create() method will be null when creating a read-only default value instance for use by the key. The class constructor should be able to handle this case seamlessly. To make it convenient to implement such classes, use the DataKey.getFrom(DataHolder) method instead of the DataHolder.get(DataKey) method to access the values of interest. The former will provide the key's default value if the data holder argument is null, while the latter will generate a run time java.lang.ExceptionInInitializerError.
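The difference between asking the key and asking the holder can be shown with a short sketch. All names here (`SafeKey`, `SafeHolder`, `SketchTableOptions`, `MIN_COLUMNS`) are hypothetical and only mimic the shape of the API described above; the point is that going through the key gives a natural place for the null check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder; calling get() on a null reference to it would fail.
final class SafeHolder {
    private final Map<SafeKey<?>, Object> values = new HashMap<>();

    <T> SafeHolder set(SafeKey<T> key, T value) {
        values.put(key, value);
        return this;
    }

    @SuppressWarnings("unchecked")
    <T> T get(SafeKey<T> key) {
        return values.containsKey(key) ? (T) values.get(key) : key.defaultValue;
    }
}

// Hypothetical key with a null-safe accessor.
final class SafeKey<T> {
    final T defaultValue;

    SafeKey(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    // Falls back to the default when no holder is supplied.
    T getFrom(SafeHolder holder) {
        return holder == null ? defaultValue : holder.get(this);
    }
}

// Hypothetical options class whose constructor tolerates a null holder.
final class SketchTableOptions {
    static final SafeKey<Integer> MIN_COLUMNS = new SafeKey<>(1);

    final int minColumns;

    SketchTableOptions(SafeHolder options) {
        // options.get(MIN_COLUMNS) would throw when options is null
        minColumns = MIN_COLUMNS.getFrom(options);
    }
}
```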
For Option data keys for the Parser see Extensions: Parser
For Option data keys for the HtmlRenderer see Extensions: Renderer
-
PhasedNodeRenderer and ParagraphPreProcessor interfaces were added with associated Builder methods for extending the parser.
PhasedNodeRenderer allows an extension to generate HTML for various parts of the HTML document. These phases are listed in the order of their occurrence during document rendering:
HEAD_TOP
HEAD
HEAD_CSS
HEAD_SCRIPTS
HEAD_BOTTOM
BODY_TOP
BODY
BODY_BOTTOM
BODY_LOAD_SCRIPTS
BODY_SCRIPTS
The BODY phase is the standard HTML generation phase using the NodeRenderer::render(Node ast) method. It is called for every node in the document.
The other phases are only called on the Document root node and only for custom renderers that implement the PhasedNodeRenderer interface, through the PhasedNodeRenderer::render(Node ast, RenderingPhase phase) method.
The extension can call context.render(ast) and context.renderChildren(ast) during any rendering phase. The functions will process the node as they do during the BODY rendering phase. The FootnoteExtension uses the BODY_BOTTOM phase to render the footnotes referenced within the page. Similarly, the Table of Contents extension can use the BODY_TOP phase to insert the table of contents at the top of the document.
The HEAD... phases are not used by any extension but can be used to generate a full HTML document, with style sheets and scripts.
-
CustomBlockParserFactory, BlockParserFactory and BlockParser are used to extend the parsing of blocks that handle partitioning of the document into blocks, which are then parsed for inlines and post processed.
-
ParagraphPreProcessor and ParagraphPreProcessorFactory interfaces allow customization of pre-processing of block elements at the time they are closed by the parser. This is done by the ParagraphParser to extract leading reference definitions from the paragraph. Special handling of the ParagraphParser block was removed from the parser and instead a generic mechanism was added to allow any BlockParser to perform similar functionality and to allow adding custom pre-processors to handle elements other than the built in reference definitions.
-
BlockPreProcessor and BlockPreProcessorFactory interfaces allow pre-processing of blocks after ParagraphPreProcessor instances have run but before inline parsing is performed. This is useful if you want to replace a standard node with a custom one based on its context or children but not on inline element information. Currently this mechanism is not used and may be removed in the future if it does not prove to be useful.
-
Document level, extensible properties were added to allow extensions to have document level properties which are available during rendering. While parsing, these are available from ParserState::getProperties() on the state parameter, and during post-processing and rendering from the Document node reachable via the getDocument() method of any Node.
The DocumentParser and Document properties will also contain options passed or defined on the Parser.builder() object, in addition to any added in the process of parsing the document.
⚠️ HtmlRenderer options are only available on the rendering context object. NodeRenderer extensions should check for their options using NodeRendererContext.getOptions(), not the getDocument() method. If HtmlRenderer was customized with options which were not passed to Parser.Builder then these options will not be available through the document properties. The node renderer context options will contain all custom options defined for HtmlRenderer.builder() and all document properties, which will contain all options passed to Parser.builder() plus any defined during the parsing process. If an option is customized or defined in the renderer, its value from the document will not be accessible through the context options. For these you will need to use the document available through the rendering context getDocument() method.
DataKey defines the property, its type and default value instantiation. DataHolder and MutableDataHolder interfaces are used to access or set properties, respectively.
NodeRepository is an abstract class used to create repositories for nodes: references, footnotes and abbreviations.
-
Since the AST now represents the source of the document, not the HTML to be rendered, the text stored in the AST must be as it is in the source. This means that all un-escaping and resolving of references has to be done during the rendering phase. For example, a footnote reference to an undefined footnote will be rendered as if it was a Text node, including any emphasis embedded in the footnote id. If the footnote reference is defined it will render both as expected.
Handling of disparate end of lines used in the source must also now be done in the rendering phase. This means that text which contains end of lines must be normalized before it is rendered, since it is no longer normalized during parsing.
This extra processing is not difficult to implement since the necessary member methods were added to the BasedSequence class, which is used to represent all text in the AST.
-
Nodes do not define an accept(Visitor) method. Instead visitor handling is delegated via VisitHandler instances and NodeVisitor derived classes.
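Delegating visits to registered handlers, instead of an `accept` method on every node, can be sketched as a map from node class to handler. Everything in this block is hypothetical and written for illustration (`SketchNode`, `TextNode`, `EmphasisNode`, `HandlerVisitor`); in particular, the rule that a handled node's children are visited only when its handler asks for it is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical node types; note that none of them has an accept() method.
class SketchNode {
    final List<SketchNode> children = new ArrayList<>();

    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }
}

final class TextNode extends SketchNode {
    final String text;

    TextNode(String text) {
        this.text = text;
    }
}

final class EmphasisNode extends SketchNode {
}

// Hypothetical visitor that dispatches on the node's class.
final class HandlerVisitor {
    private final Map<Class<?>, Consumer<SketchNode>> handlers = new HashMap<>();

    // Registers a handler for exactly one node class.
    <N extends SketchNode> HandlerVisitor on(Class<N> type, Consumer<N> handler) {
        handlers.put(type, node -> handler.accept(type.cast(node)));
        return this;
    }

    void visit(SketchNode node) {
        Consumer<SketchNode> handler = handlers.get(node.getClass());
        if (handler != null) {
            handler.accept(node); // the handler may call visitChildren() itself
        } else {
            visitChildren(node);  // unhandled nodes are simply descended into
        }
    }

    void visitChildren(SketchNode node) {
        for (SketchNode child : node.children) {
            visit(child);
        }
    }
}
```

New node types added by an extension need no changes to existing node classes; the extension only registers a handler for its own class.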
Unified options handling was added, which can also be used to selectively disable loading of core processors for greater customization.
Parser.builder() now implements MutableDataHolder so you can use get/set to customize properties.
New extension points for the parser:
-
ParagraphPreProcessor is used by the ParagraphBlock to extract reference definitions from the beginning of the paragraph, but can be used by any other block for the same purpose. Any custom block pre-processors will be called first, in order of their registration. Multiple calls may result since removal of some text can expose text for another pre-processor. Block pre-processors are called until no changes to the block are made.
-
InlineParserFactory is used to override the default inline parser. Only one custom inline parser factory can be set. If none are set then the default will be used.
-
LinkRefProcessor is used to create custom elements that syntactically derive from link refs: [] or ![]. This will work correctly for nested [] in the element and allows for treating the leading ! as plain text if the custom element does not use it. Footnotes ([^footnote ref]) and wiki links ([[]] or [[text|link]]) are examples of such elements.
An option to have blank lines appear as BlankLine nodes in the AST requires any custom Block nodes which you want to contain blank lines to implement the BlankLineContainer interface. It has no methods at this time and is only used to identify which blocks can contain blank lines.
Custom node classes that inherit from ListBlock and ListItem will automatically inherit the blank line handling, except for moving blank lines at the end of the block to the parent when the block is closed. This should be done by calling the Node.moveTrailingBlankLines() method on the node.
Unified options handling added, existing configuration options were kept but now they modify the corresponding unified property.
Renderer Builder() now has an indentSize(int) method to set the size of indentation for hierarchical tags. This is the same as setting the HtmlRenderer.INDENT_SIZE data key in options.
All the HtmlWriter methods now return this so method chaining can be used. Additionally, tag() and tagIndent() methods that take a Runnable will automatically close the tag and un-indent after the run() method is executed. This makes seeing the HTML hierarchy easier in the rendered output.
Instead of writing out all the opening, closing tags and attributes individually:
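// Verbose form: every tag and line break is written out by hand.
class CustomNodeRenderer implements NodeRenderer {
    public void render(BlockQuote node, NodeRendererContext context, HtmlWriter html) {
        html.line();
        html.tag("blockquote", getAttrs(node)); // opening tag with explicit attributes
        html.line();
        context.renderChildren(node);
        html.line();
        html.tag("/blockquote");                // closing tag must be written explicitly
        html.line();
    }
}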
You can combine them and use a lambda to render the children, so that indentation and the closing tag are handled automatically:
class CustomNodeRenderer implements NodeRenderer {
    public void render(BlockQuote node, NodeRendererContext context, HtmlWriter html) {
        // tagIndent() opens the tag, runs the lambda for the children, then closes the tag
        html.withAttr().tagIndent("blockquote", (Runnable) () -> context.renderChildren(node));
    }
}
For the penalty of increased stack use, the added benefits are:
- child tags are indented
- attributes are easier to handle since they only require setting the attributes with .attr() and using the .withAttr() call before the tag...() method
- the tag is automatically closed
The previous behavior of using an explicit attribute parameter is still preserved.
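How a Runnable-taking tag method can close the tag and restore indentation on its own is easy to see in a toy writer. The `TagWriter` class below is a hypothetical miniature written for this illustration, not flexmark's HtmlWriter; it handles only whole lines and has no attribute support.

```java
// Hypothetical miniature writer; not flexmark's HtmlWriter.
final class TagWriter {
    private final StringBuilder out = new StringBuilder();
    private final int indentSize;
    private int depth; // current nesting level

    TagWriter(int indentSize) {
        this.indentSize = indentSize;
    }

    // Writes one line at the current indentation and returns this for chaining.
    TagWriter line(String text) {
        for (int i = 0; i < depth * indentSize; i++) {
            out.append(' ');
        }
        out.append(text).append('\n');
        return this;
    }

    // Opens the tag, runs the body one level deeper, then closes the tag.
    TagWriter tagIndent(String name, Runnable body) {
        line("<" + name + ">");
        depth++;
        try {
            body.run();
        } finally {
            depth--; // indentation is restored even if the body throws
        }
        line("</" + name + ">");
        return this;
    }

    @Override
    public String toString() {
        return out.toString();
    }
}
```

Nesting calls, for example `html.tagIndent("blockquote", () -> html.tagIndent("ul", () -> html.line("<li>item 1</li>")))`, mirrors the shape of the generated HTML, at the cost of one extra stack frame per open tag.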
The indentation is useful for testing because it is easier to visually validate and correlate this markdown:
> - item 1
> - item 2
>     1. item 1
>     2. item 2
with the rendered html:
<blockquote>
  <ul>
    <li>item 1</li>
    <li>item 2
      <ol>
        <li>item 1</li>
        <li>item 2</li>
      </ol>
    </li>
  </ul>
</blockquote>
than this:
<blockquote>
<ul>
<li>item 1</li>
<li>item 2
<ol>
<li>item 1</li>
<li>item 2</li>
</ol>
</li>
</ul>
</blockquote>
You can get a renderer sub-context from the current rendering context and set a different html writer with a different "do not render links" setting. Use TextCollectingAppendable and pass it to the NodeRendererContext.getSubContext() method when you need to capture html from nodes. TocExtension uses this to get the header text html, but without any links which may be embedded in the heading.
The flexmark-formatter
module renders the AST as markdown with various formatting options to
clean up and make the source consistent. This also comes with an API to allow extensions to
provide and handle formatting options for custom nodes.
Default rendering for Block nodes will render the children with any prefix/suffix characters defined for the node. Non-block nodes will simply render their character content as is.
To provide custom formatting, implement FormatterExtension and register a NodeFormatterFactory for your node formatter. The behavior and API are similar to NodeRenderer, except that markdown is expected and the FormattingPhase is adapted to Markdown documents instead of HTML.
The best source of sample code is the existing extensions that implement custom node formatting and the formatter module itself:
flexmark-ext-abbreviation
flexmark-ext-definition
flexmark-ext-footnotes
flexmark-ext-jekyll-front-matter
flexmark-ext-tables
flexmark-formatter