-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use TinyMCE utilities to parse post content #426
Conversation
e4087e3
to
11dd68c
Compare
There is a lot that can be significantly improved here, but this PR has reached parity with the existing parsing code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit dense to digest (as parsing tends to be), but this reads and looks to work quite well 👍 What would you say is the main advantage of this method of parsing vs. the current PegJS grammar? Consistency with TinyMCE behavior of the Editable component? Is the use of TinyMCE for parsing utilities merely for convenience since it's already available?
Just trying to clarify boundaries between the role of TinyMCE and the concept of a block generally i.e. a post could consist of blocks of which none depend on TinyMCE, at least in the current implementation.
Implement schemas for the kinds of markup that blocks can accept, rather than simply querying the DOM for attributes.
Do you think there's a path here where we have a query API like the one that exists currently, but operating on the node produced by TinyMCE's parser? Trying to think of ways we can balance requirements of parsing robustness, flexibility of supporting variations of markup (e.g. with/without caption, citation), and developer experience generally.
If not, what else do you have in mind? Fine to defer this discussion to a separate issue also. Maybe composing an object by hooking into specific node types in a parser stream? Accessing values on the parsed tree a la XPath / E4X ? The latter is a neat case since React's (JSX's) object-to-HTML model is partly based on E4X, so this could serve as a natural completion of the other half (HTML-to-object). This last link also leads to JXON, which has some ideas around representing XML as a JavaScript object, which could be combined with a Lodash _.get
-like path lookup for transforming HTML to an object.
); | ||
|
||
// Create a custom HTML schema | ||
const schema = new tinymce.html.Schema(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget why we chose to reference the global directly for Editable, but do you think it'd make sense to import TinyMCE as a module here to avoid the need for ProvidePlugin
in the tests? I suppose in order to avoid including TinyMCE into the generated bundle we'd still need to modify the Webpack configuration to externalize TinyMCE (as we do with React currently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed the idea behind referencing the global directly in Editable was that as a plugin we wanted to use the TinyMCE version that's bundled with WordPress. I wanted to preserve that behavior here.
We could import it as a module here, but as you say, it looks like we'll need to modify the Webpack configuration either way, and I'd prefer to keep usage consistent in the code (currently, use the global version).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed the idea behind referencing the global directly in Editable was that as a plugin we wanted to use the TinyMCE version that's bundled with WordPress.
Right, and we can do that with Webpack externals. I'm fine to leave this for a separate pull request.
// In : <!-- wp:core/embed url:youtube.com/xxx& --> | ||
// Out : <wp-block slug="core/embed" attributes="url:youtube.com/xxx&"> | ||
content = content.replace( | ||
/<!--\s*(\/?)wp:([a-z0-9/-]+)((?:\s+[a-z0-9_-]+:[^\s]+)*)\s*-->/g, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL about ?:
. Handy.
Do you have any thoughts on if and how we would support spaces in an attribute value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - on that front, here is what I would like to see happen next:
- Make it possible to extend TinyMCE's parser to recognize our comment delimiters. Get rid of this hack to turn them into temporary tags.
- Change the attribute format in our comment delimiters to match regular HTML tags:
<!-- wp:core/block attr="something with spaces" -->
. This is more standard and will be much easier for TinyMCE's parser to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about just using JSON for the comment attributes? Seems even easier to parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSON would definitely be cleaner, and perhaps easier to parse when starting from nothing, but when reusing TinyMCE's parsing that is already working well for HTML tags I'm not sure.
blocks/api/parser.js
Outdated
@@ -45,31 +44,150 @@ export function getBlockAttributes( blockNode, blockSettings ) { | |||
return attrs; | |||
} | |||
|
|||
function htmlEscape( str ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have it, maybe defer to Lodash's equivalents (also in Underscore if we end up going that route)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 9866944.
Ability to deal with more kinds of markup gracefully. For one example, if block delimiters get inserted into the middle of a block, we will be able to recover from that situation much better because we now parse the entire tree of post content in a single step. Previously, this would lead to block markup being split, TinyMCE parsing would kick in when we
This is also a big benefit - we're now using an algorithm/library to parse WP post content that has years of extensive real-world usage and testing behind basically this exact task (minus our new delimiters).
I don't see a problem with this. If needed, it should be possible to make just the parsing logic available separately from the rest of TinyMCE, and we should continue to make sure the actual implementation of a block doesn't depend on TinyMCE. |
It's hard for me to see how this will work out in practice as well. I am definitely open to suggestions. I'll illustrate one of the main reasons I wanted to pursue this approach via an example:
What happens when we edit such a block? Can we guarantee that we'll allow a reasonable editing experience and preserve the HTML content on save? Why is this important? Granted, you probably shouldn't directly edit something inside a There is also a bigger reason: if we can make this work really well, we can apply the same logic and user experience (validate markup, otherwise fall back to
Here is a very rough idea that would cover the example above. import { OneOrMore, HTML, Param } from 'blocks/schema';
registerBlock( 'core/text', {
schema: (
<OneOrMore>
<p styleAlign={ <Param name="align" /> }>
<HTML
paramName="content"
allowedTags=[ 'strong', 'em', 'b', 'i', 'span', ... ]
/>
</p>
</OneOrMore>
), ...
} ); I'm not crazy about the
I think it also has the potential to be a very nice developer experience: rather than having to implement their own querying/validation/serialization logic, block implementers can specify declaratively what kinds of markup they support and how to parse it into the parameters they need, and the syntax looks a lot like the HTML that the blocks are designed to output. General thoughts? Where do you see something like this working well or not? There is a lot to be done here, and it probably won't work for all use cases. I would also probably start implementing this with a simpler block, as handling arrays of tags will probably be fairly tricky. |
04739ae
to
4554985
Compare
Note that we only need this for the HTML parsing code.
Also adds a test for attribute parsing.
I've rebased and done some further refactoring as of 4554985:
|
In your proposal, when covering the case that a text block has foreign content (images in your example), what would happen? I suppose if it fails to pass the schema, it simply defers to the fallback handler? I do quite like how it unifies the parsing and serialization steps. A few aspects I'm not so certain of:
I see the appeal in being explicit about where variations can occur, rather than blindly querying into the DOM / markup and potentially finding a value. Using JSX Is neat but maybe not entirely necessary (might be an alternative with a plain JavaScript object description or string paths). I've avoided it mainly because of XML, but perhaps if we are to go the JSX route, there's some inspiration to be taken from XSLT (mapping between JSX and a JS object instead of HTML and XML). |
Another interesting variation is on the node types themselves. For example, the heading block will need to have its level as an attribute, and the list whether it is ordered or unordered. These are driven by the nodeName in the markup, so we'd need some way of schematizing variable child types. |
Yes. As far as the other points you raised, you've definitely picked up on some of the concerns I've had with this as well...
In this case they would have to be arrays. Maybe a bit counterintuitive.
I would expect that it would accept arrays and verify that they have the same length.
Ignore the
This could be any of the types you mentioned in the comment you linked (string, object tree...), this is definitely something that needs further thought.
Definitely, @iseulde has suggested something like this too (though still need to figure out how to parameterize it, and block implementers would have to learn more in order to use it): [ 'figure', {},
[ 'img', { alt: 'Alternative text.', src: 'https://example.com' } ],
[ 'figcaption', {}, 'This is ', [ 'em', {}, 'some' ], ' caption.' ]
]
Yes, maybe something like |
@nylen would it be possible to build this and merge it without replacing the existing one? It'd be great if we could just switch between one and the other at will to compare them. |
@mtias good idea, done. Both parsers are present and can be switched out by editing the The PegJS-based approach is very simplistic and will not support any of the follow-up steps we've discussed here unless it receives significant enhancements along the lines of https://github.com/Automattic/wp-post-grammar. Even then, TinyMCE already does much of the work we will need for us. |
can you link to the follow-up steps you think it can't handle? it is a very simple approach but it's also incredibly capable. |
I didn't say "can't handle", rather "will not support without significant enhancements". A couple of the things we will have to build:
I'd like to merge this and keep moving. |
this is where I'm uneasy. we have to make a choice when it's invalid and right now we just leave it up to TinyMCE whose behavior is completely unspecified. regardless, maybe the markup is intended on being bad (terrible idea, but who knows… maybe they are trying to demonstrate a browser quirk) - well, try as hard as they might, TinyMCE botches up what they actually intended. this behavior is present with I'm happy to preserve backwards compatibility, but I don't think we should bring over the worst part of that: implicitly making judgments about code whose meaning is impossible to infer. if the text comes into a block it must be valid to whatever extent valid means. I also want a no-edit session to be an identity of the input.
what's so good about unspecified behavior anyway? wouldn't it be better to simply notify that something has become invalid and provide an undo button to revert the breaking change? just not allow it to be possible to create invalid content |
"completely unspecified" - "impossible to infer". These statements are a bit too strong: https://github.com/tinymce/tinymce/blob/542c74d/src/core/src/main/js/html/SaxParser.js#L163 Remember also that we're building a WordPress editor that we plan to have millions of people use to edit their existing content. We don't necessarily have the luxury of making choices like that... |
they aren't, and they aren't a jab or insult either
having an implementation in code is just that - an implementation. it tells us what it does but not what the goal is. even in the linked code most of it is dealing with a stack structure. it's very difficult to build an alternative implementation matching the TinyMCE parser because it doesn't distinguish between "this is how we happen to be doing it" and "this is what it needs to do"
what is the meaning of
the problem is that because the document is ambiguous there is no possible way to determine which of several alternatives was meant. we have to make assumptions and make choices but those choices aren't specified (unless we lay out formal rules that say what to do in all of these invalid situations) that's just a simple example though how about the point is that if we're going for blocks and have the opportunity to push all edits inside a defined editor (leaving the old TinyMCE block as a type of block for handling backwards compatibility) then why enable storing ambiguity and non-determinism into a post?
Not sure which choices in particular you are referring to, but none of us are trying to take away existing content. Even proper schemas won't help existing content because it's just HTML + Shortcode stew. The only meaningful way to upgrade a block from soup to something specific is if we have a valid parse be it via a schema or JSON structure or whatever… |
Let's back up a bit... I would consider a grammar-based parser to be a superior solution. However what I implemented here was simply the easiest way to get a parse tree out of post content in a way that I know matches existing WP behavior. There are undoubtedly a ton of other edge cases there, and unfortunately I'm not the person to ask about what they are. @azaozz and @iseulde come to mind as people who would know about what kinds of content we need to be especially careful with when parsing. This PR is a first (second?) draft that will serve well for building on top of. Even after this, it's not set in stone and it's really easy to switch out the parser. Let's replace it with something more robust as soon as we can. We could use your help there as I'm also not an expert in this area. |
@spocke probably knows best which things TinyMCE's SAX parser does handle that others (even DOM parser) do not. |
return attrs; | ||
}, {} ) } | ||
|
||
WP_Block_Attribute | ||
= name:WP_Block_Attribute_Name ":" value:WP_Block_Attribute_Value | ||
{ return [ name, value ] } | ||
{ return { name, value }; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an issue where this file isn't transpiled, and UglifyJs chokes on this line. You can see the error when running npm run build
. The workaround was just to use ES5 in the file, otherwise we need to investigate how to apply Babel to the file.
This PR is intended to start addressing #391 and is also closely related to the discussion at #413. I've replaced the PegJS grammar parsing with a way of parsing post content into a tree using TinyMCE's parsing utilities.
TinyMCE already uses its internal parsing utilities in our implementation, whenever
setContent
is called. So this shouldn't cause issues related to multiple parsing implementations having conflicting behavior.For now, to preserve compatibility with existing blocks, this code parses the content, serializes blocks to strings, then passes these strings off to the individual blocks, much like what happened before.
Things that need to be done before this PR is not breaking existing functionality:
freeform
blockNext steps using this work as a foundation:
setContent
that accepts a tree rather than a string; this will mean that we are only parsing content once.)freeform
(or the next block that validates against the schema?) if there is unexpected markup present.