Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix block validation: wrap in exception handler for malformed HTML #8304

Merged
merged 7 commits into from
Aug 21, 2018

Conversation

johngodley
Copy link
Contributor

If you paste malformed HTML into a block the HTML tokenizer can break.

The underlying bug is in the simple-html-tokenizer package. Ideally it will be fixed there, but in the meantime (and also to protect against other unknown errors) this PR wraps the tokenizer in an exception handler.

The fixed behaviour is that the block will show an invalid block warning message, allowing it to be cleaned up in a controlled way.

As far as I can tell this is the only place in Gutenberg where simple-html-tokenizer is used and so I haven't tried to generally wrap the library.

How has this been tested?

An additional unit test has been added which tests the breakage. You can manually verify the problem by:

  • Open developer console
  • Add a quote block
  • Edit HTML of the quote block and replace the content with <blockquote class="wp-block-quote">fsdfsdfsd<p>fdsfsdfsdd</pfd fd fd></blockquote>
  • Look in developer console for this error:
    error
  • Verify that after applying this PR the error is no longer shown and an invalid block warning is shown

Types of changes

Non breaking bug fix

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • My code has proper inline documentation.

@johngodley johngodley added [Feature] Block API API that allows to express the block paradigm. [Feature] Blocks Overall functionality of blocks labels Jul 30, 2018
@johngodley
Copy link
Contributor Author

@aduth, I think you worked on this bit of code so dragging you in for thoughts again!

Copy link
Member

@aduth aduth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find. In 883687d I pushed a failing test case. When actual and expected are both malformed, the function will wrongly return true since there's nothing to iterate from the empty return array of getHTMLTokens.

johngodley and others added 5 commits August 12, 2018 08:58
If you paste some malformed HTML into a block the HTML tokenizer can break. Wrapping the tokenizer in an exception handler means we can control the error
Sentences ending in periods. Empty newline between description and parameters. Precise return type.
Add a check if both strings are invalid HTML
Empty strings are considered equivalent
@johngodley johngodley force-pushed the fix/invalid-html-crash branch from 883687d to 192160e Compare August 12, 2018 09:39
@johngodley
Copy link
Contributor Author

Good catch. Changed the logic so it will fail if one or both strings are invalid.

*
* @param {string} html HTML string to tokenize.
*
* @return {Object[]|boolean} Array of valid tokenized HTML elements, or false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation should describe in what cases the consumer would expect the function to return false. Currently it's not clear that this is tied to capturing errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: Why was false chosen, vs. another more semantically meaningful value for empty like null or undefined ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was false chosen, vs. another more semantically meaningful value for empty like null or undefined

Force of habit from PHPland. Swapped to null

@@ -390,7 +409,12 @@ export function getNextNonWhitespaceToken( tokens ) {
*/
export function isEquivalentHTML( actual, expected ) {
// Tokenize input content and reserialized save content
const [ actualTokens, expectedTokens ] = [ actual, expected ].map( tokenize );
const [ actualTokens, expectedTokens ] = [ actual, expected ].map( getHTMLTokens );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to think this change could be simplified to something like:

try {
	const [
		actualTokens,
		expectedTokens,
	] = [ actual, expected ].map( tokenize );
} catch ( error ) {
	return false;
}

Logging the warning about the specific string becomes a bit trickier. We could still have getHTMLTokens which catches the specific failure, logs, then throws up to the catch here.

Or we could keep as-is. My only thought was avoiding the overloaded return value, where false is a bit of a semantically ambiguous value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK the const's are scoped to the try block so I don't think is possible unless they were converted to let and defined outside, and it seemed better to keep the constness.

I've cleaned it up a bit already, and happy to make a further change, but otherwise will go as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I guess the corrected form would look something closer to:

let actualTokens, expectedTokens;
try {
	( [
		actualTokens,
		expectedTokens,
	] = [ actual, expected ].map( tokenize ) );
} catch ( error ) {
	return false;
}

Which starts to be a bit harder to read 😬 Good as it is.

Copy link
Member

@aduth aduth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think we should wait until after the 3.6 release to merge this one. Also could use an update to the return type.

*
* @param {string} html HTML string to tokenize.
*
* @return {Object[]|boolean} Array of valid tokenized HTML elements, or null on error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't return a boolean type anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh.

👍 on 3.6

@aduth aduth added this to the 3.7 milestone Aug 16, 2018
@aduth
Copy link
Member

aduth commented Aug 20, 2018

In some final testing, I happened to stumble upon an interesting issue.

  1. Mangle a block so that it becomes invalid HTML which would throw at this step
  2. Observe (as intended with these changes) that the editor doesn't break, it presents the "modified externally prompt"
  3. Press "Keep as HTML"
  4. See your new HTML block with the markup preserved
  5. Press Save
  6. Reload the editor
  7. Observe: The same "modified externally" warning

I'd have expected that since I already chose to "Keep as HTML", I would no longer be presented with this prompt. I assume what's happening is that the malformed markup is triggering the invalidation on the HTML block.

I'm not entirely sure how we accommodate this:

  • Present a different initial message, not reusing "modified externally" but being more clear about the HTML being invalid?
    • If so, what options do we present?
  • Ignore validation on specific block types?
    • It doesn't really seem sensible that we care what type of markup the user introduces into their HTML block.

Thoughts?

I'm also open to merging this one as-is, since the current state is an improvement over the previous error.

@johngodley
Copy link
Contributor Author

Ah yes, interesting!

I tried it with the previous code and the editor just white-screens. I'll merge this as an improvement over the current situation, and file a ticket for the invalid-on-load separately.

A different error message sounds a good idea regardless. I also like the idea of not validating certain blocks - if the user wants the invalid HTML then that's fine. I've been looking at tweaking the validation for other blocks, so this would be a good use case.

@johngodley johngodley merged commit 4f8ffab into master Aug 21, 2018
@johngodley johngodley deleted the fix/invalid-html-crash branch August 21, 2018 14:14
mcsf added a commit that referenced this pull request Aug 5, 2021
As of #22506 (May 2020), `isEquivalentHTML` short-circuits if its two
given strings are identical, and thus no longer treats invalid HTML
differently. Remove the mention of special handling, which had been
introduced in #8304 (August 2018).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Block API API that allows to express the block paradigm. [Feature] Blocks Overall functionality of blocks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants