Distinguish user-input from structural HTML text #17789

dmsnell · 2019-10-06T00:42:20Z

Description

In this patch I'm starting to propose some shifts that make a more
intentional distinction between text that someone typed into the
editor and text that a component produces while rendering to HTML.

A motivation for this change comes from the code editor, or a paragraph
block where someone enters valid HTML markup. We want `<hr />`, when
typed in, to always appear as `<hr />` when viewing the rendered page.

It happens that a combination of transformations in the editor and
within WordPress core can unintentionally mangle these user-input
sequences. Sometimes it's because WordPress is trying to preserve them;
sometimes it's because we're trying to filter out unwanted input.

In this patch I've created new `escapeHTML` and `unescapeHTML` functions
whose roles are to serialize and unserialize all user input so that we
can prevent that mangling later on. Currently they are implemented by
replacing characters with their matching named character entities. In
the future it may not stay this way.
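
As a rough illustration of the idea (not the code in this patch), an escape/unescape pair built on named character entities could look like the sketch below. The function names `escapeUserInput` and `unescapeUserInput` and the three-character entity table are assumptions made for the example; the patch's actual character set is not shown in this thread.

```javascript
// Hypothetical entity table, chosen for illustration only.
// '&' is listed first so that escaping never double-encodes
// the ampersands introduced by the later replacements.
const ENTITIES = [
	[ '&', '&amp;' ],
	[ '<', '&lt;' ],
	[ '>', '&gt;' ],
];

// Replace every special character in user-typed text with its named entity.
function escapeUserInput( text ) {
	return ENTITIES.reduce(
		( out, [ char, entity ] ) => out.split( char ).join( entity ),
		text
	);
}

// Reverse the mapping. The table is walked backwards so '&amp;' is
// decoded last; otherwise '&amp;gt;' would be decoded twice, down to '>'.
function unescapeUserInput( html ) {
	return [ ...ENTITIES ].reverse().reduce(
		( out, [ char, entity ] ) => out.split( entity ).join( char ),
		html
	);
}

console.log( escapeUserInput( '<hr />' ) ); // &lt;hr /&gt;
console.log( escapeUserInput( '&gt;' ) ); // &amp;gt;
console.log( unescapeUserInput( escapeUserInput( '<hr />' ) ) ); // <hr />
```

The property that matters is the round trip: whatever was typed comes back unchanged after escaping and unescaping, including text that already looks like an entity.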

This patch is incomplete but it's meant to start the journey.

Goals

Whenever someone uses <PlainText /> or <RichText /> they shouldn't have to
think about or worry about serialization/unserialization. Regardless, the content
which someone types into the editor should be preserved visually on page render.

epiqueras · 2019-10-14T21:11:51Z

It looks like we are escaping the formatting markup of serialized RichText content:

And it shows when rendering a post as well:

Is there an easy way to avoid escaping the formatting? cc @ellatrix?

ellatrix · 2019-10-16T07:06:35Z

@dmsnell Apologies for the late reply. Could you elaborate on the use case? You want to preserve <hr /> typed in the HTML view of a RichText field (e.g. a paragraph)? And render it as a line on the front end? Or are you saying we should escape the HTML and render it escaped on the front end?

epiqueras · 2019-10-16T17:02:59Z

I think the goal is to fully escape everything the user typed on the client, while leaving alone the markup that the editor inserts for structure or formatting. This would provide a clear distinction between, say, a `<` from an `a` tag inserted by the editor and one typed in a math equation by a user.

So, to avoid bugs like these: #16252.

dmsnell · 2019-10-16T18:16:09Z

Thanks @epiqueras - that's correct. @ellatrix you wrote this elsewhere…

> Later on we settled on using it only internally and continuously serialise the value to HTML to pass to the block, to keep things simple and backward compatible.

Your quote probably sums up the core problem better than my words have: when we store the RichText data in memory, we keep each character typed and each string pasted in. The text values themselves are kept separate from the attributes that describe the markup applied to that text.

Once we serialize the data though that currently gets mushed together. This can be fine if we can reliably unserialize the content back into memory without loss. Currently though this is a lossy process and that's why this PR exists.
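
To make the separation concrete, here is a minimal sketch under simplifying assumptions. The `{ text, formats }` shape, the `escapeText` helper and the `serialize` function are hypothetical and much simpler than the real RichText value; format ranges are assumed to be sorted and non-overlapping. The point is only that typed characters get escaped while editor-produced tags do not.

```javascript
// Hypothetical in-memory value: raw typed text plus formatting ranges.
const value = {
	text: 'if a < b then stop',
	formats: [ { tag: 'strong', start: 3, end: 8 } ],
};

// Escape only what the user typed; '&' goes first to avoid double-encoding.
function escapeText( text ) {
	return text
		.replace( /&/g, '&amp;' )
		.replace( /</g, '&lt;' )
		.replace( />/g, '&gt;' );
}

// Emit structural tags verbatim and escape the text between them.
function serialize( { text, formats } ) {
	let html = '';
	let cursor = 0;
	for ( const { tag, start, end } of formats ) {
		html += escapeText( text.slice( cursor, start ) );
		html += `<${ tag }>${ escapeText( text.slice( start, end ) ) }</${ tag }>`;
		cursor = end;
	}
	return html + escapeText( text.slice( cursor ) );
}

console.log( serialize( value ) );
// if <strong>a &lt; b</strong> then stop
```

In this model the `<` typed by the user and the `<` that opens the `strong` tag never get confused, because only the former passes through `escapeText`.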

In other words, if I type `<strong>Really Important</strong>` inside of a RichText block or if I type `5 < 10 > 2` inside a code block, then I expect those strings to render exactly like that on page view. Right now those HTML tags and symbols get mangled by WordPress.

Although I know there are still issues with this PR, what I'd like to do is augment the serialization so that the text someone intentionally enters into a block comes back the way it was entered. This implies that what we save may not match the string they entered, because what we're preserving is the final rendered output and the value in the block editor's memory.

The above might serialize as `&lt;strong&gt;Really Important&lt;&soli;strong&gt;` and similarly for the code block. Most of the work, I think, involves determining when WordPress is mangling the text and transforming the input to avoid that. Right now, because we store `&gt;` as the raw HTML, we get `>` rendered on page view. I'd rather we store `&amp;gt;` in the raw HTML and have `&gt;` rendered.

Hope this makes more sense.

ellatrix · 2019-10-17T09:53:03Z

I'm still not following entirely. If I type the example you gave, I'll get Really important!. The typed text is escaped. What do you expect instead?

ellatrix · 2019-10-17T10:12:21Z

> In other words if I type `<strong>Really Important</strong>` inside of a RichText block or if I type `5 < 10 > 2` inside a code block then I expect those strings to render exactly like that on page view. Right now those HTML tags and symbols get mangled by WordPress.

I don't see either of these problems. I see both correctly rendered on the front end.

ellatrix · 2019-10-17T11:40:09Z

I created an alternative PR, #17994, to fix the ampersand issues. I think that should be a complete fix. I don't think we have to escape anything else for editable HTML. Would appreciate some testing!

mcsf · 2019-11-07T11:40:26Z

#17994 has been merged. Can this PR be closed?

ellatrix · 2019-11-07T17:23:25Z

Thanks for the work on this @dmsnell. I'm glad we figured out a fix.

dmsnell requested review from mkaz, ellatrix and jorgefilipecosta October 6, 2019 00:42

dmsnell requested review from aduth, ajitbohra, daniloercoli, etoledom, SergioEstevao, Soean, talldan and youknowriad as code owners October 6, 2019 00:42

ellatrix mentioned this pull request Oct 17, 2019

Escape Editable HTML #17994

Merged

5 tasks

dmsnell closed this Nov 7, 2019

dmsnell deleted the fix/distinguish-user-input-from-html branch November 7, 2019 14:18

aduth mentioned this pull request Mar 17, 2020

Editor: PostTitle: Decode entities for display #20887

Closed

7 tasks

dmsnell mentioned this pull request Mar 6, 2024

Editing code block content in visual editor, replaces line breaks with br tags, output is on single line now. #59548

Closed
