Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish user-input from structural HTML text #17789

Closed
wants to merge 1 commit into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Oct 6, 2019

Description

In this patch I'm starting to propose some shifts that make a more
intentional distinction between text that someone typed in to the
editor and text that a component produces while rendering to HTML.

A motivation for this change comes from the code editor, or a paragraph
block where someone enters valid HTML markup. We want <hr />, when
typed in, to always appear as <hr /> when viewing the rendered page.

It happens that a combination of transformations in the editor and also
within the WordPress core can unintentionally mangle these user-input
sequences. Sometimes it's because WordPress is trying to preserve them;
sometimes it's because we're trying to filter-out unwanted input.

In this patch I've created new escapeHTML and unescapeHTML functions
whose roles are to serialize and unserialize all user input so that we
can prevent that mangling later on. Currently they are implemented by
replacing characters with their matching named character entities. In
the future it may not stay this way.

This patch is incomplete but it's meant to start the journey.

Goals

Whenever someone uses <PlainText /> or <RichText /> they shouldn't have to
think about or worry about serialization/unserialization. Regardless, the content
which someone types into the editor should be preserved visually on page render.

In this patch I'm starting to propose some shifts that make a more
intentional distinction between text that someone _typed_ in to the
editor and text that a component produces while rendering to HTML.

A motivation for this change comes from the code editor, or a paragraph
block where someone enters valid HTML markup. We want `<hr />`, when
typed in, to _always_ appear as `<hr />` when viewing the rendered page.

It happens that a combination of transformations in the editor and also
within the WordPress core can unintentionally mangle these user-input
sequences. Sometimes it's because WordPress is trying to preserve them;
sometimes it's because we're trying to filter-out unwanted input.

In this patch I've created new `escapeHTML` and `unescapeHTML` functions
whose roles are to serialize and unserialize all user input so that we
can prevent that mangling later on. Currently they are implemented by
replacing characters with their matching named character entities. In
the future it may not stay this way.

This patch is incomplete but it's meant to start the journey.
@epiqueras
Copy link
Contributor

It looks like we are escaping the formatting markup of serialized RichText content:

Screen Shot 2019-10-14 at 1 45 30 PM

And it shows when rendering a post as well:

Screen Shot 2019-10-14 at 2 08 20 PM

Is there an easy way to avoid escaping the formatting? cc @ellatrix?

@ellatrix
Copy link
Member

@dmsnell Apologies for the late reply. Could you elaborate on the use case? You want to preserve <hr /> typed in the HTML view of a RichText field (e.g. a paragraph)? And render it as a line on the front end? Or are you saying we should escape the HTML and render it escaped on the front end?

@epiqueras
Copy link
Contributor

I think the goal is to fully escape everything the user typed on the client, avoiding markup used for structure or formatting that the editor inserts. This would provide a clear distinction between say a < from an a tag inserted by the editor and one typed in a math equation by a user.

So, to avoid bugs like these: #16252.

@dmsnell
Copy link
Member Author

dmsnell commented Oct 16, 2019

Thanks @epiqueras - that's correct. @ellatrix you wrote this elsewhere…

Later on we settled on using it only internally and continuously serialise the value to HTML to pass to the block, to keep things simple and backward compatible.

Your quote probably sums up the core problem better than my words have: when we store the RickText data in memory it stores each character typed and each string pasted in. There's a separation between the text values themselves and the attributes which indicate markup attributes to that text.

Once we serialize the data though that currently gets mushed together. This can be fine if we can reliably unserialize the content back into memory without loss. Currently though this is a lossy process and that's why this PR exists.

In other words if I type <strong>Really Important</strong> inside of a RichText block or if I type <code>5 < 10 > 2</code> inside a code block then I expect those strings to render exactly like that on page view. Right now those HTML tags and symbols get mangled by WordPress.

Although I know there are still issues with this PR what I'd like to do is augment the serialization so that we preserve the text that someone enters intentionally into the blocks so that it comes back the way it was entered. This implies that what we save may not match the string they entered because we're preserving the final output and the output in the block editor's memory.�

The above might serialize as &lt;strong&gt;Really Important&lt;&soli;strong&gt; and similarly for the code block. Most of the work I think involves determining when WordPress is mangling the text and transforming the input to avoid that. Right now because we store > as the raw HTML we get &gt; rendered on page view. I'd rather us store &gt; in the raw HTML and have > rendered.

Hope this makes more sense.

@ellatrix
Copy link
Member

I'm still not following entirely. If I type the example you gave, I'll get <p>&lt;strong>Really important!&lt;/strong></p>. The typed text is escaped. What do you expect instead?

@ellatrix
Copy link
Member

In other words if I type Really Important inside of a RichText block or if I type 5 < 10 > 2 inside a code block then I expect those strings to render exactly like that on page view. Right now those HTML tags and symbols get mangled by WordPress.

I don't see either of these problems. I see both correctly rendered on the front end.

@ellatrix ellatrix mentioned this pull request Oct 17, 2019
5 tasks
@ellatrix
Copy link
Member

I created an alternative PR, #17994, to fix the ampersand issues. I think that should be a complete fix. I don't think we have to escape anything else for editable HTML. Would appreciate some testing!

@mcsf
Copy link
Contributor

mcsf commented Nov 7, 2019

#17994 has been merged. Can this PR be closed?

@dmsnell dmsnell closed this Nov 7, 2019
@dmsnell dmsnell deleted the fix/distinguish-user-input-from-html branch November 7, 2019 14:18
@ellatrix
Copy link
Member

ellatrix commented Nov 7, 2019

Thanks for the work on this @dmsnell. I'm glad we figured out a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants