Distinguish user-input from structural HTML text #17789
In this patch I'm starting to propose some shifts that make a more intentional distinction between text that someone _typed_ into the editor and text that a component produces while rendering to HTML.

A motivation for this change comes from the code editor, or a paragraph block where someone enters valid HTML markup. We want `<hr />`, when typed in, to _always_ appear as `<hr />` when viewing the rendered page.

It happens that a combination of transformations in the editor and also within WordPress core can unintentionally mangle these user-input sequences. Sometimes it's because WordPress is trying to preserve them; sometimes it's because we're trying to filter out unwanted input.

In this patch I've created new `escapeHTML` and `unescapeHTML` functions whose roles are to serialize and unserialize all user input so that we can prevent that mangling later on. Currently they are implemented by replacing characters with their matching named character entities. In the future it may not stay this way.

This patch is incomplete but it's meant to start the journey.
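To make the idea concrete, here is a minimal sketch of what such a pair of functions could look like. It is not the code from this patch: the entity table, the choice of which characters to escape, and the single-pass regex approach are all assumptions made for illustration. Only the names `escapeHTML` and `unescapeHTML` come from the description above.

```javascript
// Hypothetical sketch, NOT the implementation in this patch.
// Assumption: only these four characters need escaping.
const TO_ENTITY = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
const FROM_ENTITY = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"' };

// Serialize user-typed text: replace each special character with its
// named character entity in a single pass, so nothing is escaped twice.
function escapeHTML( text ) {
	return text.replace( /[&<>"]/g, ( ch ) => TO_ENTITY[ ch ] );
}

// Unserialize: map each known entity back to its character, again in a
// single pass, so "&amp;lt;" becomes "&lt;" rather than "<".
function unescapeHTML( text ) {
	return text.replace( /&(?:amp|lt|gt|quot);/g, ( entity ) => FROM_ENTITY[ entity ] );
}

const typed = '<hr />';
const stored = escapeHTML( typed ); // '&lt;hr /&gt;'
console.log( unescapeHTML( stored ) === typed ); // true
```

The single-pass replacement is what keeps the round trip lossless in this sketch: text that already looks like an entity when typed (for example `&lt;`) is stored as `&amp;lt;` and comes back unchanged.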
It looks like we are escaping the formatting markup of serialized And it shows when rendering a post as well: Is there an easy way to avoid escaping the formatting? cc @ellatrix?
@dmsnell Apologies for the late reply. Could you elaborate on the use case? You want to preserve
I think the goal is to fully escape everything the user typed on the client, avoiding markup used for structure or formatting that the editor inserts. This would provide a clear distinction between say a So, to avoid bugs like these: #16252.
Thanks @epiqueras - that's correct. @ellatrix you wrote this elsewhere…
Your quote probably sums up the core problem better than my words have: when we store the Once we serialize the data, though, that currently gets mushed together. This can be fine if we can reliably unserialize the content back into memory without loss. Currently, though, this is a lossy process, and that's why this PR exists. In other words if I type Although I know there are still issues with this PR, what I'd like to do is augment the serialization so that we preserve the text that someone enters intentionally into the blocks so that it comes back the way it was entered. This implies that what we save may not match the string they entered because we're preserving the final output and the output in the block editor's memory. The above might serialize as Hope this makes more sense.
I'm still not following entirely. If I type the example you gave, I'll get
I don't see either of these problems. I see both correctly rendered on the front end.
I created an alternative PR, #17994, to fix the ampersand issues. I think that should be a complete fix. I don't think we have to escape anything else for editable HTML. Would appreciate some testing!
#17994 has been merged. Can this PR be closed?
Thanks for the work on this @dmsnell. I'm glad we figured out a fix.
Goals
Whenever someone uses `<PlainText />` or `<RichText />` they shouldn't have to think about or worry about serialization/unserialization. Regardless, the content which someone types into the editor should be preserved visually on page render.