Improve decoding of HTML entities #5064
src/html.cr
Outdated
# see https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
private def self.decode_codepoint(codepoint)
  if 0x80 <= codepoint <= 0x9F
`case` switch might look neater here.
src/html.cr
Outdated
  "\uFFFD"
elsif 0xFDD0 <= codepoint <= 0xFDEF || # unicode noncharacters
      codepoint & 0xFFFF >= 0xFFFE || # last two of each plane (nonchars) disallowed
      (codepoint < 0x0020 && codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C) || # unicode control characters
`codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C` -> `!{0x0009, 0x000A, 0x000C}.includes?(codepoint)`
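The two spellings in this suggestion can be checked against each other. A minimal sketch, assuming hypothetical helper names (`chained_check` and `tuple_check` are illustrative and do not appear in src/html.cr):

```crystal
# Hypothetical helpers, used only to compare the two spellings from the review.

# Original spelling: three chained inequality tests.
def chained_check(codepoint : Int32) : Bool
  codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C
end

# Suggested spelling: membership test against a tuple literal.
def tuple_check(codepoint : Int32) : Bool
  !{0x0009, 0x000A, 0x000C}.includes?(codepoint)
end

# Compare both forms over the control range the surrounding condition guards.
agree = (0...0x20).all? { |cp| chained_check(cp) == tuple_check(cp) }
puts agree
```

Both forms express the same condition, so the choice between them is about readability rather than behaviour.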
spec/std/html_spec.cr
Outdated
end

it "unescapes characters above Char::MAX_CODEPOINT" do
  str = HTML.unescape("limit �")
I don't get it: if 10FFFF is not a character, then why should something right after it be a character?
That's a typo. It doesn't decode � but replaces it with the replacement character \uFFFD.
Question still stands.  -> but � -> \uFFFD - why?
 is a valid codepoint, yet not supported as a character reference because it's a Unicode noncharacter. Therefore it is not decoded. � is an invalid codepoint and is therefore replaced by the replacement character.
See Numeric character reference end state for details.
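The distinction drawn in this reply can be sketched as a small classifier. This is only an illustration of the behaviour described in the thread, not the PR's code: `classify_codepoint` and its symbol results are hypothetical names, and other cases (surrogates, zero, control characters) are deliberately left out.

```crystal
# Hypothetical helper illustrating the two cases discussed above.
# Names and return values are assumptions, not part of src/html.cr.
def classify_codepoint(codepoint : Int32) : Symbol
  if codepoint > Char::MAX_CODEPOINT
    # Outside the Unicode range: replaced by "\uFFFD".
    :replacement
  elsif (0xFDD0 <= codepoint <= 0xFDEF) || (codepoint & 0xFFFF) >= 0xFFFE
    # Valid codepoint, but a Unicode noncharacter: left undecoded.
    :not_decoded
  else
    :decoded
  end
end

puts classify_codepoint(0x10FFFF) # last codepoint of the last plane
puts classify_codepoint(0x110000) # one past Char::MAX_CODEPOINT
puts classify_codepoint(0x41)     # ordinary character
```

The range check comes first because a value above `Char::MAX_CODEPOINT` cannot be represented as a `Char` at all, whereas a noncharacter can.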
961dcd9 to 39244b4
This PR adds a few fixes for several edge cases that were broken from #5055 or before.
Codepoints between 0x80 and 0x9F are replaced with compatibility replacements for old numeric entities from Windows-1252.
The code is based on the HTML5 tokenizing guidelines and inspired by the implementations for Go and PHP.
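As an illustration of the Windows-1252 compatibility replacement, here is a minimal sketch. It assumes a hypothetical constant and helper (`WINDOWS_1252_SAMPLE`, `remap_c1`) and lists only three entries of the mapping, so it is not the table used in the PR.

```crystal
# Partial, illustrative table: numeric references in 0x80..0x9F are
# reinterpreted as the Windows-1252 character at that byte value.
# Only three entries are shown; the names are hypothetical.
WINDOWS_1252_SAMPLE = {
  0x80 => 0x20AC, # EURO SIGN
  0x85 => 0x2026, # HORIZONTAL ELLIPSIS
  0x99 => 0x2122, # TRADE MARK SIGN
}

# Returns the replacement codepoint, or the input unchanged when the
# sample table has no entry for it.
def remap_c1(codepoint : Int32) : Int32
  WINDOWS_1252_SAMPLE.fetch(codepoint, codepoint)
end

puts remap_c1(0x80).chr
puts remap_c1(0x41).chr
```

A lookup table keeps the remapping separate from the range checks, which is one reason a `case` or hash-based form may read more clearly than a chain of conditions.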