Improve decoding of HTML entities #5064

straight-shoota · 2017-09-30T14:43:35Z

This PR fixes several edge cases that have been broken since #5055 or earlier:

named character references that include digits are replaced
non-matching named character references with a trailing semicolon are left in place
invalid numeric character references are replaced with the replacement character
numeric character references resulting in noncharacters or control characters (except space) are left in place
numeric character references between 0x80 and 0x9F are replaced with the compatibility replacements for old numeric entities from Windows-1252
the API documentation states that this method recognizes character references from HTML5; other doctypes have different entities and even different rules for numeric references

The code is based on the HTML5 tokenizing guidelines and inspired by the implementations for Go and PHP.
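
To illustrate the Windows-1252 rule from the list above, here is a minimal sketch. It is not the code from this PR: the constant `WINDOWS_1252_SAMPLE` and the method `windows_1252_replacement` are hypothetical names, and the table holds only three of the entries defined by the HTML5 spec.

```crystal
# Hypothetical sketch, not the code merged in this PR.
# A few entries of the HTML5 compatibility table that maps code points
# in 0x80..0x9F to the characters Windows-1252 assigns to those bytes.
WINDOWS_1252_SAMPLE = {
  0x80 => '\u20AC', # EURO SIGN
  0x85 => '\u2026', # HORIZONTAL ELLIPSIS
  0x99 => '\u2122', # TRADE MARK SIGN
}

# Returns the compatibility replacement for *codepoint*, or nil when the
# code point is outside 0x80..0x9F or the sample table has no entry for it.
def windows_1252_replacement(codepoint : Int32) : Char?
  return nil unless 0x80 <= codepoint <= 0x9F
  WINDOWS_1252_SAMPLE[codepoint]?
end

puts windows_1252_replacement(0x80) # prints €
```

With a table like this, `&#x80;` would decode to the euro sign rather than to a C1 control character. What to do when the lookup returns nil is left open in this sketch.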


Sija · 2017-09-30T16:11:18Z

src/html.cr

+
+  # see https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
+  private def self.decode_codepoint(codepoint)
+    if 0x80 <= codepoint <= 0x9F


case switch might look neater here.

Sija · 2017-09-30T16:12:00Z

src/html.cr

+      "\uFFFD"
+    elsif 0xFDD0 <= codepoint <= 0xFDEF ||  # unicode noncharacters
+          codepoint & 0xFFFF >= 0xFFFE ||  # last two of each plane (nonchars) disallowed
+          (codepoint < 0x0020 && codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C) ||  # unicode control characters


codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C

->

!{0x0009, 0x000A, 0x000C}.includes?(codepoint)
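
A quick way to see that the two forms agree is to compare them over the control-character range. This is only a sketch; the method names `chained_check` and `tuple_check` are made up for illustration.

```crystal
# Original form: three chained inequality tests.
def chained_check(codepoint : Int32) : Bool
  codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C
end

# Suggested form: membership test against a tuple literal.
def tuple_check(codepoint : Int32) : Bool
  !{0x0009, 0x000A, 0x000C}.includes?(codepoint)
end

# Both predicates should agree for every code point below 0x20.
agree = (0...0x20).all? { |cp| chained_check(cp) == tuple_check(cp) }
puts agree # prints true
```

Both return false only for TAB (0x09), LF (0x0A) and FF (0x0C), so the change affects readability only.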

oprypin · 2017-09-30T18:14:54Z

spec/std/html_spec.cr

+    end
+
+    it "unescapes characters above Char::MAX_CODEPOINT" do
+      str = HTML.unescape("limit &#x110000;")


I don't get it, if 10FFFF is not a character, then why should something right after it be a character

That's a typo. It doesn't decode &#x110000; but replaces it with replacement character \uFFFD.

Question still stands. &#x10FFFF; -> &#x10FFFF; but &#x110000; -> \uFFFD - why?

&#x10FFFF; is a valid codepoint, yet not supported as a character reference because it's a unicode noncharacter. Therefore it is not decoded.
&#x110000; is an invalid codepoint and therefore replaced by the replacement character.

See Numeric character reference end state for details.
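
The distinction between `&#x10FFFF;` and `&#x110000;` can be sketched as a small classifier. This is an assumption-based illustration, not the PR's implementation: the `ReferenceAction` enum and the `classify` method are hypothetical, and only the two cases discussed here are handled (control characters, surrogates and NUL are ignored).

```crystal
# Hypothetical classification of a numeric character reference.
enum ReferenceAction
  Decode  # replace the reference with the character it names
  Keep    # leave the reference text in place
  Replace # substitute U+FFFD REPLACEMENT CHARACTER
end

def classify(codepoint : Int32) : ReferenceAction
  if codepoint > Char::MAX_CODEPOINT
    # Beyond U+10FFFF there is no code point at all.
    ReferenceAction::Replace
  elsif 0xFDD0 <= codepoint <= 0xFDEF || (codepoint & 0xFFFF) >= 0xFFFE
    # A real code point, but a Unicode noncharacter.
    ReferenceAction::Keep
  else
    # Everything else is treated as decodable in this simplified sketch.
    ReferenceAction::Decode
  end
end

puts classify(0x10FFFF) # prints Keep
puts classify(0x110000) # prints Replace
```

The mask `codepoint & 0xFFFF` looks only at the low 16 bits, which is how the last two code points of every plane (such as U+FFFE, U+1FFFF and U+10FFFF) are caught by one test.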

straight-shoota · 2017-09-30T19:04:09Z

@Sija comment
Thanks, case looks better I think.
Unfortunately I couldn't put codepoint & 0xFFFF >= 0xFFFE in a when clause (maybe with some case-fu? but probably not possible). I moved all checks for disallowed characters into the else part to keep them together.

straight-shoota added 4 commits September 30, 2017 14:13

Improve method doc

940aac9

fix invalid entity with trailing semicolon

1252af4

fix entites with numerical characters

5a02383

Improve decoding of numerical character references according to HTML5…

a149f70

… spec

Sija reviewed Sep 30, 2017

Use case instead of if branches

9078c02

oprypin reviewed Sep 30, 2017

straight-shoota added 3 commits September 30, 2017 20:47

group all disallowed codepoints outside case statement

2ebb0b2

fix typo

c66b07e

Simplify branches

5637868

replacement character as Char

39244b4

straight-shoota force-pushed the jm-html-entities branch from 961dcd9 to 39244b4 Compare September 30, 2017 20:23

asterite approved these changes Oct 3, 2017

RX14 approved these changes Oct 4, 2017

RX14 added kind:feature topic:stdlib labels Oct 4, 2017

RX14 added this to the Next milestone Oct 4, 2017

RX14 merged commit 17ac8a2 into crystal-lang:master Oct 4, 2017

straight-shoota deleted the jm-html-entities branch October 4, 2017 08:40
