Add HTML.unescape [Closes #3107] #3374

dukex · 2016-10-03T01:13:13Z

Based on the Ruby implementation CGI::unescapeHTML and the comments on #3226, I made this implementation of HTML.unescape.

~~I tried to translate the unescape for hexadecimal codes without success~~
~~Can someone help me?~~ Thanks @chaniks

chaniks · 2016-10-03T09:43:32Z

Use $1.to_i(16) in place of Ruby's $1.hex.

dukex · 2016-10-03T11:01:53Z

❤️ @chaniks thank you

asterite · 2016-10-05T22:35:34Z

src/html.cr

+        end
+      when /\A#x([0-9a-f]+)\z/i
+        n = $1.to_i(16)
+        if n < charlimit


I think this should be n <= charlimit, since 0x10ffff is a valid Unicode code point.

Done, thanks!
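To illustrate the inclusive bound discussed above, here is a minimal sketch. The helper `valid_codepoint?` is hypothetical and not part of the PR; it only shows that 0x10ffff itself must pass the check while 0x110000 must not.

```crystal
# Hypothetical helper, not part of the PR: inclusive upper-bound check.
def valid_codepoint?(n : Int32) : Bool
  0 <= n <= 0x10ffff
end

puts valid_codepoint?(0x10ffff) # the last code point is accepted
puts valid_codepoint?(0x110000) # one past the limit is rejected

# Int#chr enforces the same upper bound and raises ArgumentError beyond it.
last = 0x10ffff.chr
puts last.ord == 0x10ffff

begin
  0x110000.chr
rescue ArgumentError
  puts "0x110000 is not a Char"
end
```

With `n < charlimit`, the entity &#x10ffff; would be left unescaped even though it names a representable character.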

asterite · 2016-10-05T22:36:50Z

src/html.cr

+    charlimit = 0x10ffff
+
+    string.gsub(/&(apos|amp|quot|gt|lt|\#[0-9]+|\#[xX][0-9A-Fa-f]+);/) do |string, _match|
+      match = _match[1].dup


this dup doesn't seem to be needed

chaniks

I guess I have too many thoughts on this PR.
I'd better skip them..

But two things:

Integer overflow (e.g. &#2147483648;, &#x80000000;)
  (can be generated with HTML.escape; relevant for reversibility)

And maybe specifying behaviors like &#0099999999; => &#99999999; and &#X111111; => &#x111111; in specs might be a good idea, if they are intended.
And maybe reuse Char::MAX_CODEPOINT.

I guess I should stop here.
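One way to sidestep both points is to return the original text untouched whenever a numeric reference cannot be decoded. The following is only a sketch under that assumption, not the code in this PR: `decode_numeric` and `unescape_numeric` are hypothetical names, and rejecting surrogate values is an extra assumption added so that `Int#chr` is never called on a value it might refuse.

```crystal
# Hypothetical helper, not the PR's code: decode the part between "&#" and ";".
# Returns nil when the reference cannot be decoded, so the caller can keep
# the original text exactly as written.
def decode_numeric(body : String) : String?
  n = if body.starts_with?('x') || body.starts_with?('X')
        body[1..].to_i?(16) # nil on Int32 overflow instead of raising
      else
        body.to_i?
      end
  return nil unless n
  # Assumption: reject values above the last code point and UTF-16 surrogates.
  return nil if n > Char::MAX_CODEPOINT || (0xd800 <= n <= 0xdfff)
  n.chr.to_s
end

# Hypothetical wrapper: handles numeric references only, no named entities.
def unescape_numeric(text : String) : String
  text.gsub(/&\#([0-9]+|[xX][0-9A-Fa-f]+);/) do |whole, match|
    decode_numeric(match[1]) || whole
  end
end

puts unescape_numeric("&#65;&#x41;")   # => AA
puts unescape_numeric("&#0099999999;") # left untouched
puts unescape_numeric("&#X111111;")    # left untouched
puts unescape_numeric("&#2147483648;") # left untouched, no overflow error
```

Whether out-of-range references should be preserved verbatim or normalized, as in the Ruby behavior quoted above, is a design decision; a spec would pin down whichever is intended.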

asterite · 2016-10-07T11:39:12Z

@dukex Thank you for this! ❤️

dukex · 2016-10-07T11:39:17Z

Hi @chaniks

Feel free to send all your comments here. I want to write the best code for Crystal and not just copy the Ruby code. I appreciate your points and I want to read more.

Thanks for your comments,

I'm using Char::MAX_CODEPOINT now. I had searched for a constant like this, but I suppose I needed to search more.

I updated the code to unescape spaces (&nbsp;).

I think the condition if n <= Char::MAX_CODEPOINT already handles the integer overflow error. Can you explain your comment in more detail?

Can you help me with behaviors like &#0099999999; => &#99999999;? Is it correct to change the value, or should the code return &#0099999999;?

chaniks · 2016-10-07T15:11:15Z

@dukex to_i will raise an error if the string representation is out of integer range. :')

&nbsp; is actually not a space, but I believe people will report it if it becomes an issue. (Sorry I wasn't specific. My writing skill is really bad.)

Please never worry about my other comments. I wanted to edit them, but it was a review so I couldn't.. 😢

p.s. If you want to dig more, this link might help: https://www.w3.org/TR/html401/sgml/entities.html
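The overflow point can be seen in a short sketch. It assumes the default `String#to_i`, which parses into an Int32; the range check on `n` never runs because the conversion fails first. `to_i?` is shown as one possible alternative, not as what the PR uses.

```crystal
# Int32::MAX is 2147483647, so this string does not fit in an Int32.
too_big = "2147483648"

begin
  too_big.to_i
rescue ex : ArgumentError
  puts "to_i raised: #{ex.message}"
end

# to_i? returns nil instead of raising, for decimal and hexadecimal input alike.
puts too_big.to_i?.inspect        # => nil
puts "80000000".to_i?(16).inspect # => nil
puts "10ffff".to_i?(16).inspect   # => 1114111
```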

RX14 · 2016-10-07T16:19:38Z

Here's a list of entities and their Unicode code points. nbsp should certainly be fixed to be the correct code point, or removed.
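For reference, a small sketch of the difference being pointed out: the HTML 4 entity list gives nbsp as code point 160 (U+00A0, NO-BREAK SPACE), which is a different character from the ASCII space at code point 32. The variable names are illustrative only.

```crystal
nbsp = '\u00a0' # NO-BREAK SPACE, the character named by &nbsp;
space = ' '     # ordinary ASCII space

puts nbsp.ord      # => 160
puts space.ord     # => 32
puts nbsp == space # => false
```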

dukex · 2016-10-07T17:42:18Z

@chaniks and @RX14 I'll address these bugs in my next PR, ok?

dukex force-pushed the add-html-unescape branch from 219487c to 2370970 Compare October 3, 2016 11:01

dukex force-pushed the add-html-unescape branch from 2370970 to 3877c73 Compare October 3, 2016 11:04

asterite reviewed Oct 5, 2016


chaniks suggested changes Oct 6, 2016


add HTML.unescape [Closes crystal-lang#3107]

6d02e69

dukex force-pushed the add-html-unescape branch from 3877c73 to 6d02e69 Compare October 7, 2016 11:06

asterite merged commit 0b886b6 into crystal-lang:master Oct 7, 2016

dukex deleted the add-html-unescape branch October 7, 2016 11:39

dukex mentioned this pull request Oct 11, 2016

add support to many html entities in HTML.unescape #3409

Closed
