Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTML.unescape [Closes #3107] #3374

Merged
merged 1 commit into from
Oct 7, 2016

Conversation

dukex
Copy link
Contributor

@dukex dukex commented Oct 3, 2016

Given the ruby implementation CGI::unescapeHTML and the comments on #3226 I'm made this implementation of HTML.unescape.

I tried to translate the unescape for hexadecimal codes without success
Someone can help me? Thanks @chaniks

@chaniks
Copy link

chaniks commented Oct 3, 2016

$1.to_i(16) for $1.hex.

@dukex dukex force-pushed the add-html-unescape branch from 219487c to 2370970 Compare October 3, 2016 11:01
@dukex
Copy link
Contributor Author

dukex commented Oct 3, 2016

❤️ @chaniks thank you

@dukex dukex force-pushed the add-html-unescape branch from 2370970 to 3877c73 Compare October 3, 2016 11:04
end
when /\A#x([0-9a-f]+)\z/i
n = $1.to_i(16)
if n < charlimit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be n <= charlimit, since 0x10ffff is a valid unicode value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

charlimit = 0x10ffff

string.gsub(/&(apos|amp|quot|gt|lt|\#[0-9]+|\#[xX][0-9A-Fa-f]+);/) do |string, _match|
match = _match[1].dup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this dup doesn't seem to be needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ops, done!

Copy link

@chaniks chaniks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I have too many thoughts on this PR.
I'd better skip them..

But two things,

  • Integer overflow (e.g. &#2147483648;, &#x80000000)
  • &nbsp; (can be generated with HTML.escape—reversibility)

And maybe specifying behaviors like &#0099999999; => &#99999999; and &#X111111; => &#x111111; in specs might be a good idea, if it is intended.
And maybe reusing Char::MAX_CODEPOINT.

I guess I should stop here.

@dukex dukex force-pushed the add-html-unescape branch from 3877c73 to 6d02e69 Compare October 7, 2016 11:06
@asterite
Copy link
Member

asterite commented Oct 7, 2016

@dukex Thank you for this! ❤️

@asterite asterite merged commit 0b886b6 into crystal-lang:master Oct 7, 2016
@dukex
Copy link
Contributor Author

dukex commented Oct 7, 2016

Hi @chaniks

Fell free to send all your comments here, I want write the best code for the crystal and not just copy the ruby code, I appreciate your points here and I want read more.

Thanks for your comments,

I'm using the Char::MAX_CODEPOINT now, I searched for this constant but I maybe I would need to search more.

I update the code to unescape spaces(&nbsp;)

I think the the condition if n <= Char::MAX_CODEPOINT are handle with integer overflow error, you can explain more your comment?

You can help me with behaviors like &#0099999999; => &#99999999; it is correct change value or the code should returns &#0099999999;?

@dukex dukex deleted the add-html-unescape branch October 7, 2016 11:39
@chaniks
Copy link

chaniks commented Oct 7, 2016

@dukex to_i will raise an error if the string representation is out of integer range. :')

&nbsp; is actually not a space, but I believe people will report if it becomes an issue. (Sorry I wasn't specific. My writing skill is really bad.)

Please never worry about my other comments. I wanted to edit it, but it was a review so I couldn't.. 😢

p.s. If you want dig more, this link might help: https://www.w3.org/TR/html401/sgml/entities.html

@RX14
Copy link
Contributor

RX14 commented Oct 7, 2016

here's a list of entities and their unicode code points. nbsp should certainly be fixed to be the correct codepoint, or removed.

@dukex
Copy link
Contributor Author

dukex commented Oct 7, 2016

@chaniks and @RX14 I'll address this bugs in my next PR, ok?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants