-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support to many html entities in HTML.unescape #3409
Conversation
968a24e
to
00b8c93
Compare
i think this is little crazy in std, maybe shard? |
I also think it's a bit too much, not even Ruby does this replacement :-) |
when "nbsp" then " " | ||
when "apos" then "'" | ||
when "nbsp" then '\u{A0}' | ||
when "iexcl" then '\u{A1}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just interesting, what would be faster, case
or hash
lookup here? seems case should be slower, because of strcmp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know Java optimises this by using a case on the strings hash, then strcmp to prevent hash collisions. Its a simple compiler optimisation step that we might want to persue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case I'd put this hash in another file, load it at runtime once and do a hash lookup to simplify the code.
I don't think that this is too much for the standard library. Its a standardized, verifiably correct, verifiably complete set of replacements specific to html. It shouldn't require much, if any maintenance to the list, as I doubt html will be adding any more of these now Unicode is prevalent. |
but may be then implement html5 spec? which allowed entities without |
Personaly, I would rewrite this without the slow regex. |
00b8c93
to
6fc7cce
Compare
Hi guys, thank you for your comments, I update my patch, please review again |
@RX14 Any idea how avoid the regex? |
@@ -57,6 +53,7 @@ module HTML | |||
else | |||
"&#x#{$1};" | |||
end | |||
when /\w{2,8}/ then HTML::ENTITIES[match] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it should be HTML::ENTITIES[match]? || "&#{match};"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮 Thank you, I write a test for this case
6fc7cce
to
5e42a83
Compare
@dukex To avoid regex:
However, I believe the regex is not so bad for the starting point. As long as the code stays simple and readable, we can always tune the performance easily. But once the code is rewritten for the performance, it usually takes more effort to add features—and sometimes people give up touching it. I can't say which is better. Those companies I worked with usually had problems with less-readable codes, not slow codes. But this is an open-source project and also a std library for the language. I can't just apply the same rule. Of course, if you can achieve both, that might be the best. 😄 |
A conversion of arbitrary HTML entities to unescaped characters is a totally different application and a complete implementation is probably to extensive and specific to be included in the standard library. Besides, it can already be accomplished through the XML parser: XML.parse_html("<s>ý –").content # => "<s>ý –" |
Okay, I have been looking at implementations of HTML unescape in other languages. And it seems that most of them are actually resolving all (or almost all) HTML entities: Ruby I guess this makes some sense and Still, I would question if this method should be implemented in the standard lib anyway. There are not many use cases I can think of, where you would need to unescape HTML entities outside a HTML parser. In fact, at least in repos on Github it seems this method is used mostly in examples. As I have shown above, decoding of HTML entities can be easily accomplished through the |
Note: bracket = "
< < < < < < < <
< < < < < < <
< < < < < < <
< < < < < <
< < < < < < <
< < < < < <
< < < < < < < <
< < < < < <
< < < < < < < <
< < \x3c \x3C \u003c \u003C
"
require "html"
require "xml"
require "myhtml"
puts HTML.unescape(bracket)
puts XML.parse_html(bracket).content
puts Myhtml.decode_html_entities(bracket) |
Triage: I think that this pull request must be closed without any merge. |
Why? |
@da99 As it was already mentioned above, HTML entity expansion is quite a complex feature to implement. |
@straight-shoota < is entity in html5, see: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references |
Thanks @kostya! I didn't know that. I've updated my comment above. Okay, so for HTML entities, there is a list from W3C of currently 2331 entities which could (maybe even automatically) be inserted into a crystal source file. The complexity comes from the different entities between standards (HTML 4.0.1, XHTML, HTML5 or even XML?). If we decide to support only HTML5 (as Go seems to do, though it is not documented), the process becomes much more straightforward. I'd like to point out that |
👍 on the naming, it makes great sense to me. Not sure how useful |
I can't see why we need The reason is that you want |
Thanks @asterite |
Hi again,
Like I said in my PR #3374, I'm creating this PR to fix the
nbsp
and to add entities described in w3.org/TR/html401/sgml/entities.html to HTML.unescape.Someone have a better idea to organize this entities?
/cc @chaniks @RX14