Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML.fragment() ignores encoding #305

Closed
sunshineco opened this issue Jul 4, 2010 · 7 comments
Closed

HTML.fragment() ignores encoding #305

sunshineco opened this issue Jul 4, 2010 · 7 comments

Comments

@sunshineco
Copy link

Given the input:

# Windows: coding: IBM437
require 'nokogiri'
s = '<p>François</p>'
puts "[#{s.encoding}] #{s}"
d = Nokogiri::HTML.parse(s)
ds = d.css('p').to_xhtml
puts "[#{ds.encoding}] #{ds}"
f = Nokogiri::HTML.fragment(s)
fs = f.to_xhtml
puts "[#{fs.encoding}] #{fs}"

Output is:

C:\>ruby fragmentbug.rb
[IBM437] <p>François</p>
[IBM437] <p>Fran&#xE7;ois</p>
output error : string is not in UTF-8
[UTF-8] <p></p>

Note the failure of to_xhtml() in the fragment case. Specifically, HTML.fragment() provides no mechanism for dealing with encoding and instead assumes unconditionally that the incoming string is UTF-8: http://github.com/tenderlove/nokogiri/blob/REL_1.4.2/lib/nokogiri/html/document_fragment.rb#L8

HTML.parse(), on the other hand, interrogates the encoding of the incoming string if encoding is not specified explicitly: http://github.com/tenderlove/nokogiri/blob/REL_1.4.2/lib/nokogiri/html/document.rb#L71

Additional information:

C:\>ruby -v
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

C:\>nokogiri -v

---
warnings: []

nokogiri: 1.4.2.1
ruby:
  version: 1.9.1
  platform: i386-mingw32
libxml:
  binding: extension
  compiled: 2.7.7
  loaded: 2.7.7
@tenderlove
Copy link
Member

using encoding set on string when parsing document fragments. closed by 602d2a5

@sunshineco
Copy link
Author

Regarding 602d2a5: To improve support for Ruby <= 1.8.x, would it make sense for HTML.fragment() to accept an optional 'encoding' argument akin to the like-named HTML.parse() argument?

@tenderlove
Copy link
Member

Ya, that would probably be good. Added an encoding parameter here:

9490d0e

@sunshineco
Copy link
Author

Hi Aaron,

Thank you for the quick response to this bug report.

Regarding 9490d0e: As implemented, HTML.fragment() unconditionally ignores its 'encoding' argument if 'tags' responds to #encoding. This behavior differs dramatically from HTML.parse() which always employs 'encoding' when provided explicitly by the client. (In parse(), 'encoding' defaults to nil.) There are a few reasons why it might be best for HTML.fragment() to respect 'encoding' if provided:

  1. Consistency with HTML.parse().
  2. Presumption that client knows what he is doing when overriding the encoding already embedded in the string.
  3. Elimination of high surprise factor associated with unconditionally ignoring an incoming argument.

@tenderlove
Copy link
Member

I agree! In fact, my tests desire that functionality. This commit should take care of it: bde0aac

Thanks for being patient with me! :-D

@sunshineco
Copy link
Author

Thanks again, though I must bother you once more. In bde0aac, you missed the default argument value encoding='UTF-8' which was added to the higher-level HTML.fragment() method in 9490d0e. This also should default to nil. See: http://github.com/tenderlove/nokogiri/commit/9490d0e3353db528d17dcb188ef58859505f00d9#L0R27

@tenderlove
Copy link
Member

Hah. No problem. Thanks for catching this. Should be fixed here: a5df08d

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants