HTML.fragment() ignores encoding #305

sunshineco · 2010-07-04T19:22:00Z

Given the input:

# Windows: coding: IBM437
require 'nokogiri'
s = '<p>François</p>'
puts "[#{s.encoding}] #{s}"
d = Nokogiri::HTML.parse(s)
ds = d.css('p').to_xhtml
puts "[#{ds.encoding}] #{ds}"
f = Nokogiri::HTML.fragment(s)
fs = f.to_xhtml
puts "[#{fs.encoding}] #{fs}"

Output is:

C:\>ruby fragmentbug.rb
[IBM437] <p>François</p>
[IBM437] <p>Fran&#xE7;ois</p>
output error : string is not in UTF-8
[UTF-8] <p></p>

Note the failure of to_xhtml() in the fragment case. Specifically, HTML.fragment() provides no mechanism for dealing with encoding and instead assumes unconditionally that the incoming string is UTF-8: http://github.com/tenderlove/nokogiri/blob/REL_1.4.2/lib/nokogiri/html/document_fragment.rb#L8

HTML.parse(), on the other hand, interrogates the encoding of the incoming string if encoding is not specified explicitly: http://github.com/tenderlove/nokogiri/blob/REL_1.4.2/lib/nokogiri/html/document.rb#L71
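On an affected version, one possible workaround (not part of the original report, so treat it as an assumption) is to transcode the string to UTF-8 yourself before handing it to HTML.fragment(), since that is the encoding the method assumes. A minimal sketch using only core Ruby 1.9+ String methods; `to_utf8` is a hypothetical helper name, not a Nokogiri API:

```ruby
# Hypothetical helper: return a UTF-8 copy of the string so that code
# which unconditionally assumes UTF-8 input sees valid bytes.
def to_utf8(str)
  # Ruby 1.8 strings carry no encoding; pass them through untouched.
  return str unless str.respond_to?(:encoding)
  str.encode('UTF-8')
end

# Build the sample text as IBM437 bytes (0x87 is c-cedilla in IBM437),
# independent of the encoding of this source file.
s = "Fran\x87ois".dup.force_encoding('IBM437')

utf8 = to_utf8(s)
puts "[#{utf8.encoding}] #{utf8}"
```

The transcoded string could then be passed to Nokogiri::HTML.fragment() in place of the original.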

Additional information:

C:\>ruby -v
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

C:\>nokogiri -v

---
warnings: []

nokogiri: 1.4.2.1
ruby:
  version: 1.9.1
  platform: i386-mingw32
libxml:
  binding: extension
  compiled: 2.7.7
  loaded: 2.7.7


tenderlove · 2010-07-04T23:02:27Z

using encoding set on string when parsing document fragments. closed by 602d2a5

sunshineco · 2010-07-05T16:00:15Z

Regarding 602d2a5: To improve support for Ruby <= 1.8.x, would it make sense for HTML.fragment() to accept an optional 'encoding' argument akin to the like-named HTML.parse() argument?

tenderlove · 2010-07-05T18:03:23Z

Ya, that would probably be good. Added an encoding parameter here:

9490d0e

sunshineco · 2010-07-06T10:57:33Z

Hi Aaron,

Thank you for the quick response to this bug report.

Regarding 9490d0e: As implemented, HTML.fragment() unconditionally ignores its 'encoding' argument if 'tags' responds to #encoding. This behavior differs dramatically from HTML.parse() which always employs 'encoding' when provided explicitly by the client. (In parse(), 'encoding' defaults to nil.) There are a few reasons why it might be best for HTML.fragment() to respect 'encoding' if provided:

- Consistency with HTML.parse().
- Presumption that client knows what he is doing when overriding the encoding already embedded in the string.
- Elimination of high surprise factor associated with unconditionally ignoring an incoming argument.
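The precedence being argued for can be sketched in plain Ruby as follows. This is only an illustration of the proposed rule, not the code in any of the referenced commits; `resolve_encoding` and `DEFAULT_ENCODING` are hypothetical names:

```ruby
# Assumed fallback for input that carries no encoding (e.g. Ruby 1.8 strings).
DEFAULT_ENCODING = 'UTF-8'

# Hypothetical helper: decide which encoding a fragment parser should use.
def resolve_encoding(tags, encoding = nil)
  # 1. An explicit argument from the client always wins.
  return encoding if encoding
  # 2. Otherwise interrogate the string itself (Ruby 1.9+).
  return tags.encoding.name if tags.respond_to?(:encoding)
  # 3. Otherwise fall back to the default.
  DEFAULT_ENCODING
end

s = "Fran\x87ois".dup.force_encoding('IBM437')
puts resolve_encoding(s)                # encoding taken from the string
puts resolve_encoding(s, 'ISO-8859-1')  # explicit argument overrides it
```

With this ordering the argument is never silently ignored, which matches how HTML.parse() is described above.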

tenderlove · 2010-07-06T17:05:29Z

I agree! In fact, my tests desire that functionality. This commit should take care of it: bde0aac

Thanks for being patient with me! :-D

sunshineco · 2010-07-06T17:18:08Z

Thanks again, though I must bother you once more. In bde0aac, you missed the default argument value encoding='UTF-8' which was added to the higher-level HTML.fragment() method in 9490d0e. This also should default to nil. See: http://github.com/tenderlove/nokogiri/commit/9490d0e3353db528d17dcb188ef58859505f00d9#L0R27
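To illustrate why the default matters, here is a small stand-alone sketch. All method names are hypothetical stand-ins for the two layers and are not Nokogiri's actual code. When the outer method defaults to 'UTF-8', the inner method cannot distinguish "client asked for UTF-8" from "client said nothing", so the string's own encoding is never consulted:

```ruby
# Stand-in for the lower-level method: an explicit encoding wins,
# otherwise the string's own encoding is used.
def low_level_fragment_encoding(tags, encoding = nil)
  encoding || (tags.respond_to?(:encoding) ? tags.encoding.name : 'UTF-8')
end

# Outer wrapper with a 'UTF-8' default: detection is always bypassed.
def fragment_with_utf8_default(tags, encoding = 'UTF-8')
  low_level_fragment_encoding(tags, encoding)
end

# Outer wrapper with a nil default: detection still happens.
def fragment_with_nil_default(tags, encoding = nil)
  low_level_fragment_encoding(tags, encoding)
end

s = "Fran\x87ois".dup.force_encoding('IBM437')
puts fragment_with_utf8_default(s)
puts fragment_with_nil_default(s)
```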

tenderlove · 2010-07-06T18:44:22Z

Hah. No problem. Thanks for catching this. Should be fixed here: a5df08d

This issue was closed.
