Output of #to_xml munged beyond certain file size using UTF-16 declaration #752
Comments
An alternative fix/workaround comes from the Stack Overflow question. Instead of:
...use:
This produces non-munged output.
@Phrogz, thanks for opening this issue and apologies for the embarrassingly long time it's taken to respond. This is likely a libxml2 parsing bug. It feels similar in nature to these:
and I'll try to fix and send a PR upstream ... might need a few days.
Phew, this was a tricky one to figure out, but it turns out that Nokogiri wasn't using the proper encoding after libxml2 flushed its internal buffer for the first time. Any UTF-16 document longer than ~4000 code points would trigger this bug.
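To check whether a given Nokogiri build is affected, a document above the ~4000 code point threshold can be generated synthetically. What follows is only a minimal sketch, not the reporter's original file: the helper names build_sample_xml and to_utf16le_with_bom, the element names, and the payload size of 5000 are all assumptions chosen for illustration. The round trip through Nokogiri is skipped if the gem is not installed.

```ruby
# Hypothetical helper: returns a UTF-8 XML string that declares UTF-16 and
# carries a CDATA payload of `payload_chars` characters.
def build_sample_xml(payload_chars)
  payload = "x" * payload_chars
  %(<?xml version="1.0" encoding="UTF-16"?>\n<Foo><Bar><![CDATA[#{payload}]]></Bar></Foo>\n)
end

# Hypothetical helper: encodes the document as UTF-16LE with a leading BOM.
def to_utf16le_with_bom(xml)
  ("\uFEFF" + xml).encode(Encoding::UTF_16LE)
end

xml   = build_sample_xml(5000)   # comfortably above ~4000 code points
bytes = to_utf16le_with_bom(xml)
puts "code points: #{xml.length}, bytes: #{bytes.bytesize}"

# Optional round trip, only attempted when Nokogiri is available.
begin
  require "nokogiri"
  doc      = Nokogiri::XML(bytes.b)   # binary string; libxml2 detects the BOM
  bar_text = doc.at_xpath("//Bar")&.text
  puts "parsed <Bar> length: #{bar_text&.length.inspect}"
  puts "serialized size:     #{doc.to_xml.bytesize} bytes"
rescue LoadError
  puts "nokogiri not installed; skipping round trip"
end
```

Comparing the serialized output against the input on a release before and after v1.14.0 is one way to confirm the behavior described above; lowering the payload size below the threshold should make the difference disappear.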
Fixed by #2434; the fix will be in the next minor release of Nokogiri (v1.14.0).
Also see the related #2447.
For more details see http://stackoverflow.com/q/12162548/405017
Given a file on disk with UTF-16LE encoding and the contents:
The output of reading in this file and calling to_xml is broken:
If I delete some of the text out of the <Bar> CDATA, the output is fixed.
I can query and serialize elements that are munged in the output just fine:
If I remove the XML declaration from the input before parsing the document, the output is fixed:
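One way to strip the declaration before parsing, as a rough sketch using only the Ruby standard library: transcode the raw UTF-16LE bytes to UTF-8, then remove a leading BOM and XML declaration. The helper names utf16le_bytes_to_utf8 and strip_xml_declaration and the sample document are hypothetical and are not taken from the report.

```ruby
# Hypothetical helper: interprets raw bytes as UTF-16LE and returns UTF-8.
def utf16le_bytes_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end

# Hypothetical helper: drops an optional BOM and a leading XML declaration
# from a UTF-8 string, leaving the rest of the document untouched.
def strip_xml_declaration(xml)
  xml.sub(/\A\uFEFF?\s*<\?xml\b.*?\?>\s*/m, "")
end

# Stand-in for the bytes of a UTF-16LE file read in binary mode.
raw = "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<Foo><Bar>hi</Bar></Foo>"
        .encode(Encoding::UTF_16LE).b

cleaned = strip_xml_declaration(utf16le_bytes_to_utf8(raw))
puts cleaned
```

The cleaned UTF-8 string could then be handed to Nokogiri::XML in place of the original input. Because the declaration is gone, the parsed document no longer records UTF-16 as its encoding, which is presumably why the serialized output stops being munged.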
Nokogiri 1.5.5 on Ruby 1.9.3p194 (2012-04-20) [i386-mingw32] on Windows 7