[bug] Nokogiri::XML::Reader#inner_xml returns NCR encoded attributes even if the encoding is set to utf-8 in #from_memory call. #2891
Thanks for reporting this. The behavior of the DOM parsing methods is slightly different, which is interesting:
#! /usr/bin/env ruby
require 'bundler/inline'
gemfile do
source 'https://rubygems.org'
gem 'nokogiri', '1.13.8'
end
require 'nokogiri'
xml = <<~XML
<test><anotación tipo="inspiración">(inspiración)</anotación></test>
XML
Nokogiri::XML::Document.parse(xml).to_xml
# => "<?xml version=\"1.0\"?>\n" +
# "<test>\n" +
# " <anotación tipo=\"inspiración\">(inspiración)</anotación>\n" +
# "</test>\n"
Nokogiri::XML::Document.parse(xml, nil, "UTF-8").to_xml
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
# "<test>\n" +
# " <anotación tipo=\"inspiración\">(inspiración)</anotación>\n" +
# "</test>\n"
I'll investigate!
Sorry for not updating sooner. This appears to be intentional libxml2 behavior when a document does not declare an encoding. However, I have some changes on a branch to work around this behavior within Nokogiri. Just need an hour or two to explore edge cases and build confidence that it's not breaking anything else encoding-related.
The issue is in libxml2's serializer. If it wasn't told to use a specific encoding, it will use the one from the XML declaration. If there's no encoding in the XML declaration, it will encode non-ASCII characters with NCRs. (This is confusing since UTF-8 is the XML default and shouldn't make a difference. It probably comes from a time when UTF-8 wasn't as ubiquitous.) The best solution is to use "UTF-8" instead of NULL when serializing documents with
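The fallback described in this comment can be modeled in a few lines of plain Ruby. This is only a rough sketch: `ncr_escape`, `output_encoding`, and `serialize_attribute` are hypothetical helpers written for illustration, not Nokogiri or libxml2 functions, and the hexadecimal reference form is an assumption about how the NCRs are written.

```ruby
# Hypothetical model of the serializer's encoding choice described above.
# None of these helpers are Nokogiri or libxml2 APIs.

# Replace every non-ASCII character with a hexadecimal numeric character reference.
def ncr_escape(text)
  text.gsub(/[^[:ascii:]]/) { |ch| format("&#x%X;", ch.ord) }
end

# Pick the output encoding: the explicit argument first, then the XML declaration.
# With `default: nil` this models the behavior reported in the issue;
# with `default: "UTF-8"` it models the proposed workaround.
def output_encoding(explicit, declared, default: nil)
  explicit || declared || default
end

# Serialize an attribute value: fall back to NCRs when no encoding was chosen.
def serialize_attribute(value, encoding)
  encoding.nil? ? ncr_escape(value) : value.encode(encoding)
end

value = "inspiración"

# No encoding passed and none declared: non-ASCII characters become NCRs.
puts serialize_attribute(value, output_encoding(nil, nil))

# Same input, but defaulting to UTF-8: the text is left as-is.
puts serialize_attribute(value, output_encoding(nil, nil, default: "UTF-8"))
```

In this model the only difference between the two calls is the final fallback value, which is why defaulting to "UTF-8" instead of NULL is enough to avoid the NCR output.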
@nwellnhof Thanks for validating -- my working branch is doing exactly that: setting
See #3084
**What problem is this PR intended to solve?**
Default Reader node encoding to UTF-8 when it's not specified either as a method param or in the document.
Fixes #2891
**Have you included adequate test coverage?**
Yes, I've added coverage.
**Does this change affect the behavior of either the C or the Java implementations?**
Yes, this updates the C implementation but does not update the Java implementation because Reader encoding is already wonky there in a few edge cases.
Please describe the bug
Nokogiri::XML::Reader#inner_xml returns NCR-encoded attributes even if the encoding is set to utf-8 in the #from_memory call.
It does not happen if the XML input sets the encoding with <?xml version="1.0" encoding="UTF-8"?>.
It only happens to attributes; elements and text nodes are correctly encoded.