Nokogiri lose < > #1294

Closed
Able1991 opened this issue Jun 4, 2015 · 2 comments

Comments

@Able1991

Able1991 commented Jun 4, 2015

The XML contains a special character: �
If it is not removed from the document, every &lt; and &gt; after it is lost.

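require 'nokogiri'

part = File.read('part.xml')

# Replacing the special character before parsing keeps &lt; and &gt; intact.
# The first argument to gsub is the special character (not printable here).
File.open('good.xml', 'wb') do |f|
  f.puts Nokogiri::XML(part.gsub('', '-')).to_xml
end

# Parsing the raw input loses every &lt; and &gt; after the special character.
File.open('bad.xml', 'wb') do |f|
  f.puts Nokogiri::XML(part).to_xml
end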

'part.xml': https://gist.github.com/Able1991/030665226c03478747bc
Is this a known problem? Can any other character codes lead to the same errors?

@twalpole
Contributor

In XML 1.0, characters below &#32; (&#x20;), other than 0x9, 0xA, and 0xD, are illegal in a document ( http://www.w3.org/TR/REC-xml/#charsets ), so technically your XML document is illegal. That being said, skipping the subsequent < and > characters in the document is strange behavior.
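
As a rough illustration of the character rule above, the following sketch replaces every control character below 0x20 except 0x9, 0xA, and 0xD before the text is handed to a parser. The helper name `strip_illegal_xml_chars`, the constant name, and the sample string are hypothetical and not part of Nokogiri; the sample uses U+001A only as a stand-in, since the actual character in 'part.xml' is not shown here.

```ruby
# Hypothetical helper, not a Nokogiri API.
# Matches control characters below 0x20 that XML 1.0 forbids,
# leaving tab (0x9), line feed (0xA), and carriage return (0xD) alone.
ILLEGAL_XML_CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

def strip_illegal_xml_chars(text, replacement = '-')
  text.gsub(ILLEGAL_XML_CONTROL_CHARS, replacement)
end

# Assumed sample input: U+001A stands in for the unknown special character.
sample  = "<a>one\u001Atwo &lt;b&gt;\tthree</a>"
cleaned = strip_illegal_xml_chars(sample)
puts cleaned
```

The cleaned string could then be passed to `Nokogiri::XML` in place of the raw input, much like the `gsub` call in the original report. This only covers the range below 0x20 mentioned above, not every code point the XML 1.0 specification excludes.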

@flavorjones
Member

All, this behavior depends on the underlying XML/HTML parsing library, not on Nokogiri itself.


3 participants