newlines treated differently in pure java implementation #444

Closed
tmm1 opened this issue Apr 7, 2011 · 5 comments

Comments


tmm1 commented Apr 7, 2011

kind of a stupid bug to be reporting, but it's breaking some tests that assume the MRI behavior.

in MRI, the trailing newline appears to be stripped:

$ ruby -rubygems -ve' require "nokogiri"; p Nokogiri::VERSION; p Nokogiri::HTML::DocumentFragment.parse("<p>hi</p>\n").to_html '
ruby 1.8.7 (2010-12-23 patchlevel 330) [i686-darwin10.5.0]
"1.5.0.beta.4"
"<p>hi</p>"

in the java version, the newline is passed through to the output:

$ ruby -rubygems -ve' require "nokogiri"; p Nokogiri::VERSION; p Nokogiri::HTML::DocumentFragment.parse("<p>hi</p>\n").to_html '
jruby 1.6.0 (ruby-1.8.7-p330) (2011-04-06 aa7d946) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_24) [darwin-x86_64-java]
"1.5.0.beta.4"
"<p>hi</p>\n"
Member

yokolet commented Apr 7, 2011

This is not a stupid bug. This is a rather serious parsing problem. The Java version creates a Text node containing "\n" as a sibling of the p element, while the libxml version doesn't. I'll have a look.

Contributor

headius commented Apr 7, 2011

If I remember right, there are modes in most of the Java parsers to preserve all whitespace or not. I think that's what's at play here.

Member

yokolet commented Apr 8, 2011

@headius Xerces does have that option, but it needs a schema (grammar) to decide which whitespace is ignorable. If no schema is given, the ignore-whitespace option does nothing. Xerces is really strict.

Member

yokolet commented Apr 8, 2011

@tmm1 The bug is fixed in rev. 3f2e575

When no grammar is given, trailing whitespace is now trimmed from the fragment string. If possible, could you try this on the master branch?
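
The trimming described above can be illustrated in plain Ruby. This is only a rough sketch of the idea: `trim_fragment` is a hypothetical helper name, not part of Nokogiri, and the actual change in rev. 3f2e575 lives in the Java implementation and may work differently.

```ruby
# Hypothetical helper (not a Nokogiri API): return the fragment source
# with trailing whitespace removed. String#rstrip returns a new string,
# so the caller's original string is left untouched.
def trim_fragment(source)
  source.rstrip
end

input = "<p>hi</p>\n"
p trim_fragment(input)   # the trimmed copy that would be handed to the parser
p input                  # the original still ends with the newline
```

With the trailing "\n" gone before parsing, no whitespace-only Text node would be created after the p element.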

Member

yokolet commented May 11, 2011

This should be closed.

yokolet closed this as completed May 11, 2011

jvshahid added a commit that referenced this issue Feb 18, 2016

The fix for #444 turns out to cause issues with frozen strings
(see #1077). Furthermore, MRI as of this commit behaves similarly to
JRuby, i.e. it adds the extra newline at the end of the fragment.
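
As a hedged illustration of how trimming can clash with frozen strings: the sketch below assumes the problem comes from modifying the caller's string in place. That is an assumption for the sake of the example, not a description of the actual Java code or of #1077.

```ruby
source = "<p>hi</p>\n".freeze

# In-place trimming raises on a frozen string
# (FrozenError on newer Rubies, a RuntimeError on older ones).
in_place_error = nil
begin
  source.rstrip!
rescue => e
  in_place_error = e
end

# Non-mutating trimming returns a new string and leaves the input alone.
trimmed = source.rstrip

p in_place_error.class
p trimmed
```

Any trimming step that works on a copy avoids the error, at the cost of one extra string allocation.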