-
-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803
Comments
Hello! And thank you for raising this issue. What you're seeing is due to the change in 2e89378, whose commit message I'll reproduce here to help drive the conversation:
There are many differences between the parsers used in the JRuby and the MRI/YARV implementations. Xerces/NekoHTML is used by JRuby and libxml2 is used by MRI/YARV. Nokogiri is intended to be a Ruby wrapper around these parsers, and so making them behave identically is an anti-goal. In particular, "fixing up markup" has shown itself to be a rich vein of frustration for users who expect the two implementations to behave the same (or who expect either of them to behave the same as Browser X). And I'm sorry about that. I hope I've explained what's going on (underlying parsers behaving differently, and a broken monkeypatch was removed which attempted to make them behave the same in this case) in such a way that it seems defensible that Nokogiri allows this difference in behavior to exist. |
@flavorjones, thank you for the explanation. What you wrote make sense, and I can understand the challenge at hand, though I can't say I'm very comfortable with the differences. For me, it's very important that libraries behave the same across various Ruby implementations. That's one of the key benefits of using a library over maintaining code inside the application. I'll just have to keep in mind that nokogiri doesn't subscribe to that model when choosing to use it. I think what caught me off guard the most is that this change came in a patch release (1.8.2 to 1.8.3). I was definitely not expecting the behavior of nokogiri to fundamentally change by making that upgrade. Had it been a major or even a minor version, I might have consulted the changelog carefully. In the future, I'd ask that you consider reserving such changes for major (ideally) or minor releases so that users like myself aren't surprised when upgrading to address a security bulletin. |
Starting with Nokogiri 1.8.3, JRuby introduces a
tbody
tag in the parsed document where there isn't one in the source. This causes an inconsistently in the XPath (and CSS) query results between MRI and JRuby.Consider the following HTML fragment:
Now consider the XPath query:
When using Nokogiri 1.8.5 on MRI (or Nokogiri <= 1.8.2 on JRuby), the size of the result set of this query is:
However, when using Nokogiri 1.8.5 on JRuby, the size of the result set changes to:
The following program can be used to demonstrate this problem:
Note that the equivalent CSS query also exhibits the same problem:
The problem can be seen by converting the parsed document to a string, which outputs:
Nokogiri should not be introducing a
tbody
tag in the parsed HTML tree where there isn't one in the source (or MRI should do the same).Nokogiri version: 1.8.5
Ruby version: jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.181-b15 on 1.8.0_181-b15 +jit
OS: linux-x86_64
The text was updated successfully, but these errors were encountered: