I have to handle an invalid HTML document.

First I get the XPath of a selected element with a Javascript function like the one Firebug used (https://github.com/firebug/firebug/blob/master/extension/content/firebug/lib/xpath.js -> Xpath.getElementTreeXPath).

Using this script I get something like /p[4] when I select the last p. This is because the browser "fixes" the invalid p (i.e. the second one, which has the div inside) by adding a closing p right before the div and a second opening p right after the closing div, just before the original closing p. This results in an extra p, making the last p the fourth one, although there are only three p elements in the original document.

Because the referred Javascript function uses previousSibling, the browser lets it walk through all the p elements (including the extra one).
After obtaining the XPath (/p[4]) I try to get the node's content by invoking at_xpath, as in the following test:
require 'nokogiri'
require 'minitest/autorun'

class Test < MiniTest::Spec
  describe "Node#at_xpath" do
    it "should add an extra p after div" do
      html = <<~HTML
        <html>
          <body>
            <p>1</p>
            <p>
              <div>2</div>
            </p>
            <p>3</p>
          </body>
        </html>
      HTML
      doc = Nokogiri::HTML::Document.parse(html)
      # This p is added by the browser and should therefore be empty
      assert_equal '', doc.at_xpath("/html/body/p[3]").text
      assert_equal '3', doc.at_xpath("/html/body/p[4]").text
    end
  end
end
The problem is that Nokogiri simply drops the closing p after the div instead of fixing it the way the browser does (I checked Chrome, Firefox and IE, which all act the same way and add an extra p).
While Nokogiri fixes many other mistakes in the markup (https://nokogiri.org/tutorials/ensuring_well_formed_markup.html), I don't understand why Nokogiri acts differently from the mainstream browsers in this case, and there is no parse option to change this behavior. Is this a bug or a "feature"?
Hi, thanks for asking this question, and sorry you're having problems.
HTML and XML parsers, generally speaking, will parse well-formed markup identically, because there's a spec. There is no spec and no formal W3C guidance on how to correct or "fix up" malformed markup, and so every parser seems to do it differently. You're likely seeing differences between your browser's parser and libxml2 (which is what Nokogiri uses). There's nothing we can easily do to change this behavior, unfortunately.