-
-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug] HTML5 foreign element namespaces should not be required by CSS queries #2376
Comments
@danblaker Thanks for opening this issue, and sorry you're having problems. The behavior you're describing here seems to have changed in Nokogumbo at some point between v1.5.0 and v2.0.3 (note that Nokogiri imported Nokogumbo v2.0.5). I wasn't easily able to find the specific commit that changed this behavior because it was challenging to build old versions of Nokogumbo. (Maybe @stevecheckoway knows?) In any case, the document you've provided, when parsed by libgumbo, has a namespace of
|
@flavorjones From what I've been able to dig up, this likely changed in Nokogumbo v2.0.0 as a part of their support for the embedded HTML5 namespaces. While the proposed workaround is functional, I'm still uncertain whether it's desirable. HTML5 has two documented syntaxes: HTML and XML. The XML syntax is not particularly difficult to handle overall, since it's an XML document at its core. The HTML syntax (which is presumably what's handled by Nokogumbo) is different in how it is parsed, processed, and serialized. Part of what's specified in HTML syntax is that So I suppose the "correct" resolution to this issue depends on what Nokogiri's goals are. If Nokogiri is functionally intended to be an XML manipulation tool (with support for HTML -> XML -> HTML processing), the described solution above accurately describes the XML representation of the problem space. If Nokogiri is intended to be a DOM manipulation tool, however, it would seem preferable to adhere to the conventions of the document type. |
This is almost certainly my fault. Namespaces on elements in HTML is tricky. Let me just point out as an aside that the HTML spec doesn't really define how to write HTML documents in XML. It says (emphasis mine),
I completely agree that querying by CSS selector should definitely work as you expect. I.e., it should follow the rules for CSS. I have no idea what the "correct" behavior for XPATH querying should be though. HTML documents use a modified XPATH 1.0. This change is incompatible with XPATH 1.0. (There are similar changes to XSLT.) This suggests that we'd need different behavior depending on whether the document is an XML document or an HTML document. One idea (and I think the original Nokogumbo implementation, until I changed it), is to not have foreign nodes live in a namespace. This violates the HTML standard and thus I thought it important to not do that. I took your example document and opened it in Firefox and ran document.querySelector("rect").namespaceURI in the development console. As expected, I converted your example to an XHTML document. <html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>svg test</title>
</head>
<body>
<div id="svg-container">
<svg version="1.1" width="300" height="200" xmlns="http://www.w3.org/2000/svg">
<rect width="100%" height="100%" fill="red" />
<circle cx="150" cy="100" r="80" fill="green" />
<text x="150" y="125" font-size="60" text-anchor="middle" fill="white">SVG</text>
</svg>
</div>
</body>
</html> With both the HTML syntax and the XHTML syntax, the following code has the same behavior. document.querySelector("rect") // returns the rect
document.evaluate('//rect', document.body).iterateNext() // returns null
document.evaluate('//svg:rect', document.body, (prefix) => prefix === 'svg' && "http://www.w3.org/2000/svg").iterateNext() // returns the rect So the first and third return the rect and the second returns null. I think that suggests the correct way forward here. When operating on an HTML document (which XHTML is, as confirmed by
I'm not familiar with how Nokogiri implements @flavorjones Any idea how difficult it would be to change |
👋 Thanks for jumping in here @stevecheckoway.
Your path of reasoning leading to this suggestion feels very reasonable to me, and I can easily get behind this as the "right" behavior. I'm curious to head what @pvande and @danblaker think!
CSS queries are converted into XPath queries by
I need to explore a bit to figure out the best option, but we do have the ability to modify (or augment) the compilation process for CSS queries. Probably this is the best bet: to compile CSS queries into XPath that ignores the namespace of the node and selects only on the bare node name. |
I agree with the approach @stevecheckoway describes. If Nokogiri's selector methods on an HTML5 doc return identical results compared to selectors in the Firefox console for the same doc, I think that would meet the expectation of any user trying to parse HTML5 with Nokogiri. |
@danblaker's said it best — my expectation is parity with browsers. Thanks for digging in here, @stevecheckoway! |
OK, I'll play around with it a bit this week. In the meantime, you may want to try to emulate this mode by calling Note that this approach affects both CSS and XPath queries because it's nuked all the namespace declarations and references from the document: # before
doc.at_css('svg', "xmlns" => "http://www.w3.org/2000/svg")
doc.at_xpath('//xmlns:svg', "xmlns" => "http://www.w3.org/2000/svg")
doc.remove_namespaces!
# after
doc.at_css('svg')
doc.at_xpath('//svg') |
Would love a 👍 or a clarifying comment on these tests, which is what I'm going to try to make pass on my feature branch: describe "HTML5 foreign elements" do
let(:html) { <<~HTML }
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>svg test</title>
</head>
<body>
<div id="svg-container">
<svg version="1.1" width="300" height="200" xmlns="http://www.w3.org/2000/svg">
<rect width="100%" height="100%" fill="red" />
<circle cx="150" cy="100" r="80" fill="green" />
<text x="150" y="125" font-size="60" text-anchor="middle" fill="white">SVG</text>
</svg>
</div>
</body>
</html>
HTML
let(:ns) { { "xmlns" => "http://www.w3.org/2000/svg" } }
describe "in an XML doc" do
let(:doc) { Nokogiri::XML::Document.parse(html) }
it "requires namespace in XPath queries" do
assert_empty(doc.xpath("//svg"))
refute_empty(doc.xpath("//xmlns:svg", ns))
end
it "requires namespace in CSS queries" do
assert_empty(doc.css("svg"))
refute_empty(doc.css("xmlns|svg", ns))
end
end
describe "in an HTML4 doc" do
let(:doc) { Nokogiri::HTML4::Document.parse(html) }
it "omits namespace in XPath queries" do
refute_empty(doc.xpath("//svg"))
assert_empty(doc.xpath("//xmlns:svg", ns))
end
it "omits namespace in CSS queries" do
refute_empty(doc.css("svg"))
assert_empty(doc.css("xmlns|svg", ns))
end
end
describe "in an HTML5 doc" do
let(:doc) { Nokogiri::HTML5::Document.parse(html) }
it "requires namespace in XPath queries" do
assert_empty(doc.xpath("//svg"))
refute_empty(doc.xpath("//xmlns:svg", ns))
end
it "omits namespace in CSS queries" do
refute_empty(doc.css("svg"))
assert_empty(doc.css("xmlns|svg", ns))
end
end
end Note that the only failing test right now is |
@flavorjones My only caution about this change is that I don't actually know what the correct HTML4 behavior should be. I suspect your proposal is correct, but IIRC Regardless, the behavior may match the expectation. |
I'm 👍 but I don't think |
@stevecheckoway Sorry, do you mean the use of "xmlns" in the test? If so, I'm happy to change because in the context of the test it's an arbitrary prefix (and supported by the underlying libraries despite its reserved status). |
i.e., this change is what you're proposing? diff --git a/test/xml/test_xpath.rb b/test/xml/test_xpath.rb
index 8a898f9..3a185b7 100644
--- a/test/xml/test_xpath.rb
+++ b/test/xml/test_xpath.rb
@@ -628,19 +628,19 @@ def collision(nodes)
</html>
HTML
- let(:ns) { { "xmlns" => "http://www.w3.org/2000/svg" } }
+ let(:ns) { { "nsfoo" => "http://www.w3.org/2000/svg" } }
describe "in an XML doc" do
let(:doc) { Nokogiri::XML::Document.parse(html) }
it "requires namespace in XPath queries" do
assert_empty(doc.xpath("//svg"))
- refute_empty(doc.xpath("//xmlns:svg", ns))
+ refute_empty(doc.xpath("//nsfoo:svg", ns))
end
it "requires namespace in CSS queries" do
assert_empty(doc.css("svg"))
- refute_empty(doc.css("xmlns|svg", ns))
+ refute_empty(doc.css("nsfoo|svg", ns))
end
end
@@ -649,12 +649,12 @@ def collision(nodes)
it "omits namespace in XPath queries" do
refute_empty(doc.xpath("//svg"))
- assert_empty(doc.xpath("//xmlns:svg", ns))
+ assert_empty(doc.xpath("//nsfoo:svg", ns))
end
it "omits namespace in CSS queries" do
refute_empty(doc.css("svg"))
- assert_empty(doc.css("xmlns|svg", ns))
+ assert_empty(doc.css("nsfoo|svg", ns))
end
end
@@ -663,12 +663,12 @@ def collision(nodes)
it "requires namespace in XPath queries" do
assert_empty(doc.xpath("//svg"))
- refute_empty(doc.xpath("//xmlns:svg", ns))
+ refute_empty(doc.xpath("//nsfoo:svg", ns))
end
it "omits namespace in CSS queries" do
refute_empty(doc.css("svg"))
- assert_empty(doc.css("xmlns|svg", ns))
+ assert_empty(doc.css("nsfoo|svg", ns))
end
end
end |
Yeah, that's precisely what I was suggesting. |
OK, I've got a branch that allows the XPathVisitor class (which is doing the translation from CSS selector to XPath query) to know whether it's operating on an HTML5 document. I cleaned up some rough edges along the way and feel pretty good about that code. However, now I'm faced with the decision about what XPath to use. I believe the right choice is to use #! /usr/bin/env ruby
require "nokogiri"
require "benchmark/ips"
doc = Nokogiri::HTML5::Document.parse(File.read("test/files/tlm.html"))
q_element_name = "//span"
q_local_name = "//*[local-name()='span']"
raise("queries are not equivalent") unless doc.xpath(q_local_name).length == doc.xpath(q_element_name).length
Benchmark.ips do |x|
x.report("element name") do
doc.xpath(q_element_name).length
end
x.report("local name") do
doc.xpath(q_local_name).length
end
x.compare!
end
That is, the I may play with a builtin function implemented in C to speed this up; but if anyone has other ideas please comment! |
Sigh, libxml2 already has this implemented pretty efficiently as Is everybody OK taking a 10x performance hit on CSS queries in HTML5 docs? If not, what alternatives am I not thinking of? |
Ouch! 10x performance hit is pretty significant. Any idea what makes it so much slower? |
Plain element name matching is inlined as part of xpath evaluation: But calling an xpath function introduces that function call overhead, as well as internal stack management overhead; plus unnecessary string allocations are being made; and all that overhead is invoked for every element in the document: Looking at these graphs, though, the memory allocation component seems to dominate, and I bet that we could avoid that with a builtin that did the comparison without returning strings (similar to what I did with the builtin |
Following up: writing a C builtin improved things slightly, from ~10x slower to ~6x slower. The xpath function overhead and memory allocation is still pretty noticeable: So then I dug into libxml2 and was able to hack in limited namespace wildcarding, which is almost as fast as a standard child-axis search, about 1.1x slower. Here's the stack analysis for posterity: and here's the benchmark:
|
So I'm about to submit a PR that I'm hoping to get into the next release that will use:
|
PR is #2403 |
PR has been merged, this will be fixed in v1.13.0 (release is imminent) |
After upgrading to Nokogiri v1.12.0, I am unable to target inline svg elements via css or xpath selectors. Sample console commands:
Expected behavior
Using the previous versions of Nokogiri (v 1.10.10) + nokogumbo (1.5.0), I was able to target SVG elements without any problems.
Environment
The text was updated successfully, but these errors were encountered: