-
-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug] Node#to_xhtml does not have self-closing tags when using libxml 2.9.11 and later #2324
Comments
Hi @haxxxton, thanks for opening this issue, and sorry you're seeing something unexpected. Without working code, I can't confidently diagnose what you're seeing. However, v1.11.4 upgraded libxml2 from 2.9.10 to 2.9.12 which may have introduced some behavior changes in serialization. My guess is that you're using Because Nokogiri is just a thin wrapper around the underlying parser, there's nothing we can easily to do change this behavior in the gem. The best thing to do might be to send a bug report upstream and see what the maintainers say -- perhaps this was an unintended side effect of that commit. |
@flavorjones thanks for the speedy reply, apologies for missing the code block that actually calls nokogiri. You are correct, we are using
As an aside, all the docs and code i can see suggest that i should be passing NOBLANKS as the 4th arg into HTML5, but when i do i get an error:
Is this the correct place to be putting NOBLANKS? is my environment potentially still looking at an older version of nokogiri? |
@haxxxton Ugh, so I spent some more time looking at this this morning, and the good news is that I have a workaround for you. First, a bit of explanation and context-setting. Here's a script that reproduces the issue: #! /usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML("<html><body><table><colgroup><col span='2'>")
puts doc.to_xhtml With Nokogiri v1.10.10 using libxml 2.9.10, the output is: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<colgroup>
<col span="2" />
</colgroup>
</table>
</body>
</html> With Nokogiri v1.12.3 using libxml 2.9.12, the output is: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<table>
<colgroup>
<col span="2"></col>
</colgroup>
</table>
</body>
</html> The method which says: DEFAULT_XHTML = FORMAT | NO_DECLARATION | NO_EMPTY_TAGS | AS_XHTML The It appears as though https://gitlab.gnome.org/GNOME/libxml2/-/commit/dc6f009 has fixed that bug and is now respecting that flag when it's passed in. Ugh! So here's the workaround: pass in your own SaveOptions for now, like this: #! /usr/bin/env ruby
require 'nokogiri'
save_options = Nokogiri::XML::Node::SaveOptions::DEFAULT_XHTML & ~Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS
doc = Nokogiri::HTML("<html><body><table><colgroup><col span='2'>")
puts doc.to_xhtml(save_with: save_options) I will get a fix into the next patch release of Nokogiri that omits this flag to restore the old behavior by default. I'm going to comment on the upstream issue you opened (thank you for that!) letting them know that this is a Nokogiri issue, but giving some of this additional context. |
Also, re: your NOBLANKS question, the https://nokogiri.org/tutorials/parsing_an_html_xml_document.html#parse-options HTML5 parsing was taken from Nokogumbo and is explained here: https://nokogiri.org/tutorials/parsing_an_html5_document.html#parsing-options TL;DR there aren't as many options available in gumbo as there are in libxml2. |
@flavorjones wow! such a huge amount of investigation! thank you so much, and for providing a work around. My diffs on my files now look like:
(Note the missing space between the self-closing element name). This is absolutely not an issue, but before i go and rerun all my snapshots to use this new format, do you anticipate the patch fix will go back to the style with the space (pre 1.11.4), or without (workaround)? Regarding the above reply to the parsing options stuff, am i reading this wrong:
These three methods appear to accept 4 args (and, while im not using a valid option), im getting the error that ive providing too many arguments
|
@haxxxton The trailing space in self-closing tags appears to be gone for good in libxml 2.9.11 and later, I'd suggest adopting that in any tests you have. (As an aside, I've found it's more robust to test semantic aspects of the markup whenever possible, instead of comparing the serialized text string.) Regarding the HTML5 method signatures (which are a very recent addition to Nokogiri via merger with Nokogumbo, and we do need to spruce up the API docs as well as the tutorial), the final argument must be a Hash of options. So this won't work: Nokogiri::HTML5("<div>", nil, nil, 1234) but these will be fine: puts Nokogiri::HTML5("<div>", nil, nil, max_errors: 1234)
puts Nokogiri::HTML5("<div>", nil, max_errors: 1234)
puts Nokogiri::HTML5("<div>", max_errors: 1234) Hope that helps! |
Commit 1d06b4f introduced NO_EMPTY_TAGS into SaveOptions::DEFAULT_XHTML which libxml2 ignored due to a long-standing bug in serialization. libxml2 v2.9.11 fixed that serialization bug (https://gitlab.gnome.org/GNOME/libxml2/-/commit/dc6f009) and started paying attention to the NO_EMPTY_TAGS save option, resulting in seeing output containing, e.g. `<col></col>` instead of `<col/>`. This commit updates the default XHTML save options to drop the NO_EMPTY_TAGS flag, restoring this behavior. Closes #2324
This is now fixed on |
What problem are you trying to solve?
We have been using nokogiri in combination with approvals files in our rspecs. We've just version bumped up to 1.11.4 and found that all our self-closing tags have changed to include closing tags. Is there a configuration option im missing, or is this the way things should be now? (is this changed in a later version is should move to more recently?)
Please show your code!
Note the
<col />
to<col></col>
Environment
The text was updated successfully, but these errors were encountered: