-
-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug] The strict
parsing mode does not raise for invalid HTML documents
#2130
Comments
@dgollahon Thank you for submitting such a clear issue for this! And I'm sorry that you're having trouble. I'll get some time later today to reproduce what you're seeing and dig in. My gut is that this is libxml2 acting inconsistently, but I will definitely dig in and tell you what I find. |
@dgollahon Thanks for your patience. I spent some time tonight looking into this, and I'll try to explain what's going on. TL;DR: Yeah, this is a bug in the CRuby implementation (but not the JRuby one!). CRuby / libxml2 HTMLparser.cThe logic used in the CRuby/libxml2 implementation for raising an exception on HTML parse errors is as follows:
Libxml2's HTML parser considers almost nothing to be fatal and nearly always recovers -- regardless of what parse options are passed in. The In contrast, if we look at libxml2's XML parser in xmlParseDocument(ctxt);
if ((ctxt->wellFormed) || ctxt->recovery)
ret = ctxt->myDoc;
else {
ret = NULL;
/* ... some code elided ... */
}
/* ... some code elided ... */
return (ret); That is, unless the parser found zero errors ( JRuby / NekoHTMLHere's where it gets messy. The JRuby implementation does not rely on the return value from the parse function to determine success, and instead uses different error handlers (strict or non-strict) to control whether to raise an exception. As a result, it behaves consistently across both XML and HTML parsers, and matches what your expectations are in this issue. Why has nobody reported this before?Why has nobody reported this before? I think the honest truth is that there is so much broken HTML out there in the world that the overwhelming majority of use cases need the default behavior -- to act like a browser and try to recover from parse errors as best we can. Put another way, strict parsing of an arbitrary HTML document from the internet is highly likely to fail. How do we resolve this?I think the right thing to do here is to change the behavior of the HTML::Document parse functions to raise an exception if any parse error is encountered; but this is likely to be a breaking change for some folks and so I want to think hard about putting it into a patch release. Here are some tests that all pass on JRuby: require "helper"
class Test2130 < Nokogiri::TestCase
describe "HTML::Document.parse" do
describe "read memory" do
let(:input) { "<html><body><div" }
describe "strict parsing" do
let(:parse_options) { Nokogiri::XML::ParseOptions.new.strict }
it "raises exception on parse error" do
assert_raises Nokogiri::SyntaxError do
Nokogiri::HTML.parse(input, nil, nil, parse_options)
end
end
end
describe "default options" do
it "does not raise exception on parse error" do
doc = Nokogiri::HTML.parse(input)
assert_operator(doc.errors.length, :>, 0)
end
end
end
describe "read io" do
let(:input) { StringIO.new("<html><body><div") }
describe "strict parsing" do
let(:parse_options) { Nokogiri::XML::ParseOptions.new.strict }
it "raises exception on parse error" do
assert_raises Nokogiri::SyntaxError do
Nokogiri::HTML.parse(input, nil, nil, parse_options)
end
end
end
describe "default options" do
it "does not raise exception on parse error" do
doc = Nokogiri::HTML.parse(input)
assert_operator(doc.errors.length, :>, 0)
end
end
end
end
end But fail on CRuby in embarrasingly bad ways -- including
|
Thanks for the update! Interesting that it works on JRuby. Regardless of the fix path, it seems to me that MRI and JRuby should have the same behavior. Understood re: broken HTML and the risk of breaking users. I am sure some people like me have |
I agree and I'll try to get this into the next release.
Neither libxml2 nor Nokogiri has ever behaved any differently. It's been like this since the first release of Nokogiri. |
This results in (different) failing tests in JRuby and CRuby. Related to #2130
The JRuby HTML parser was inconsistent with the CRuby implementation, and didn't use the common-sense interpretation of "norecover". Related to #2130
If the RECOVER parse option is not set, then the HTML.parse (and friends) will now raise an exception corresponding to the first error encountered. We do this in read_io and read_memory because libxml2's HTML parser does not obey RECOVER/NORECOVER. Closes #2130
If the RECOVER parse option is not set, then the HTML.parse (and friends) will now raise an exception corresponding to the first error encountered. We do this in read_io and read_memory because libxml2's HTML parser does not obey RECOVER/NORECOVER. Closes #2130
PR #2150 will address this. |
This results in (different) failing tests in JRuby and CRuby. Related to #2130
The JRuby HTML parser was inconsistent with the CRuby implementation, and didn't use the common-sense interpretation of "norecover". Related to #2130
…mode HTML "strict" mode parsing --- **What problem is this PR intended to solve?** There are quite a few things being addressed in this PR, but the most significant bit is consistently obeying the `recover` parse option for HTML documents. As #2130 points out, the CRuby/libxml2 implementation was completely ignoring the `recover` parse option (also know as the `strict` option) for HTML documents. This PR makes the `recover` option behave identically for HTML documents as it does for XML documents. Related, though, the JRuby implementation was incorrectly applying the `recover` parse option in HTML documents, instead using `noerror` and `nowarning` to determine whether to raise an exception. This has been brought in line with the behavior described above. Also related, the `EncodingReader` class which detects encoding in HTML documents was silently swallowing document syntax errors if they occurred in the first "chunk" read from an IO object. This has also been fixed. This PR also introduces some smaller changes: - make JRuby and CRuby implementations consistent in how they handle comparing `XML::Node` to an `XML::Document` using the `<=>` operator - introduce minitest-reporters to the main test suite - skip some irrelevant tests, and restructure others to be faster - fix the annoying JRuby encoding test that fails on some Java JDKs - eating away at the edges of code formatting as I touched some of these rarely-touched files **Have you included adequate test coverage?** Yes. **Does this change affect the behavior of either the C or the Java implementations?** Yes, this changes the behavior of the `strict` (a.k.a. `norecover`) parse option in parsing HTML documents. I've chosen to classify this as a "bugfix that happens to change behavior that people may be depending upon" and so am introducing potentially breaking behavior into a minor release. Apologies to anyone who's inconvenienced by that.
Please describe the bug
Hi,
According to these docs it should be possible to enter a strict parsing mode that raises on malformed XML. This works for XML. It seems implied that
strict
can be set the same as withNokogiri::XML
, here. Unfortunately, I can't figure out anything that will causeNokogiri::HTML
with strict config to raise.Hopefully I have not missed any documentation or open issues related to this, but all I ran across was #1806 which was closed for lack of detail. If this is the intended behavior, I think the differences between XML and HTML parse options should be documented more clearly.
Help us reproduce what you're seeing
Expected behavior
I would expect
Nokogiri
to raise a syntax error for HTML documents but nothing is raised, unlike with the XML parsing mode. The#errors
array seems to be populated in either case.Environment
Additional context
I am writing a small scraper and I would like to bail if pages I am trying to parse are corrupt/invalid.
Thanks!
The text was updated successfully, but these errors were encountered: