version > 1.4.4 produces duplicate elements when using Nokogiri::HTML with an invalid HTML doc #478
Comments
Hello! Thanks for asking this question! However, without more information, Please provide us with:
For more information about requesting help or reporting bugs, please Thank you so much! |
Sorry, locally:
https://gist.github.com/1042949 is a simple test describing the behavior, and it failed locally, on heroku, and another host. Note that visual inspection of the HTML here shows that Heroku's
The other host:
|
I wrote a test that reproduces this issue. Test is here: https://gist.github.com/1049877 |
I bisected the issue using the test above. Here is my bisection result:
$ git bisect bad
:100644 100644 b521ce3a023941680e78ea2ed9426862d2bc7803 e4b30b3efed08406e18267acd997b2db092fd338 M CHANGELOG.rdoc |
@knu -- please take a look at 984a554 -- Nokogiri::XML::Document#read_io silently discards IO errors in order to avoid a memory leak. It looks like c39eb4e needs IO exceptions when reading HTML files. One option is to rewrite c39eb4e so that it doesn't require IO exceptions. Another option is to rewrite 984a554 so that exceptions can occur without leaking memory. |
I reverted 984a554 and my test above still fails. Guess it isn't caused by the swallowing of exceptions after all (but do keep that in mind!). |
OK, I'll look into this later today. |
I submitted a pull request that fixes the issue here: https://github.com/tenderlove/nokogiri/pull/481 |
@ender672 thanks for the followup. |
When using version 1.4.4, the following produces the correct results:
1.4.5 yielded duplicates, as well as 1.4.6. I did not try 1.4.4.1 or 1.4.4.2.
I suspect it has to do with validity; the page does not produce valid HTML because it lacks <html> and <body> tags.
Source is here: http://pastie.org/2111018
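For anyone who wants to check their own installed version, here is a minimal sketch of the kind of check described above. It is not the reporter's original script or page: the markup in `HTML_WITHOUT_WRAPPER` and the helper names `duplicated` and `paragraph_texts` are hypothetical, chosen only to illustrate parsing a document that lacks <html> and <body> tags and looking for repeated elements.

```ruby
# Hypothetical input: a document with no <html> or <body> wrapper,
# similar in spirit (not in content) to the page linked above.
HTML_WITHOUT_WRAPPER = "<p>first</p>\n<p>second</p>\n<p>third</p>\n"

# Hypothetical helper: returns each value that occurs more than once,
# in order of first appearance.
def duplicated(values)
  values.select { |v| values.count(v) > 1 }.uniq
end

# Hypothetical helper: parses the markup with Nokogiri::HTML and
# returns the text of every <p> element the parser produced.
def paragraph_texts(html)
  require "nokogiri" # loaded lazily so `duplicated` works without the gem
  Nokogiri::HTML(html).css("p").map(&:text)
end

# Usage (requires the nokogiri gem):
#   texts = paragraph_texts(HTML_WITHOUT_WRAPPER)
#   puts "duplicates: #{duplicated(texts).inspect}"
```

On an unaffected version the list of duplicates should presumably be empty, since each paragraph appears once in the input. Whether this particular markup triggers the bug on 1.4.5 or 1.4.6 is an assumption and has not been verified here.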