-
-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Entity resolving in Java should be consistent with C-Nokogiri #704
Comments
@tenderlove, @flavorjones, @yokolet what do you think ? |
I'm aware that current Nokogiri test doesn't cover resolving entities problems. I noticed pure Java version doesn't resolve entities well enough. So, I agree with adding more test in such area. I think @jvshahid's fix looks reasonable. At this moment, fix_entity_resolving branch added 4 failures on 1.8 mode and 5 failures/1 error on 1.9 mode. Am I correct? Before merging to master, we should fix at least 4Fs on 1.8 mode. This is because linux distributers will have a trouble to redistribute pure Java Nokogiri. I'll also try to fix those failures. @jvshahid Perhaps, we can use Apache Xerces' internal API more. Before, we tried to avoid using such internal API since we didn't want to rely on Xerces too much. However, pure Java Nokogiri needs NekoHtml to parse HTML document and has no other choice besides NekoHtml. NekoHtml has a dependency to Apache Xerces. Perhaps, we don't have a reason not to use Xerces' internal API. One more thought about using Xerces' internal API. Currently, XInclude implementation has conflict with NekoDTD setting and some of Nokogiri's XInclude related methods don't work. We might need to eliminate NekoDTD. |
@yokolet thanks for your input. I'm aware of the failures on the new branch, I just wanted to document these inconsistencies in a test but I'll work on fixing them soon. Regarding Xerces, I like the idea of using XNI. This way we can DRY the code a little bit, by having most of the code using XNI and different implementations for the DOM, SAX and Reader parsers on top of it. It was annoying having to understand the entity resolving problem in the three different contexts (DOM, SAX and Reader) which have different implementations. I don't know if this makes any sense but it's just my initial thoughts and will probably change as I work my way though the code base. |
@yokolet the broken tests are fixed on the branch. This should fix entity resolving/dtd validation for the XMLReader. Can you take a final look before merging the changes to master. |
Awesome! We don't have any failures and errors on JRuby 1.8 mode. Thanks for the hard work. However, we should rethink the commit 078d400 . This commit added 2 failures, an old failure turned to error on 1.9 mode. This is because 1.9 mode has more tests. Three F/Es are all related HTML parsing. The commit changed entity resolver, so HTML paring had some trouble. |
One more comment. Yes, we have individual contexts for DOM, SAX and XMLReader. I don't have clear answer about why we have three. Probably, historical reason is. Pure Java version has a lot to improve implementation in terms of consistency or DRY. As long as change doesn't break any existing tests, also, as long as performance doesn't get worse, we should fix those. |
We're back to the same number of F/E that we have on master (which should go down once the 19 branch is ready). It turns out that the fix for entity resolving exposed a different bug, we've never used the io that was passed in (since they'll very likely have a path and the code will return an InputStream after setting the path only). The problem will occur if we tried to detect the encoding and then use that io stream to parse the document, because the stream will be missing the first 1K of data. The commit message has more details as well as the comments in NokogiriEncodingReaderWrapper.java. |
Wow, that's so embarrassing bug and so nice and clean fix! Looking at your comment, it seems pure Java version has worked barely passing tests. There might be this sort of bug(s) in other area(s). Huge thanks. The fix_entity_resolving branch should be merged to master. Would you (@jvshahid) merge it? |
Thanks @yokolet for reviewing the changes. I merged it to master. @flavorjones can we package a release candidate. |
I believe this issue is fixed. Otherwise, please reopen the issue . |
Yes, this issue should be closed. Do you remember the fix was in version 1.5.5? I'll update CHANGELOGs. |
Yes, it should be in 1.5.5 |
Updated. |
This is just a reminder for myself of what needs to be done regarding entity resolving. This started as an investigation for issues #674 and #703.
Since the changes 'fix_entity_resolving' branch make it possible to get remote DTD, we have to make sure that ParseOptions::NONET is honored. To my knowledge all places check for the nonet option, except the following:
The text was updated successfully, but these errors were encountered: