Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entity resolving in Java should be consistent with C-Nokogiri #704

Closed
jvshahid opened this issue Jun 17, 2012 · 13 comments
Closed

Entity resolving in Java should be consistent with C-Nokogiri #704

jvshahid opened this issue Jun 17, 2012 · 13 comments
Assignees

Comments

@jvshahid
Copy link
Member

This is just a reminder for myself of what needs to be done regarding entity resolving. This started as an investigation for issues #674 and #703.

  1. The Dom Parser handles DTD validation and entity resolving in both the C and Java version (this is still on a branch 'fix_entity_resolving' rev 7272ba1)
  2. The SAX parser doesn't handle validation or entity resolving in either implementation, is that the expected behavior ?
  3. The XML reader doesn't handle DTD validation or entity resolving in Java. (There are broken tests on 'fix_entity_resolving' rev 75a4933)
  4. Investigate what needs to be done for entity resolving in html/xhtml parsers.

Since the changes 'fix_entity_resolving' branch make it possible to get remote DTD, we have to make sure that ParseOptions::NONET is honored. To my knowledge all places check for the nonet option, except the following:

  • Nokogiri::getExternalSubset returns null, would that cause the parser to connect to the internet to resolve the baseURI
@ghost ghost assigned jvshahid Jun 17, 2012
@jvshahid
Copy link
Member Author

@tenderlove, @flavorjones, @yokolet what do you think ?

@yokolet
Copy link
Member

yokolet commented Jun 18, 2012

I'm aware that current Nokogiri test doesn't cover resolving entities problems. I noticed pure Java version doesn't resolve entities well enough. So, I agree with adding more test in such area. I think @jvshahid's fix looks reasonable.

At this moment, fix_entity_resolving branch added 4 failures on 1.8 mode and 5 failures/1 error on 1.9 mode. Am I correct? Before merging to master, we should fix at least 4Fs on 1.8 mode. This is because linux distributers will have a trouble to redistribute pure Java Nokogiri. I'll also try to fix those failures.

@jvshahid Perhaps, we can use Apache Xerces' internal API more. Before, we tried to avoid using such internal API since we didn't want to rely on Xerces too much. However, pure Java Nokogiri needs NekoHtml to parse HTML document and has no other choice besides NekoHtml. NekoHtml has a dependency to Apache Xerces. Perhaps, we don't have a reason not to use Xerces' internal API.

One more thought about using Xerces' internal API. Currently, XInclude implementation has conflict with NekoDTD setting and some of Nokogiri's XInclude related methods don't work. We might need to eliminate NekoDTD.

@jvshahid
Copy link
Member Author

@yokolet thanks for your input. I'm aware of the failures on the new branch, I just wanted to document these inconsistencies in a test but I'll work on fixing them soon.

Regarding Xerces, I like the idea of using XNI. This way we can DRY the code a little bit, by having most of the code using XNI and different implementations for the DOM, SAX and Reader parsers on top of it. It was annoying having to understand the entity resolving problem in the three different contexts (DOM, SAX and Reader) which have different implementations. I don't know if this makes any sense but it's just my initial thoughts and will probably change as I work my way though the code base.

@jvshahid
Copy link
Member Author

@yokolet the broken tests are fixed on the branch. This should fix entity resolving/dtd validation for the XMLReader. Can you take a final look before merging the changes to master.

@yokolet
Copy link
Member

yokolet commented Jun 20, 2012

Awesome! We don't have any failures and errors on JRuby 1.8 mode. Thanks for the hard work.

However, we should rethink the commit 078d400 . This commit added 2 failures, an old failure turned to error on 1.9 mode. This is because 1.9 mode has more tests.

Three F/Es are all related HTML parsing. The commit changed entity resolver, so HTML paring had some trouble.
I'll look at that, but if you have a chance, please review the change. Some users are likely to hit the problem since it is HTML.

@yokolet
Copy link
Member

yokolet commented Jun 20, 2012

One more comment.

Yes, we have individual contexts for DOM, SAX and XMLReader. I don't have clear answer about why we have three. Probably, historical reason is. Pure Java version has a lot to improve implementation in terms of consistency or DRY. As long as change doesn't break any existing tests, also, as long as performance doesn't get worse, we should fix those.

@jvshahid
Copy link
Member Author

We're back to the same number of F/E that we have on master (which should go down once the 19 branch is ready). It turns out that the fix for entity resolving exposed a different bug, we've never used the io that was passed in (since they'll very likely have a path and the code will return an InputStream after setting the path only). The problem will occur if we tried to detect the encoding and then use that io stream to parse the document, because the stream will be missing the first 1K of data. The commit message has more details as well as the comments in NokogiriEncodingReaderWrapper.java.

@yokolet
Copy link
Member

yokolet commented Jun 22, 2012

Wow, that's so embarrassing bug and so nice and clean fix! Looking at your comment, it seems pure Java version has worked barely passing tests. There might be this sort of bug(s) in other area(s).

Huge thanks. The fix_entity_resolving branch should be merged to master. Would you (@jvshahid) merge it?

@jvshahid
Copy link
Member Author

Thanks @yokolet for reviewing the changes. I merged it to master. @flavorjones can we package a release candidate.

@jvshahid
Copy link
Member Author

jvshahid commented Jul 7, 2012

I believe this issue is fixed. Otherwise, please reopen the issue .

@jvshahid jvshahid closed this as completed Jul 7, 2012
@yokolet
Copy link
Member

yokolet commented Jul 8, 2012

Yes, this issue should be closed.

Do you remember the fix was in version 1.5.5? I'll update CHANGELOGs.

@jvshahid
Copy link
Member Author

jvshahid commented Jul 8, 2012

Yes, it should be in 1.5.5

@yokolet
Copy link
Member

yokolet commented Jul 8, 2012

Updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants