Page.getPlainText broken - PlainTextConverter struggles to discriminate candidate methods and ends in 'VisitorException' #160
Comments
I can confirm that this also affects Windows 10. The stack trace is similar to the one posted by @mawiesne, in a Java 8 environment (Oracle or OpenJDK does not matter). |
Can someone post a bit of code that generates the stack trace above? |
with this implementation as
Will output:
|
Can you help me understand something in your code ? Looking at the |
Basically:
1.) Create a connection to a database. In our case: a MySQL DB containing the Wikipedia Dumps and therefore the wikipedia pages.
2.) I left out the real credentials ;)
3.) Retrieve a page of interest (it does not matter which one).
4.) Try to retrieve the full text via |
Gotcha. Sorry for being a pain; I ask because I use this in the context of json wikipedia. Is the MySQL database populated by downloading and importing SQL files from here https://dumps.wikimedia.org/enwiki/20180320/ (if so, could you let me know which ones), or is there a transformation from the full XML dump into SQL that is done by some CLI tool in advance? |
We make use of the The resulting files are then imported into a MySQL 5.7 installation. For a German version of Wikipedia dumps, we basically use:
as given in the examples section of the how-to. |
Cool, could I get the exact command you used to produce the German (or any other language) dump? I will try to replicate the bug and see if there's an easy fix. |
I updated the code-snippet above to not use internal classes / provided related code to execute it. |
@tgalery Thanks a ton for looking into this! I will upload a dump of a transformed version of the German Wikipedia DB dating from Jan 2018. Stay tuned, the next comment with instructions will follow shortly. |
@mawiesne that would be extremely helpful |
@tgalery Download one or both of the two mysql dumps from here:
Re-Import them on your local dev-machine via:
Same procedure with smaller Spanish (es) version, just exchange 'de' with 'es'. When you decide to use es, you could, for instance, fetch a page such as "Salud". |
Cheers, will give you guys an update as soon as I can. |
Some updates. I've been trying to debug this using the Spanish dump as it's slightly smaller.
Is there something wrong with the Spanish dump I downloaded above? |
Commenting out hibernate auto validation gives me this:
so ... maybe there's something funny with the dump ? |
@tgalery I think I know what went wrong, and I'll provide two modified/fresh dumps on next Monday. UPDATE:
Remove all previous files / imported DBs and re-import. It should work now, as I dumped it from one of our production systems, in which no DB schema errors are present. Again, sorry for any inconvenience. |
It seems to be a problem with the reflection code in Line 361ff
Both classes
extend the same interface classes, which leads to this issue. |
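To illustrate the kind of ambiguity described here, the following is a minimal sketch of reflection-based visitor dispatch. It is not Sweble's or JWPL's actual code: the class names (`Inline`, `Block`, `Text`, `AmbiguousVisitor`, `FixedVisitor`) and the selection logic are hypothetical, chosen only to show how a node type that implements two interfaces can match more than one `visit` overload, and how removing one overload makes the choice unique again.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class AmbiguousVisitDemo {
    // Hypothetical node hierarchy: Text implements two unrelated interfaces.
    public interface Node {}
    public interface Inline extends Node {}
    public interface Block extends Node {}
    public static class Text implements Inline, Block {}

    // Hypothetical visitor with two overloads that both accept a Text node.
    public static class AmbiguousVisitor {
        public String visit(Inline n) { return "inline"; }
        public String visit(Block n) { return "block"; }
    }

    // Same visitor with one overload removed, so only one candidate is left.
    public static class FixedVisitor {
        public String visit(Block n) { return "block"; }
    }

    // Collects every public one-argument visit(..) method whose parameter type accepts the node.
    public static List<Method> candidates(Object visitor, Object node) {
        List<Method> result = new ArrayList<>();
        for (Method m : visitor.getClass().getMethods()) {
            if (m.getName().equals("visit")
                    && m.getParameterCount() == 1
                    && m.getParameterTypes()[0].isInstance(node)) {
                result.add(m);
            }
        }
        return result;
    }

    // Dispatches only if exactly one candidate exists; otherwise reports the ambiguity.
    public static String dispatch(Object visitor, Object node) {
        List<Method> found = candidates(visitor, node);
        if (found.size() != 1) {
            throw new IllegalStateException(
                    "Cannot discriminate between " + found.size() + " candidate visit methods");
        }
        try {
            return (String) found.get(0).invoke(visitor, node);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Text node = new Text();
        System.out.println(candidates(new AmbiguousVisitor(), node).size()); // 2
        System.out.println(dispatch(new FixedVisitor(), node));              // block
        try {
            dispatch(new AmbiguousVisitor(), node);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this simplified model, neither overload is more specific than the other, so a dispatcher that picks methods by parameter type has no basis for choosing between them.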
At Heilbronn University Group we managed to reproduce this bug with the existing test-cases
CI did not complain because of #161 |
Cool, I'm assuming this will be reproducible once #162 gets merged ? |
Yes |
@tgalery Any updates here? :) |
…iscriminate candidate methods and ends in 'VisitorException'
- Fixes this issue by commenting out unused/empty candidate method `public void visit(WtNode n)`.
- Un-ignores and adapts test cases in `PageTest`. This way, `testGetPlainText` can now work correctly. No more ignored tests \o/
- Adds minor fix in `PlainTextConverter` to parse/handle standalone line breaks strings correctly.
- Adapts demo data to the version from 2010 (initial DB import had "*" and multi-line breaks). I screwed them up slightly in June 2018 when bringing #2 to master.
- Simplifies 2/3 `PlainTextConverter` constructors to reduce duplicate code.
Moreover, this kind of fixes #161, as no other problems remain once this commit is merged.
…iscriminate candidate methods and ends in 'VisitorException'
- Addresses LF problems in `PlainTextConverter` on Windows platform, now related test passes.
thx for helping: @rzo1
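The commit message does not say how the line-feed problem was addressed. One common way to make text comparisons independent of the platform's line separator is to normalize all line breaks before comparing. The sketch below shows that idea only; the class `LineBreakNormalizer` and its `normalize` method are hypothetical and not part of JWPL.

```java
public class LineBreakNormalizer {
    // Converts Windows (CRLF) and old Mac (CR) line breaks to a single LF.
    public static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windows = "First line\r\nSecond line";
        String unix = "First line\nSecond line";
        // After normalization both variants compare as equal.
        System.out.println(normalize(windows).equals(normalize(unix))); // true
    }
}
```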
…-VisitorException #160 - Page.getPlainText broken - `PlainTextConverter` struggles to discriminate candidate methods and ends in 'VisitorException'
Finally fixed via PR #185 |
With the introduction of Sweble 3.1.7 to the JWPL 1.2.0-SNAPSHOT line, I can no longer fetch plain text data from Wikipedia backends via `Page.getPlainText`. The stack trace is documented here:
It seems there is a mismatch of method signatures and/or incompatible libraries being used at runtime. I consider this a major bug, as parts of the main functionality are affected. Therefore, this bug should be fixed before releasing JWPL 1.2.0 (Final).
Dependencies involved:
System environment:
Any ideas @ferschke / @reckart ? Can somebody contact the colleagues at FAU Erlangen to investigate this issue?