This repository has been archived by the owner on Jan 13, 2023. It is now read-only.

OCR Test failure #4

Closed
jstuczyn opened this issue Jun 1, 2017 · 3 comments

Comments

@jstuczyn
Contributor

jstuczyn commented Jun 1, 2017

This might be a consequence of resolving issue #3: the build does not finish successfully because it fails at the test stage. The assertion assertTrue(parsedString.contains("Father or mother")) in testParseRequiringOCR in PDFPreprocessorParserTest.java fails. Instead of "Father or mother", the OCR'd document contains the string "Father er mother".
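To make the failure mode concrete, here is a minimal, self-contained sketch of why an exact `contains` check breaks on a single misrecognised character, alongside one possible tolerant check based on edit distance. The class name `OcrAssertionSketch`, its methods, and the sample OCR string are hypothetical and are not part of the CogStack code; the tolerant check is only an illustration, not how this issue was eventually resolved.

```java
public class OcrAssertionSketch {

    // Exact substring check, the same kind of check the failing assertion uses.
    static boolean containsExact(String parsedText, String phrase) {
        return parsedText.contains(phrase);
    }

    // Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if any window of the phrase's length is within maxEdits of the phrase.
    static boolean containsApprox(String parsedText, String phrase, int maxEdits) {
        int n = phrase.length();
        for (int start = 0; start + n <= parsedText.length(); start++) {
            String window = parsedText.substring(start, start + n);
            if (editDistance(window, phrase) <= maxEdits) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical OCR output containing the misread phrase from this issue.
        String ocrOutput = "Name of Father er mother";
        System.out.println(containsExact(ocrOutput, "Father or mother"));     // false
        System.out.println(containsApprox(ocrOutput, "Father or mother", 1)); // true
    }
}
```

A tolerant check like this trades strictness for robustness across OCR engine versions, so it can also hide real regressions; whether that trade-off is acceptable is a judgment call for the test's owner.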

@jstuczyn
Contributor Author

I've compared the outputs produced by:

tesseract 3.03 
with leptonica-1.70

and

tesseract 3.04.01
with leptonica-1.73

For some reason, the output of the first one (the older version) seems to be of better quality, with fewer misspelled words and fewer wrongly recognised characters. This might be due to some non-obvious tesseract configuration in the first setup that changed with the fresh installation of the newer version.
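One rough way to put a number on "fewer misspelled words" is to count the words in each OCR output that do not appear in a known-good transcript of the same document. The sketch below assumes such a reference text is available; the class `OcrOutputComparison` and its methods are hypothetical and are not taken from the project.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OcrOutputComparison {

    // Split text into lower-cased words, treating anything that is not a letter or digit as a separator.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Count words in the OCR output that never occur in the reference text.
    static int unknownWordCount(String ocrOutput, String reference) {
        Set<String> vocabulary = new HashSet<>(words(reference));
        int count = 0;
        for (String w : words(ocrOutput)) {
            if (!vocabulary.contains(w)) {
                count++;
            }
        }
        return count;
    }
}
```

Running `unknownWordCount` on the output of each tesseract version against the same reference gives two comparable counts. It is a coarse measure: it ignores word order and does not notice dropped words.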

jstuczyn added a commit that referenced this issue Jun 15, 2017

The test is preventing building cogstack which is due to possible tesseract misconfiguration (#4)
@afolarin
Contributor

I suspect this is a result of #3
@jstuczyn I think in any case #30 is likely to close this?

@lrog
Contributor

lrog commented Nov 26, 2018

Using the new version of Tesseract (4.0) solves this issue: the text is recognised correctly.
The new version of Tesseract is already in the dev branch via PR #65.
Therefore, this test has been re-enabled in dev in commit 2c3bdd0.

@lrog lrog closed this as completed Nov 26, 2018