OCR Test failure #4

jstuczyn · 2017-06-01T14:20:16Z

This might be a consequence of resolving issue #3: the build does not finish successfully because it fails at the test stage. The assertion assertTrue(parsedString.contains("Father or mother")) in testParseRequiringOCR inside PDFPreprocessorParserTest.java fails. Instead of "Father or mother", the OCR'd document contains the string "Father er mother".
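To illustrate why a single misrecognised character is enough to break the check, here is a minimal sketch of the same exact-substring comparison. The class name OcrAssertionSketch and the helper containsPhrase are hypothetical and not taken from the repository; only the two strings come from this report.

```java
public class OcrAssertionSketch {

    // Hypothetical helper mirroring the check in the failing test:
    // an exact, case-sensitive substring match on the parsed text.
    static boolean containsPhrase(String parsedString, String phrase) {
        return parsedString.contains(phrase);
    }

    public static void main(String[] args) {
        String expected = "Father or mother";

        // String reported in this issue as the actual OCR output.
        String ocrOutput = "Father er mother";

        // The match fails because "or" was recognised as "er".
        System.out.println("OCR output matches: " + containsPhrase(ocrOutput, expected));

        // A correctly recognised document would pass the same check.
        System.out.println("Correct text matches: " + containsPhrase("Father or mother", expected));
    }
}
```

Because the comparison is exact, the test is sensitive to any change in OCR quality, which is probably why it started failing without any change to the test itself.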


jstuczyn · 2017-06-15T08:36:10Z

I've compared the outputs produced by:

tesseract 3.03 
with leptonica-1.70

and

tesseract 3.04.01
with leptonica-1.73

For some reason the output of the first one (the older version) seems to be of better quality, with fewer misspelled words and fewer wrongly recognised characters. This might be due to some non-obvious tesseract configuration in the first setup that changed with the fresh installation of the newer version.

The test is preventing cogstack from building, possibly because of a tesseract misconfiguration (#4)

afolarin · 2017-09-27T15:56:52Z

I suspect this is a result of #3
@jstuczyn I think in any case #30 is likely to close this?

lrog · 2018-11-26T09:05:58Z

Using the new version of Tesseract 4.0 solves this issue -- the text is recognised correctly.
The new version of tesseract is already in the dev branch via PR #65.
Therefore, this test has been re-enabled in dev in commit 2c3bdd0.

lrog closed this as completed Nov 26, 2018
