improve ocropy processor #8

bertsky · 2019-04-24T10:40:43Z

I tried applying your Python 3 port / workspace processor of ocropy, but the results on OCR-D GT were really bad. (I always use OCR-D-GT-SEG-LINE with different models, such as LatinHist-98000.pyrnn.gz from chreul, incunabula-00184000.pyrnn.gz from GT4HistOCR, or fraktur-jze.pyrnn.gz from jze. Those models are reported to achieve CER < 10%, and I can reproduce that when applying them to the test files in GT4HistOCR, which are properly deskewed, cropped and binarized / flattened. But I get CER >> 10% when applying them to our GT with your processor.)

On investigation, I found that many models expect dewarping to be disabled, so I added an option for this to the processor. Also, the GT data are still raw, so I re-enabled binarization and made the method selectable. I got the best results when I added the full binarization from ocropus-nlbin, which also includes deskewing. Now the line images produced by preprocessing look a lot more like the data in GT4HistOCR (although the threshold parameters could be optimised a bit). Recognition results are also much better than before, but I still do not get CER < 10%.

The most prominent difference of our line images is that large segments of the neighbouring lines still appear, so I guess one needs to re-crop after deskewing (but I don't know how to do that). But it might also be that the GT segmentation is the culprit (it seems to be off vertically).
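One conceivable way to re-crop after deskewing is to look at the horizontal projection profile of the binarized line image and keep only the band of rows that carries the most ink. The sketch below is not part of this PR and is only an assumption about what might work; the function name `crop_to_main_band` and the `min_ink` parameter are hypothetical. It will fail when a neighbouring line touches the main line without a gap of background rows in between.

```python
def crop_to_main_band(image, min_ink=1):
    """Find the vertical band of a binarized line image that holds the most ink.

    image: list of rows, each a list of 0 (background) / 1 (ink).
    min_ink: hypothetical threshold; rows with less ink count as background.
    Returns (top, bottom) row indices, bottom exclusive.
    """
    # Horizontal projection profile: amount of ink per row.
    profile = [sum(row) for row in image]

    # Collect contiguous runs of rows that contain ink.
    bands = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y
        elif ink < min_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(profile)))

    # No ink at all: keep the full image height.
    if not bands:
        return 0, len(image)

    # The main text line is assumed to be the band with the largest ink mass;
    # fragments of neighbouring lines are assumed to be smaller.
    return max(bands, key=lambda b: sum(profile[b[0]:b[1]]))


# Toy example: a fragment of the previous line on top, the main line in
# the middle, and a stray descender fragment at the bottom.
line = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
top, bottom = crop_to_main_band(line)
cropped = line[top:bottom]
```

With a real image one would apply the returned row range to the array or Pillow image after deskewing, possibly with a small margin so that ascenders and descenders are not cut off.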

Other improvements here are: error handling, ocropy sanity checks, no temporary files and CER calculation.

I hope you find these useful. Please let me know if you require further changes or splitting into smaller commits. Also, I would be happy to share my measurements and sample images.

review requested – @finkf @wrznr @kba ?

- add proper error handling
- use proper temporary files
- re-introduce binarization (was commented)
- add sanity checks (from ocropy CLI)
- make de-warping optional

- abolish temporary files altogether: keep converting between pillow and array formats in memory
- make logger available to all functions
- make binarization method and dewarping selectable via ocrd-tool parameters
- add binarization method from original ocropus-nlbin (including local whitelevel estimation and de-skewing)
- calculate OCR-GT distances while processing and show CER per input file
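To make the last item concrete: the character error rate (CER) of a file can be computed as the summed edit distance between OCR and GT lines divided by the total number of GT characters. The following is a minimal, self-contained sketch under that assumption. It is not the code from this PR, and the function names `levenshtein` and `cer` are chosen here for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit cost per edit)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # drop a character of a
                current[j - 1] + 1,            # add a character of b
                previous[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


def cer(ocr_lines, gt_lines):
    """Character error rate over paired OCR and GT lines.

    Assumes both lists are aligned line by line; returns 0.0 when the
    GT is empty to avoid a division by zero.
    """
    total_distance = sum(levenshtein(o, g) for o, g in zip(ocr_lines, gt_lines))
    total_length = sum(len(g) for g in gt_lines)
    return total_distance / total_length if total_length else 0.0


# Toy example with one substitution in a four-character GT line.
rate = cer(["Hauf"], ["Haus"])
```

Normalizing by the GT length (rather than by the OCR length or the longer of the two) is one common convention; other definitions give slightly different numbers, so the choice should be stated whenever CER values are compared.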

bertsky · 2019-04-24T11:17:12Z

weigel_gnothi02_1618 / page 0001 / TextRegion_1488379719413_342 / tl_21
- line image before this PR:
- line image with nlbin (ocropy binarization and deskewing):
weigel_gnothi02_1618 / page 0001 / TextRegion_1488379733255_361 still shows strange cropping in OCR-D-GT-SEG-LINE:

ocrd_cis/ocropy/recognize.py

finkf · 2019-04-24T11:24:32Z

I was not sure if ocropy should do all the image preprocessing. Shouldn't there be other steps in the ocr-d workflow that improve the image quality for the later ocr-step?

finkf · 2019-04-24T11:33:02Z

I'm OK to merge this

bertsky · 2019-04-24T11:33:10Z

> I was not sure if ocropy should do all the image preprocessing. There should be other steps in the ocr-d workflow that improve the image quality for the later ocr-step.

I agree. But where are they? So far I can only see the binarization in ocrd_kraken, which does no deskewing (and that seems to be important with OCR-D GT), and probably ocrd_olena. Neither is usable right now. Also, some steps (like dewarping) depend on the OCR model being used.

bertsky · 2019-04-24T11:37:53Z

Maybe we should open an issue about a workable pipeline for GT – but where?

finkf · 2019-04-24T11:46:38Z

No idea. Maybe core?

bertsky added 3 commits April 23, 2019 18:42

fix requirements

3ad8d55

improve ocrd-cis-ocropy-recognize:

77768ef

- add proper error handling
- use proper temporary files
- re-introduce binarization (was commented)
- add sanity checks (from ocropy CLI)
- make de-warping optional

finkf reviewed Apr 24, 2019

ocrd_cis/ocropy/recognize.py (outdated)

finkf merged commit 12ea2f9 into cisocrgroup:dev Apr 25, 2019
