improve ocropy processor #8

Merged
3 commits merged into cisocrgroup:dev
Apr 25, 2019

Collaborator

bertsky commented Apr 24, 2019

I tried applying your Python 3 port / workspace processor of ocropy, but the results were really bad on OCR-D GT. (I always use OCR-D-GT-SEG-LINE with different models, such as LatinHist-98000.pyrnn.gz from chreul, incunabula-00184000.pyrnn.gz from GT4HistOCR, or fraktur-jze.pyrnn.gz from jze. Those models are reported to achieve CER < 10%, and I can reproduce that when applying them to the test files in GT4HistOCR, which are deskewed, cropped and binarized / flattened properly. But I get CER >> 10% when applying them to our GT with your processor.)

On investigation, I found that many models expect dewarping to be disabled, so I added an option for this to the processor. Also, the GT data are still raw, so I re-enabled binarization and made the actual method selectable. I got the best results when I added the full binarization from ocropus-nlbin, which also includes deskewing. Now the line images extracted from preprocessing look a lot more like the data in GT4HistOCR (although the threshold parameters could be optimised a bit). Recognition results are also much better than before, but I still do not get CER < 10%.
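To make the remark about threshold parameters concrete, here is a minimal sketch of percentile-based level estimation followed by a fixed threshold. It is only an illustration under stated assumptions: the function names, the parameters `lo`, `hi` and `threshold`, and their values are hypothetical and are not taken from this PR, and the flattening and deskewing steps of the full nlbin method are left out entirely.

```python
def percentile(values, q):
    """Simple percentile (q in 0..100): pick the sorted value at the nearest index."""
    ordered = sorted(values)
    index = round(q / 100 * (len(ordered) - 1))
    return ordered[index]


def binarize(gray, lo=5, hi=90, threshold=0.5):
    """Binarize a grayscale image given as rows of floats in [0, 1] (1 = white).

    The lo/hi percentiles estimate the black and white levels (illustrative
    values, not the processor's). Returns rows of 0/1 where 1 marks ink.
    """
    flat = [v for row in gray for v in row]
    black = percentile(flat, lo)
    white = percentile(flat, hi)
    span = white - black
    if span <= 0:
        # Uniform image: no contrast, so treat everything as background.
        return [[0 for _ in row] for row in gray]
    result = []
    for row in gray:
        # Stretch to [0, 1] between the estimated levels, then threshold.
        stretched = (min(1.0, max(0.0, (v - black) / span)) for v in row)
        result.append([1 if s < threshold else 0 for s in stretched])
    return result


if __name__ == "__main__":
    page = [
        [0.9, 0.9, 0.3, 0.9],
        [0.9, 0.35, 0.3, 0.9],
        [0.9, 0.9, 0.9, 0.9],
    ]
    for row in binarize(page):
        print(row)
```

Raising `threshold` or lowering `hi` in such a scheme marks more pixels as ink, which is the kind of tuning the comment above alludes to.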

The most prominent remaining difference in our line images is that large segments of the neighbouring lines still appear, so I guess one needs to re-crop after deskewing (but I don't know how to do that). But it might also be that the GT segmentation is the culprit (it seems to be off vertically).
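One conceivable way to re-crop a deskewed, binarized line image is to keep only the row band that carries the most ink, using a horizontal projection profile. The sketch below is not what this PR does; it assumes that fragments of neighbouring lines are separated from the main line by at least one low-ink row, which will not hold for touching lines. All names (`row_profile`, `dominant_band`, `recrop`) and the `rel_threshold` and `margin` values are hypothetical.

```python
def row_profile(binary):
    """Count foreground pixels (1 = ink) in each row of a binarized line image."""
    return [sum(row) for row in binary]


def dominant_band(profile, rel_threshold=0.2):
    """Return (top, bottom) of the row band holding the most ink; bottom is exclusive.

    A row counts as text when its ink exceeds rel_threshold * max(profile).
    """
    peak = max(profile, default=0)
    if peak == 0:
        return 0, len(profile)  # empty image: nothing to crop
    cutoff = rel_threshold * peak
    best = (0, 0, 0)  # (ink, top, bottom)
    start = None
    # The trailing 0 is a sentinel that closes a band reaching the last row.
    for y, value in enumerate(list(profile) + [0]):
        if value > cutoff and start is None:
            start = y
        elif value <= cutoff and start is not None:
            ink = sum(profile[start:y])
            if ink > best[0]:
                best = (ink, start, y)
            start = None
    return best[1], best[2]


def recrop(binary, margin=1):
    """Crop the image to its dominant text band, padded by `margin` rows."""
    top, bottom = dominant_band(row_profile(binary))
    top = max(0, top - margin)
    bottom = min(len(binary), bottom + margin)
    return binary[top:bottom]
```

If the GT segmentation itself is vertically off, the dominant band may be the wrong line, so a check against the expected line height would be a sensible addition.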

Other improvements here are: error handling, ocropy sanity checks, no temporary files and CER calculation.

I hope you find these useful. Please let me know if you require further changes or splitting into smaller commits. Also, I would be happy to share my measurements and sample images.

review requested – @finkf @wrznr @kba ?

bertsky added 3 commits April 23, 2019 18:42
- add proper error handling
- use proper temporary files
- re-introduce binarization (was commented)
- add sanity checks (from ocropy CLI)
- make de-warping optional
- abolish temporary files altogether:
  keep converting between pillow and array
  formats in memory
- make logger available to all functions
- make binarization method and dewarping
  selectable via ocrd-tool parameters
- add binarization method from original
  ocropus-nlbin (including local whitelevel
  estimation and de-skewing)
- calculate OCR-GT distances while processing
  and show CER per input file
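For reference, a character error rate of this kind is commonly computed as the total edit distance between OCR output and ground truth, divided by the total ground-truth length. The following is a minimal sketch under that assumption; the helper names `levenshtein` and `cer` are hypothetical and need not match what the processor actually implements.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(pairs):
    """Character error rate over (ocr_text, gt_text) pairs.

    Defined here as total edits divided by total ground-truth characters.
    """
    edits = sum(levenshtein(ocr, gt) for ocr, gt in pairs)
    length = sum(len(gt) for _, gt in pairs)
    return edits / length if length else 0.0


if __name__ == "__main__":
    lines = [("Gnothi", "Gnothi"), ("seaut0n", "seauton")]
    print(f"CER: {cer(lines):.2%}")
```

Summing edits and lengths over all lines of a file before dividing weights long lines more heavily than averaging per-line rates would.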
Collaborator Author

bertsky commented Apr 24, 2019

  • weigel_gnothi02_1618 / page 0001 / TextRegion_1488379719413_342 / tl_21

    • line image before this PR:
      [image: tl_21 raw]
    • line image with nlbin (ocropy binarization and deskewing):
      [image: tl_21 nlbin]
  • weigel_gnothi02_1618 / page 0001 / TextRegion_1488379733255_361 still shows strange cropping in OCR-D-GT-SEG-LINE:

      [image: ocrd-tesserocr_line_1488379733304_363]

Contributor

finkf commented Apr 24, 2019

I was not sure if ocropy should do all the image preprocessing. Shouldn't there be other steps in the ocr-d workflow that improve the image quality for the later ocr-step?

Contributor

finkf commented Apr 24, 2019

I'm OK to merge this

Collaborator Author

bertsky commented Apr 24, 2019

I was not sure if ocropy should do all the image preprocessing. Shouldn't there be other steps in the ocr-d workflow that improve the image quality for the later ocr-step?

I agree. But where are they? So far I can only see the binarization in ocrd_kraken (which does no deskewing, although that seems to be important with OCR-D GT) and probably ocrd_olena. Neither is usable right now. Also, some steps (like dewarping) depend on the OCR model being used.

Collaborator Author

bertsky commented Apr 24, 2019

Maybe we should open an issue about a workable pipeline for GT – but where?

Contributor

finkf commented Apr 24, 2019

No idea. Maybe core?

@finkf finkf merged commit 12ea2f9 into cisocrgroup:dev Apr 25, 2019