improve ocropy processor #8
Conversation
- add proper error handling
- use proper temporary files
- re-introduce binarization (was commented out)
- add sanity checks (from ocropy CLI)
- make de-warping optional
- abolish temporary files altogether: keep converting between Pillow and array formats in memory
- make logger available to all functions
- make binarization method and dewarping selectable via ocrd-tool parameters
- add binarization method from original ocropus-nlbin (including local whitelevel estimation and de-skewing)
- calculate OCR-GT distances while processing and show CER per input file
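The "no temporary files" item above amounts to converting between Pillow images and NumPy arrays entirely in memory. A minimal sketch of what such helpers could look like (function names are illustrative, not the processor's actual API):

```python
import numpy as np
from PIL import Image

def pil_to_array(image):
    """Convert a PIL image to a grayscale float array in [0, 1] (in memory)."""
    return np.asarray(image.convert("L"), dtype=np.float64) / 255.0

def array_to_pil(array):
    """Convert a float array in [0, 1] back to an 8-bit grayscale PIL image."""
    return Image.fromarray((np.clip(array, 0.0, 1.0) * 255).astype(np.uint8))
```

Keeping both representations in memory avoids the write/read round trips through the filesystem that the earlier temporary-file approach required.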
I was not sure whether ocropy should do all the image preprocessing. Shouldn't there be other steps in the OCR-D workflow that improve the image quality for the later OCR step?
I'm OK to merge this.
I agree. But where are they? So far I can only see the binarization in ocrd_kraken, which does no deskewing (and deskewing seems to be important with the OCR-D GT), and probably ocrd_olena. Neither is usable right now. Also, some steps (like dewarping) depend on the OCR model in use.
Maybe we should open an issue about a workable pipeline for GT – but where?
No idea. Maybe core?
I tried applying your Python 3 port / workspace processor of ocropy, but the results were really bad on OCR-D GT. (I always use `OCR-D-GT-SEG-LINE` with different models, like `LatinHist-98000.pyrnn.gz` from chreul, `incunabula-00184000.pyrnn.gz` from GT4HistOCR, or `fraktur-jze.pyrnn.gz` from jze. Those models are reported to achieve CER < 10%, and I can reproduce that when applying them to the test files in GT4HistOCR, which are properly deskewed, cropped, and binarized / flattened. But I get CER >> 10% when applying them to our GT with your processor.)

While investigating, I found that many models expect dewarping to be disabled, so I added this option to the processor. Also, the GT data are still raw, so I re-enabled binarization and made the actual method selectable. I got the best results when I added the full binarization from `ocropus-nlbin`, which also includes deskewing. Now the line images extracted during preprocessing look a lot more like the data in GT4HistOCR (although the threshold parameters could be optimised a bit). Recognition results are also much better than before, but I still do not get CER < 10%.

The most prominent difference in our line images is that large segments of the neighbouring lines still appear, so I guess one needs to re-crop after deskewing (but I don't know how to do that). But it might also be that the GT segmentation is the culprit (it seems to be off vertically).
Other improvements here are: error handling, ocropy sanity checks, no temporary files and CER calculation.
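The CER figures quoted in this thread are edit distance between OCR output and GT text, normalized by GT length. A minimal self-contained sketch (the processor's actual implementation may differ):

```python
def levenshtein(a, b):
    """Character-level edit distance via the standard DP, two rows at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ocr_text, gt_text):
    """Character error rate: edit distance normalized by GT length."""
    return levenshtein(ocr_text, gt_text) / max(len(gt_text), 1)
```

With this definition, "CER < 10%" means fewer than one character edit per ten GT characters.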
I hope you find these useful. Please let me know if you require further changes or splitting into smaller commits. Also, I would be happy to share my measurements and sample images.
Review requested – @finkf @wrznr @kba?