Accuracy problem 4.0 beta3 -> 4.0 final? #2048
Comments
Thank you for reporting this. I can confirm your results and will look into what caused this regression.
4.0.0-rc1 was the first bad version.
Here is the result of
So the fix for issue #1914 (commit 5fe1390) caused an accuracy regression for other images.
@niksedk: just wondering, how did you build tesseract with vcpkg? As far as I can see, it supports 3.05.02 only...
@zdenop: E.g. like
The problem is related to the input image: the letters are white with a black outline, which is exposed when the alpha channel is removed and replaced with white.
Just a note: removal of the alpha channel was implemented because (some?) PNG images were not processed correctly by tesseract, and PDF creation did it anyway (just for PDF output, not for OCR). If the alpha channel is not removed, tesseract binarizes the image differently (if I understood it correctly, it inverts the input image) and uses that image for OCR. I feel stuck:
So whatever we do, some users will not be satisfied...
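To illustrate the effect described in the comment above, here is a minimal sketch in plain Python of flattening RGBA pixels over an opaque background. The helper `flatten_alpha` and the sample scanline are hypothetical and only model the idea; this is not the code tesseract or Leptonica uses.

```python
def flatten_alpha(pixel, background=(255, 255, 255)):
    """Composite one RGBA pixel (channels 0-255) over an opaque background."""
    r, g, b, a = pixel
    alpha = a / 255.0
    return tuple(
        round(c * alpha + bg * (1.0 - alpha))
        for c, bg in zip((r, g, b), background)
    )


# Hypothetical scanline through one letter stroke:
# transparent, black outline, white fill, black outline, transparent.
row = [
    (0, 0, 0, 0),
    (0, 0, 0, 255),
    (255, 255, 255, 255),
    (0, 0, 0, 255),
    (0, 0, 0, 0),
]

# Over white, the white fill merges with the background and only the
# thin black outline remains visible.
on_white = [flatten_alpha(p) for p in row]

# Over black, the outline merges with the background and the white fill
# stands out as a solid stroke.
on_black = [flatten_alpha(p, background=(0, 0, 0)) for p in row]

print(on_white)
print(on_black)
```

With a white background the letter body disappears into the page, which matches the observation that only the outline is left for the recognizer to work with.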
According to your analysis tesseract is doing the right thing.
This knowledge should be documented somewhere, maybe on the ImproveQuality page.
If the input is not similar to the training data, you should not expect good results.
I added a remark to the ImproveQuality page.
Thx for the info. @zdenop: So black letters without an outline on a white background are likely the way to go?
I don't think this is logical at all.
Now: how do you know whether an image needs this pre-processing if you are building an automated system that receives all kinds of images as input? Tesseract should figure out what to do; outlined text is text too.
There are a lot of really weird PNG images out there. I was too afraid to touch how recognition works when I was looking at the alpha channel problem during PDF generation. Some have text shapes defined entirely in the alpha channel. It was never clear to me what the best thing to do is; at one point I had considered running recognition on each color channel separately, including alpha.
@cypherbits: tesseract is an OCR engine. It has always been communicated that the user should preprocess the image. The wiki page on improving quality is one of the oldest. So the user is always responsible for the quality of the input image. There was never an intention that tesseract would figure out how to improve the image (and there have been plenty of requests for automatic screenshot OCR). Yes, tesseract does some image processing, e.g. binarization, but it uses the Otsu algorithm for it. That does not work for all images, but if you do not like it, you can binarize the image yourself with another algorithm. By the same logic, we could do nothing about the alpha channel (some images would work, some would not). Because it is usually difficult for a user to identify that the alpha channel is the cause of an OCR problem, I chose the strategy of doing something: replacing the alpha channel with white... What you are asking for is that we change tesseract from an OCR engine into an OCR suite. This is simply not a goal, because of a lack of resources (programmers willing to contribute to open source). But you can take it as a business opportunity, as several people did ;-) Your patches for handling all kinds of text in tesseract are welcome.
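For readers who want to binarize an image themselves, as the comment above suggests, the following is a minimal pure-Python sketch of Otsu's method on a flat list of grayscale values. It shows the general technique only; the function names and sample values are hypothetical, and the implementation inside tesseract may differ in detail.

```python
def otsu_threshold(gray_values):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Values <= t form the dark class, values > t the bright class.
    """
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1

    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of intensities in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_values, threshold):
    """Map each value to 0 (dark) or 255 (bright) using the threshold."""
    return [255 if v > threshold else 0 for v in gray_values]


# Hypothetical sample: four dark pixels and four bright pixels.
sample = [12, 15, 20, 18, 200, 210, 205, 198]
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```

A single global threshold like this works when the histogram has two clear peaks; for unevenly lit or low-contrast images, a different binarization applied before calling tesseract may give better results.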
@niksedk: basically you can use black text on white, or white text on black, but the text should not be outlined... If you are OCRing subtitles, you should be able to improve the image easily: remove the transparency using the outline color (maybe just inverting the image could work)...
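The suggestion above could be sketched as follows: fill the transparent area with the outline color so the outline disappears, then invert to get dark text on a light background. This is an illustrative plain-Python model; `prepare_subtitle_pixel`, the assumed black outline, and the simple channel average used for grayscale are all assumptions, not part of tesseract.

```python
def prepare_subtitle_pixel(pixel, outline=(0, 0, 0)):
    """Turn one RGBA pixel into an inverted grayscale value (0-255).

    Step 1: composite over the outline color, so outline and background merge.
    Step 2: convert to grayscale with a plain channel average (an assumption;
            a perceptual luma formula would also work).
    Step 3: invert, so bright text becomes dark text on a bright background.
    """
    r, g, b, a = pixel
    alpha = a / 255.0
    flat = [c * alpha + o * (1.0 - alpha) for c, o in zip((r, g, b), outline)]
    gray = round(sum(flat) / 3)
    return 255 - gray


# Hypothetical scanline: transparent, black outline, white fill,
# black outline, transparent.
subtitle_row = [
    (0, 0, 0, 0),
    (0, 0, 0, 255),
    (255, 255, 255, 255),
    (0, 0, 0, 255),
    (0, 0, 0, 0),
]

prepared = [prepare_subtitle_pixel(p) for p in subtitle_row]
print(prepared)
```

In this model the white letter fill ends up as a solid dark stroke on a light background, with no separate outline left in the image.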
I'm having some issues with accuracy when upgrading from 4.0 beta 3 to 4.0 final.
Setup:
cmd: tesseract image output -l eng --oem 1
tessdata: https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
platform: Windows 7 (Tesseract compiled via vcpkg)
Results:
a.png
tesseract 4.0.0 final: In @ crowded city, as | bump shoulders, I'm all alone
tesseract 4.0 beta 3: In a crowded city, as | bump shoulders, I'm all alone
b.png
tesseract 4.0.0 final: ane unexpectedly, ror whatever reason,
tesseract 4.0 beta 3: and unexpectedly, for whatever reason,
c.png
tesseract 4.0.0 final: they show me kincdness, | loetl
tesseract 4.0 beta 3: they show me kindness, | bet!
d.png
tesseract 4.0.0 final: [t wash t my rault...l
tesseract 4.0 beta 3: It wasn't my fault...!
test-images.zip