OSD not working again with --psm 0 after latest 20181030 binary release #2062

Closed
CanadianHusky opened this issue Nov 18, 2018 · 18 comments
Labels
accuracy OSD Orientation and Script Detection

@CanadianHusky

Environment

  • Tesseract Version: 4.0.0.20181030 regression against 4.0.0-rc1
  • Platform: windows 64 bit

Binary release clean install from

https://github.com/UB-Mannheim/tesseract/wiki
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0.20181030.exe

Current Behavior:

Orientation is detected incorrectly for the supplied file with the command line shown below.

image

WRONG Result :

Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 14.00
Script: Latin
Script confidence: nan

Expected Behavior:

compare the same input against 4.0.0-rc1
image

CORRECT Result :

Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33

Based on tests on thousands of files with the rc1 version, the orientation confidence value is extremely accurate and makes sense. I use it as a threshold to decide whether the result can be trusted.
The result from the 20181030 release is badly wrong.
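
A minimal sketch of how such a confidence threshold could be applied, assuming the OSD report has already been captured as text. The names `parse_osd` and `is_trustworthy` and the threshold value are hypothetical and not part of Tesseract; the field names are taken from the output shown above.

```python
import math

# Fields of the `tesseract --psm 0` report that hold numbers
# (names copied from the output quoted in this issue).
NUMERIC_FIELDS = {
    "Page number",
    "Orientation in degrees",
    "Rotate",
    "Orientation confidence",
    "Script confidence",
}


def parse_osd(text: str) -> dict:
    """Turn the 'Key: value' lines of an OSD report into a dict."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue  # e.g. "Warning, detects only orientation with -l eng"
        if key in NUMERIC_FIELDS:
            try:
                result[key] = float(value)
            except ValueError:
                # Spellings such as "-nan(ind)" are not accepted by float().
                result[key] = math.nan
        else:
            result[key] = value
    return result


def is_trustworthy(osd: dict, min_confidence: float) -> bool:
    """Hypothetical check: accept the orientation only at or above a caller-chosen threshold."""
    conf = osd.get("Orientation confidence", math.nan)
    return not math.isnan(conf) and conf >= min_confidence
```

Note that the wrong report above carries a high orientation confidence (14.00), so a threshold of this kind only helps when the confidence value itself is meaningful.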

Input Image :

image

Suggested Fix:

Investigate what led to the regression in the OSD code.

thank you kindly

@stweil
Member

stweil commented Nov 19, 2018

This could be related to the changed handling of the alpha channel in PNG images: the latest Tesseract code replaces the alpha channel by white.

@CanadianHusky, could you please try both versions with the same image in other formats (for example JPEG or TIFF) or with a PNG without alpha channel?

@CanadianHusky
Author

Hello,

@stweil
I have tested the RC3, RC4, and final 4-20181030 builds.
I used BMP and JPG versions of the same image as input.
All of them have the same problem and fail to detect the orientation correctly, which used to work in RC1.
The problem must have been introduced somewhere between RC1 and RC3.
thank you

@stweil stweil pinned this issue Jan 9, 2019
@stweil stweil unpinned this issue Jan 9, 2019
@stweil stweil pinned this issue Jan 9, 2019
@zdenop zdenop added this to the 4.1.0 milestone Feb 16, 2019
@CanadianHusky
Author

Hello, I see a new pre-compiled release at https://digi.bib.uni-mannheim.de/tesseract/ for

tesseract-ocr-w64-setup-v4.1.0.20190314.exe

and tested that release against the issue mentioned above.

The result on the input image is still incorrect.
I am unsure whether the binary release I used is really a 4.1.0 release or an intermediate build.

thank you

@stweil
Member

stweil commented Mar 15, 2019

That binary is based on latest Tesseract sources (Git master).

@zdenop
Contributor

zdenop commented May 9, 2019

@CanadianHusky: you can copy and paste terminal output by selecting it with the mouse (left button); a right click in the terminal then puts the selection on the clipboard. That is more useful than screenshots.

I ran a test with the latest code (5.0.0-alpha-50-g3f4dc) and the best tessdata:

> tesseract i2062.png - --dpi 175 -c min_characters_to_try=10 --psm 0 -l eng
Warning, detects only orientation with -l eng
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 14.00
Script: Latin
Script confidence: -nan(ind)

But if I skip the language specification (eng should be used anyway), I get a different result:

> tesseract i2062.png - --dpi 175 -c min_characters_to_try=10 --psm 0
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.28
Script: Greek
Script confidence: 4.36

The orientation is detected correctly, but the script is wrong. It is quite strange that specifying the eng language causes a different result...

@zdenop
Contributor

zdenop commented May 9, 2019

And using tessdata (i.e. not fast, not best) gives the correct result:

tesseract i2062.png - --psm 0 --tessdata-dir tessdata -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 174
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33

It seems the LSTM model is not able to detect the orientation correctly on this kind of image (too few characters), but the legacy engine works fine:

pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 0 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33
pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 1 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 14.00
Script: Latin
Script confidence: nan
pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 2 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33
pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 3 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33
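
The four runs above differ only in the `--oem` value. A small sketch of how those command lines could be generated for a side-by-side comparison; the helper name `build_osd_command` is hypothetical, and only flags that appear in the commands above are used.

```python
from typing import Dict, List, Optional


def build_osd_command(
    image: str,
    tessdata_dir: Optional[str] = None,
    oem: Optional[int] = None,
    dpi: Optional[int] = None,
    min_chars: Optional[int] = None,
    lang: Optional[str] = None,
) -> List[str]:
    """Build a `tesseract ... --psm 0` argument list that prints the OSD report to stdout."""
    cmd = ["tesseract", image, "-", "--psm", "0"]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if min_chars is not None:
        cmd += ["-c", f"min_characters_to_try={min_chars}"]
    if lang is not None:
        cmd += ["-l", lang]
    return cmd


# One command per engine mode, mirroring the runs quoted above.
commands: Dict[int, List[str]] = {
    oem: build_osd_command(
        "i2062.png", tessdata_dir="tessdata", oem=oem, dpi=175, min_chars=10, lang="eng"
    )
    for oem in (0, 1, 2, 3)
}
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to collect the reports for comparison.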

@zdenop
Contributor

zdenop commented May 9, 2019

More details that may shed some light on how it works:

If no language is specified, only osd.traineddata is used (according to the strace report). That is the reason why script detection is not correct.
When a language is specified with -l eng, then:

  • first eng.traineddata is opened
  • next the image is opened
  • and then osd.traineddata is opened...

I am not sure if we can or want to do something about this.

@CanadianHusky
Author

As soon as I see a stable binary release that I can test, I will try those suggested command line options.
If using the --oem option with the correct value detects the correct orientation with a reasonable confidence value, that is sufficient. It does not matter to me personally whether the detection is done with LSTM or legacy code. Of course, it is very desirable that this sort of orientation detection works as fast as possible. I appreciate the provided information. Thank you @zdenop

@zdenop
Contributor

zdenop commented May 9, 2019

If my observation is correct, you do not need to wait for a stable release: just use the tessdata repository for OSD.

@stweil
Member

stweil commented Jun 23, 2019

@zdenop, it is normal that only osd.traineddata is used if no explicit language was given. That file includes a selection of more than 1700 unicode characters from different scripts which are used to detect the right script. It is only available for the legacy OCR engine. Therefore it won't work if you use --oem 1 or compile Tesseract without that engine.

My tests with latest Tesseract code all give the right orientation as long as I do not add --oem 1.
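
Since OSD depends on the legacy engine, a wrapper script could flag the problematic combination before calling Tesseract. A sketch, assuming the arguments are available as a list; `osd_engine_warning` is a hypothetical helper, not a Tesseract feature.

```python
from typing import List, Optional


def _flag_value(args: List[str], flag: str) -> Optional[str]:
    """Return the argument that follows `flag`, or None if the flag is absent."""
    if flag in args:
        idx = args.index(flag)
        if idx + 1 < len(args):
            return args[idx + 1]
    return None


def osd_engine_warning(args: List[str]) -> Optional[str]:
    """Return a warning when OSD (--psm 0) is combined with the LSTM-only engine (--oem 1)."""
    if _flag_value(args, "--psm") == "0" and _flag_value(args, "--oem") == "1":
        return "OSD needs the legacy engine; --oem 1 may give a wrong orientation"
    return None
```

For example, `osd_engine_warning(["tesseract", "i2062.png", "-", "--psm", "0", "--oem", "1"])` returns the warning text, while the same call without `--oem 1` returns `None`.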

@zdenop
Contributor

zdenop commented Jun 24, 2019

So what is the status of this issue? Can it be closed?

@stweil
Member

stweil commented Jun 24, 2019

@CanadianHusky, do you still have that problem?

@CanadianHusky
Author

Orientation detection still has problems for me. Here are my test results, after having adjusted the command line as recommended by @stweil

Test environment :
clean install from https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20190623.exe

image

All 3 input images are at 0 degrees, but are detected with incorrect results.
I admit that the input 3 image is of poor quality, and a higher preprocessing resolution does find the correct result. However, inputs 2 and 4 are as good as images are going to get, with clean and large enough letters, so I would have liked to see a correct result.

Am I still doing something wrong in the command line?

input2 image :
image

input 3 image :
image

input 4 image :
image

Also worth noting: adding -l eng (or -l deu) changes the orientation detection result, still to an incorrect one, but with very high confidence.

image

@Shreeshrii
Collaborator

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/9HSpp7Ysduw/r8FPCHhBFAAJ

It might be related to this OSD related issue.

@amitdo
Collaborator

amitdo commented May 17, 2020

Reading the comments from @zdenop and @stweil, it seems that there is no regression in newer versions with the first image in this issue.

Nobody commented about the other images. It is not clear if the OP claims that there is a regression here too, or just complains about the wrong result.

@amitdo
Collaborator

amitdo commented May 18, 2020

I tested the input2 image.

I got the correct result with:

tesseract input2.png input2 --psm 0 -l eng --tessdata-dir $testadadir/tessdata -c min_characters_to_try=10

console:

Warning, detects only orientation with -l eng
Tesseract Open Source OCR Engine v5.0.0-alpha-580-g87841 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 225
Warning. Invalid resolution 0 dpi. Using 70 instead.

input2.osd

Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.36
Script: Latin
Script confidence: 29.17

I'm not going to bother testing more images.

@amitdo amitdo unpinned this issue May 18, 2020
@CanadianHusky
Author

CanadianHusky commented May 18, 2020

Thank you for revisiting this issue. In the meantime I have discovered the source of the inconsistency.
The issue is not a regression in the code itself but depends on which traineddata file is used.
When I do a clean install from https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20190623.exe or any recent release...

This data file is installed
image

Now observe these tests; only the -l argument changes. The expected result is 0 degrees and a meaningful confidence value.

C:\Program Files\Tesseract-OCR>tesseract --version
tesseract v5.0.0-alpha.20191030
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

C:\Program Files\Tesseract-OCR>tesseract --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata" --psm 0 -l eng -c min_characters_to_try=10 "input2.png" stdout
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 50.00
Script: Latin
Script confidence: 2.00

WRONG 

C:\Program Files\Tesseract-OCR>tesseract --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata" --psm 0 -l eng_15040 -c min_characters_to_try=10 "input2.png" stdout
Warning, detects only orientation with -l eng_15040
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 50.00
Script: Latin
Script confidence: 2.00

WRONG

C:\Program Files\Tesseract-OCR>tesseract --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata" --psm 0 -l eng_22917 -c min_characters_to_try=10 "input2.png" stdout
Warning, detects only orientation with -l eng_22917
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.38
Script: Latin
Script confidence: 30.00

CORRECT!

Here are the traineddata files:
image

These are the files in tessdata, and clearly the source of the issue for me is that the original file installed with the binary distribution does not give the expected result. The file eng_22917 was downloaded separately from the traineddata repository.

I would be interested to know what size your eng.traineddata file is and where it is from.

The source for my trained data files are as follows:

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
22917kb and the only file that works for orientation detection
probably because it has the legacy models that OSD code needs

https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata
4017kb, also part of the binary installation, does not work with --psm 0 for orientation detection purposes for me

https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
15040kb, does not work with --psm 0 for orientation detection purposes for me
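
A rough sketch of how the file sizes listed above could be used to guess which variant of eng.traineddata is installed. This is only a heuristic: the sizes are the ones reported in this thread and can change between commits, the KB figures are assumed to be multiples of 1024 bytes, and the names `REPORTED_SIZES_KB`, `guess_variant`, and `guess_variant_for_file` are hypothetical.

```python
import os
from typing import Optional

# eng.traineddata sizes in KB as reported in this thread (assumed 1 KB = 1024 bytes).
REPORTED_SIZES_KB = {
    "tessdata": 22917,       # includes legacy models; worked for OSD in the tests above
    "tessdata_best": 15040,  # did not work with --psm 0 in the tests above
    "tessdata_fast": 4017,   # shipped with the binary installer; did not work either
}


def guess_variant(size_bytes: int, tolerance: float = 0.10) -> Optional[str]:
    """Return the repository whose reported size is closest, or None if nothing is within tolerance."""
    size_kb = size_bytes / 1024
    name, ref_kb = min(
        REPORTED_SIZES_KB.items(),
        key=lambda item: abs(size_kb - item[1]) / item[1],
    )
    if abs(size_kb - ref_kb) / ref_kb <= tolerance:
        return name
    return None


def guess_variant_for_file(path: str) -> Optional[str]:
    """Convenience wrapper that reads the size from disk."""
    return guess_variant(os.path.getsize(path))
```

A result of `"tessdata_fast"` or `"tessdata_best"` would suggest replacing the file with the one from the tessdata repository before relying on `--psm 0`.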

It took me a very long time to understand and figure out this issue. I hope this information helps someone else. I have closed the issue.

I suppose the question now becomes whether it makes sense to add a note to the binary distribution, or elsewhere in the release notes from @stweil, that the included default traineddata file is the fast integer model. That is totally fine for most users when all they want to do is regular OCR. For anyone who is interested only in OSD, like me, the traineddata file from the tessdata repository linked above (22917kb) must be used, as far as I can see from my tests.
Thanks again for having this pinned and looked into. Much appreciated.

@amitdo
Collaborator

amitdo commented May 18, 2020

I would be interested to know what size your eng.traineddata file is and where it is from.

I used eng.traineddata from the tessdata repo.

https://github.com/tesseract-ocr/tessdata/blob/d87b3cbc7555/eng.traineddata

Size: 24.5 MB (24,530,234 bytes)
