Segmentation fault OCRing a washed out image #1601

Closed
konstantin-dzreev opened this issue May 24, 2018 · 31 comments

Comments

@konstantin-dzreev

I'm playing with tesseract, trying to process bad images: really dark ones, really light ones, ones with very low contrast, etc. I ran into a file that causes tesseract to die with a segmentation fault.

Environment

  • Tesseract Version: tesseract 4.0.0-beta.1-270-g5a56
  • Commit Number: 5a56d0c
  • Platform: Linux <host-name> 4.15.0-22-generic #24-Ubuntu SMP Wed May 16 12:15:17 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 18.04)
$ tesseract --version

tesseract 4.0.0-beta.1-270-g5a56
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

Current Behavior: Segmentation fault:

$ tesseract scan1.grey.png stdout

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
Segmentation fault (core dumped)

Attachment: a file to reproduce the issue

scan1 grey

@Shreeshrii
Collaborator

Duplicate issue - please see #1205

@zdenop Please close

@amitdo
Collaborator

amitdo commented May 25, 2018

Try with this image:

dark-bin-127-147

@amitdo
Collaborator

amitdo commented May 25, 2018

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

#427 #468

@zdenop
Contributor

zdenop commented May 25, 2018

Duplicate

@zdenop closed this as completed May 25, 2018

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018

@konstantin-dzreev I am not able to reproduce the unichar-id assertion and core dump that you are getting (pasted below):

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
Segmentation fault (core dumped)

My version info is the same:

 tesseract -v
tesseract 4.0.0-beta.1-270-g5a56
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

The only difference I see is that you have:

Found AVX2
 Found AVX
 Found SSE

libpng version is also different.

@stweil Can that make a difference?

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018

I found a different issue while processing this image with gdb (hoping to trace the crash).

Edit: made new issue #1603

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018

Related to issue #1603 - using image posted here by OP

Log file attached here
1601.log.txt

@Shreeshrii
Collaborator

@zdenop Please reopen the issue. The title could be edited to say 'psm 6 producing gibberish'. Thanks!

@zdenop
Contributor

zdenop commented May 25, 2018

@Shreeshrii: your observation is different from the original issue report (which is a duplicate of already open issues). Renaming it will just produce chaos...

@Shreeshrii
Collaborator

@zdenop Good point. I will open a different issue for it and delete the comments from here. Thanks.

@amitdo
Collaborator

amitdo commented May 25, 2018

OK, I tested it.

$ tesseract ~/Downloads/dark.png - --tessdata-dir ~/Downloads/tessdata/tessdata
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../src/ccutil/unicharset.h, line 511
Segmentation fault

@Shreeshrii
Collaborator

@amitdo tesseract and leptonica version, please!

@amitdo
Collaborator

amitdo commented May 25, 2018

The fast, best and original (lstm+legacy) tessdata do not crash.

@Shreeshrii
Collaborator

Which version is
--tessdata-dir ~/Downloads/tessdata/tessdata?

Does it correspond to the current tessdata?

@amitdo
Collaborator

amitdo commented May 25, 2018

$ uname -a
Linux debian 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
$ tesseract -v
tesseract 4.0.0-beta.1-270-g5a56
 leptonica-1.76.1
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.5.2 : libopenjp2 2.1.2
 Found AVX
 Found SSE

@amitdo
Collaborator

amitdo commented May 25, 2018

The 2 latest commits both crash.

@Shreeshrii
Collaborator

The 2 latest commits both crash.

of tessdata?

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018 via email

@amitdo
Collaborator

amitdo commented May 25, 2018

If I use the manually binarized image I provided earlier with those 2 newer traineddata files from the tessdata repo, there is no crash.

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018 via email

@amitdo
Collaborator

amitdo commented May 25, 2018

The version without cube is one of the two that crash.

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018 via email

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018 via email

@amitdo
Collaborator

amitdo commented May 25, 2018

Yes, with --oem 0 or --oem 1 it does not crash.
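
The engine-mode comparison above can be repeated with a small loop. This is a minimal sketch, assuming a tesseract 4 binary on the PATH and a traineddata file that contains both the legacy and the LSTM model (needed for `--oem 0` and `--oem 2`). The function name `try_oem_modes` and the `TESSERACT_BIN` override are hypothetical names introduced only for this example; they are not part of tesseract.

```shell
#!/bin/sh
# Run one image through each OCR engine mode and report the exit status.
#   --oem 0: legacy engine only
#   --oem 1: LSTM engine only
#   --oem 2: legacy + LSTM combined
# TESSERACT_BIN is a hypothetical override for this sketch, not a tesseract feature.
try_oem_modes() {
    image=$1
    bin=${TESSERACT_BIN:-tesseract}
    for oem in 0 1 2; do
        # OCR text and diagnostics are discarded; only the exit status matters here.
        "$bin" "$image" stdout --oem "$oem" >/dev/null 2>&1
        echo "oem $oem: exit status $?"
    done
}

# Usage (file name taken from the original report):
#   try_oem_modes scan1.grey.png
```

On Linux, a process killed by a segmentation fault is normally reported by the shell with exit status 139, so a crash in one mode should stand out from the others.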

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018 via email

@Shreeshrii
Collaborator

Shreeshrii commented May 25, 2018

Should we disable --oem 2?

That would also take care of other issues related to multi-language processing, where one language may have only an LSTM model (Indic, Arabic) while others have both, which leads to similar problems.

--oem 2 does not necessarily give better or more accurate results.

@Shreeshrii
Collaborator

See #235 (comment)

regarding --oem 2 issues with a mix of languages.

@stweil
Member

stweil commented May 28, 2018

It indeed looks like the wrong unicharset is used (eng.lstm-unicharset instead of eng.unicharset). If this can be confirmed, it would result in wrong decisions about which text is best and cause the observed assertions (which finally trigger an intentional segmentation fault). Avoiding the crash is easy, but the right fix will still need some time, at least for me.

The crash is not related to the cube removal.
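
One way to check the unicharset mismatch described above is to unpack the traineddata and compare the two character sets. This is a rough sketch, assuming `combine_tessdata` from the tesseract training tools is installed and that the first line of a unicharset file holds its entry count. The function names and the `/tmp/eng.` prefix are hypothetical and chosen only for illustration. A size difference alone does not prove the cause of the assertion; it only shows that an id valid in one set may not exist in the other.

```shell
#!/bin/sh
# Print the number of entries declared on the first line of a unicharset file.
unicharset_size() {
    head -n 1 "$1"
}

# Compare the legacy and LSTM unicharsets unpacked from one traineddata file.
compare_unicharsets() {
    legacy=$1
    lstm=$2
    n_legacy=$(unicharset_size "$legacy")
    n_lstm=$(unicharset_size "$lstm")
    echo "legacy: $n_legacy entries, lstm: $n_lstm entries"
    if [ "$n_legacy" -ne "$n_lstm" ]; then
        echo "sizes differ: an id valid in the larger set can be out of range in the smaller one"
    fi
}

# Usage (paths are assumptions):
#   combine_tessdata -u tessdata/eng.traineddata /tmp/eng.
#   compare_unicharsets /tmp/eng.unicharset /tmp/eng.lstm-unicharset
```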

@stweil
Member

stweil commented May 28, 2018

I wonder why we need more than one unicharset and more than one word list. Neither should depend on the OCR engine used, and it should be possible to always use a superset fitting both engines. That would also reduce the traineddata size.

@Shreeshrii
Collaborator

The crash is not related to the cube removal.

@stweil Yes, you are right. I was trying to recall recent commits from memory. Further testing indicated that the problem might be the unicharsets.

If this can be confirmed,

If you can indicate what testing will help, I can do it.

I wonder why we need more than one unicharset and more than one word list.

My guess is that the language models depend on these, especially the LSTM model, which also uses a recoder/unicharcompressor for some languages. Using a different unicharset (even the same unichars in a different order in the file) leads to wrong results.

it should be possible to always use a superset fitting both engines.

You could give it a try. Use merge_unicharsets and keep the lstm-unicharset first in the list.
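
The suggestion above could be tried roughly as follows. This is a sketch, assuming `merge_unicharsets` from the tesseract training tools is installed and takes its input files in order followed by the output file. The function name `merge_lstm_first`, the `MERGE_TOOL` override and the file names in the usage comment are hypothetical. Whether the merged set actually works with both engines is untested here.

```shell
#!/bin/sh
# Merge two unicharsets with the LSTM one listed first, so its entries
# come first in the merged file.
# MERGE_TOOL is a hypothetical override used only in this sketch.
merge_lstm_first() {
    lstm=$1
    legacy=$2
    merged=$3
    tool=${MERGE_TOOL:-merge_unicharsets}
    # Inputs are passed in order; the last argument is the output file.
    "$tool" "$lstm" "$legacy" "$merged"
}

# Usage (file names are assumptions, e.g. as unpacked by combine_tessdata -u):
#   merge_lstm_first /tmp/eng.lstm-unicharset /tmp/eng.unicharset /tmp/eng.merged-unicharset
```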
