OCR from grayscale TIFFs produces inconsistent results #4268

Closed
ed-epiq opened this issue Jun 12, 2024 · 3 comments

ed-epiq commented Jun 12, 2024

Current Behavior

TIFFs which look the same to the user but slightly vary in size result in completely different extracted text.
Please see the samples in attached ZIP.

Our invocation is
sudo tesseract <input_absFilePath>.tif <output_absFilePath> -l eng

Expected Behavior

All the text from the image should be extracted.
Please see the sample in attached ZIP.

Suggested Fix

No response

tesseract -v

We tried 2 versions:

tesseract 5.3.3
leptonica-1.83.1
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511

... and ...

tesseract 5.4.0-rc2-17-g3469
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

Operating System

No response

Other Operating System

Ubuntu 20.04.6 LTS ("focal")

uname -a

Linux ip-xx-xxx-xx-xxx d.dd.d-dddd-aws #68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

we invoke tesseract as a command

CPU

hosted in AWS

Virtualization / Containers

tesseract 5.4.0-rc2-17-g3469 was used on Docker
tesseract 5.3.3 was used in AWS

Other Information

Please see the attached ZIP and read its README.txt for further information.
TesseractOCRissueForTIFFs.zip

Member

stweil commented Jun 12, 2024

That's normal, because Tesseract's image processing and layout detection were never designed for such images. And in a short test it detects a surprising amount of text:

$ convert M01051\ poster.pdf Image.jpg
$ tesseract Image.jpg - -l script/Latin

NewPower Sign up today!

Connections

24/ 7 NewPower

Energy Manager

ANYWHERE, ANYTIME

 InternetIP
Network

personal
; digital
assistant
WALKING

computer P

] m Pa thermostat

OFFICE

= outdoor lighting

NewPower | InternetHomeAlliance l COACTIVE' min, | SEARS.

Connections NETWORKS

Broadband / Phoneline gateway

existing powerline

FUTURE
CONNECTIONS

waterheater

K

Author

ed-epiq commented Jun 12, 2024

I appreciate the quick response, Stefan!
Thanks for your analysis of the JPG. But what about the TIFFs, which are part of our established process flow? Could you try those on your end (from the provided ZIP)?

Member

stweil commented Jun 12, 2024

Those TIFF files give similarly bad results for me as in your tests. If you use different values for the DPI (for example by adding --dpi 600), the results change. You can also try --psm 12 (which will find a lot of relevant text and also a huge amount of wrong text) and many other parameters, but I am afraid that nobody here has the time to help you with recognition issues.

Therefore I close this issue.
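
For readers who want to try the --dpi and --psm suggestions above systematically, here is a minimal sketch of a parameter sweep. It is not something posted in the thread: the function name `sweep`, the `DRY_RUN` switch, and the extra values 300 (DPI) and 3 (the default page segmentation mode) are assumptions for illustration. Only `--dpi 600`, `--psm 12`, and `-l eng` come from the discussion.

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on one image with several DPI/PSM
# combinations so the extracted text can be compared side by side.
# Usage: sweep <input image> <output base name>
sweep() {
    input=$1
    outbase=$2
    for dpi in 300 600; do        # 600 is from the thread; 300 is an assumed baseline
        for psm in 3 12; do       # 12 is from the thread; 3 is tesseract's default mode
            out="${outbase}_dpi${dpi}_psm${psm}"
            if [ -n "${DRY_RUN:-}" ]; then
                # Only print the command that would be executed.
                echo "tesseract $input $out -l eng --dpi $dpi --psm $psm"
            else
                # tesseract appends .txt to the output base name.
                tesseract "$input" "$out" -l eng --dpi "$dpi" --psm "$psm"
            fi
        done
    done
}
```

Calling `DRY_RUN=1 sweep input.tif result` prints the four commands without running them. Without `DRY_RUN`, each combination writes its own text file (for example `result_dpi600_psm12.txt`), which can then be compared with `diff`.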

stweil closed this as completed Jun 12, 2024