OCR from grayscale TIFFs produces inconsistent results #4268

ed-epiq · 2024-06-12T14:03:28Z

Current Behavior

TIFFs that look the same to the user but vary slightly in size produce completely different extracted text.
Please see the samples in the attached ZIP.

Our invocation is
sudo tesseract <input_absFilePath>.tif <output_absFilePath> -l eng

Expected Behavior

All the text from the image should be extracted.
Please see the sample in the attached ZIP.

Suggested Fix

No response

tesseract -v

We tried 2 versions:

tesseract 5.3.3
leptonica-1.83.1
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511

... and ...

tesseract 5.4.0-rc2-17-g3469
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

Operating System

No response

Other Operating System

Ubuntu 20.04.6 LTS ("focal")

uname -a

Linux ip-xx-xxx-xx-xxx d.dd.d-dddd-aws #68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

we invoke tesseract as a command

CPU

hosted in AWS

Virtualization / Containers

tesseract 5.4.0-rc2-17-g3469 was used on Docker
tesseract 5.3.3 was used in AWS

Other Information

Please see the attached ZIP and read its README.txt for further information.
TesseractOCRissueForTIFFs.zip

stweil · 2024-06-12T16:59:32Z

That's normal, because Tesseract's image processing and layout detection were never designed for such images. In a short test it still detects a surprising amount of text:

$ convert M01051\ poster.pdf Image.jpg
$ tesseract Image.jpg - -l script/Latin

NewPower Sign up today!

Connections

24/ 7 NewPower

Energy Manager

ANYWHERE, ANYTIME

 InternetIP
Network

personal
; digital
assistant
WALKING

computer P

] m Pa thermostat

OFFICE

= outdoor lighting

NewPower | InternetHomeAlliance l COACTIVE' min, | SEARS.

Connections NETWORKS

Broadband / Phoneline gateway

existing powerline

FUTURE
CONNECTIONS

waterheater

K

ed-epiq · 2024-06-12T17:40:30Z

I appreciate the quick response, Stefan!
Thanks for your analysis on JPG. But what about the TIFFs, which are part of our established process flow? Could you try those on your end (from the provided ZIP)?

stweil · 2024-06-12T17:51:55Z

Those TIFF files give similarly bad results for me as in your tests. If you use different values for the DPI (for example by adding --dpi 600), the results change. You can also try --psm 12 (which will find a lot of relevant text and also a huge amount of wrong text) and many other parameters, but I am afraid that nobody here has the time to help you with recognition issues.

Therefore I close this issue.
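
For readers who want to try the --dpi and --psm suggestion systematically, the following is a minimal sketch, not something posted in the thread. The function name sweep_ocr, the DRY_RUN switch, the value lists (300 and 600 DPI, page segmentation modes 3 and 12) and the output naming scheme are all assumptions made for illustration; only the tesseract options -l, --dpi and --psm come from the discussion above.

```shell
#!/bin/sh
# Sketch: run tesseract on one image with several --dpi and --psm values
# so the resulting text files can be compared side by side.
# Hypothetical names: sweep_ocr, DRY_RUN, and the dpiN_psmM output scheme.

sweep_ocr() {
    input=$1      # path to the input image, e.g. a grayscale TIFF
    outdir=$2     # existing directory that receives the text output
    for dpi in 300 600; do
        for psm in 3 12; do
            # tesseract appends ".txt" to the output base name itself.
            out="$outdir/dpi${dpi}_psm${psm}"
            if [ "${DRY_RUN:-0}" = 1 ]; then
                # Dry run: only print the command that would be executed.
                echo "tesseract $input $out -l eng --dpi $dpi --psm $psm"
            else
                tesseract "$input" "$out" -l eng --dpi "$dpi" --psm "$psm"
            fi
        done
    done
}
```

Calling `DRY_RUN=1 sweep_ocr sample.tif results` prints the four commands without running them; dropping DRY_RUN executes them, after which the files in results/ can be compared with diff. Which combination works best, if any, depends on the images.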

stweil closed this as completed Jun 12, 2024
