That's normal, because Tesseract's image processing and layout detection were never designed for such images. Even so, a short test detects a surprising amount of text:
$ convert M01051\ poster.pdf Image.jpg
$ tesseract Image.jpg - -l script/Latin
NewPower Sign up today!
Connections
24/ 7 NewPower
Energy Manager
ANYWHERE, ANYTIME
InternetIP
Network
personal
; digital
assistant
WALKING
computer P
] m Pa thermostat
OFFICE
= outdoor lighting
NewPower | InternetHomeAlliance l COACTIVE' min, | SEARS.
Connections NETWORKS
Broadband / Phoneline gateway
existing powerline
FUTURE
CONNECTIONS
waterheater
K
I appreciate the quick response, Stefan!
Thanks for your analysis of the JPG. But what about the TIFFs, which are part of our established process flow? Could you try those on your end (from the provided ZIP)?
Those TIFF files give similarly bad results for me as in your tests. If you use different values for the DPI (for example by adding --dpi 600), the results change. You can also try --psm 12 (which will find a lot of relevant text and also a huge amount of wrong text) and many more parameters, but I am afraid that nobody here has the time to help you with recognition issues.
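The parameter experiments mentioned above can be scripted rather than run by hand. The following is only a rough sketch: the function name `sweep`, the chosen DPI and page segmentation values, and the output naming scheme are illustrative assumptions and not something recommended in this thread. It prints one tesseract command per combination so the list can be reviewed before anything is executed.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): print one tesseract command
# per DPI / page-segmentation-mode combination for a single input image.
# The values below are arbitrary starting points, not tuned recommendations.
sweep() {
    input=$1
    for dpi in 150 300 600; do
        # 3 = fully automatic (default), 11 = sparse text, 12 = sparse text with OSD
        for psm in 3 11 12; do
            # Output base name: input without extension plus the parameters used
            out="${input%.*}_dpi${dpi}_psm${psm}"
            printf '%s\n' "tesseract $input $out -l eng --dpi $dpi --psm $psm"
        done
    done
}
```

Assuming the input file name contains no spaces, `sweep sample.tif` lists the nine commands and `sweep sample.tif | sh` would run them, leaving one text file per combination that can be compared afterwards.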
Current Behavior
TIFFs which look the same to the user but slightly vary in size result in completely different extracted text.
Please see the samples in attached ZIP.
Our invocation is
sudo tesseract <input_absFilePath>.tif <output_absFilePath> -l eng
Expected Behavior
All the text from the image should be extracted.
Please see the sample in attached ZIP.
Suggested Fix
No response
tesseract -v
We tried 2 versions:
tesseract 5.3.3
leptonica-1.83.1
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
... and ...
tesseract 5.4.0-rc2-17-g3469
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Operating System
No response
Other Operating System
Ubuntu 20.04.6 LTS ("focal")
uname -a
Linux ip-xx-xxx-xx-xxx d.dd.d-dddd-aws #68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Compiler
we invoke tesseract as a command
CPU
hosted in AWS
Virtualization / Containers
tesseract 5.4.0-rc2-17-g3469 was used on Docker
tesseract 5.3.3 was used in AWS
Other Information
Please see the attached ZIP and read its README.txt for further information.
TesseractOCRissueForTIFFs.zip