Tesseract 4 cannot use anything other than --oem 0 #1043

nickbe · 2017-07-18T18:36:27Z

Platform is Debian Jessie - Tesseract 4.00 Git Version.
Platform: Linux localhost 4.4.27-x86_64-jb1 #4 SMP Tue Jun 6 14:41:09 CEST 2017 x86_64 GNU/Linux

Tesseract crashes with "Illegal Instruction" when using anything other than --oem 0

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics
Illegal instruction

Tesseract -v reports

tesseract 4.00.00alpha
 leptonica-1.74.4
  libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0

 Found AVX
 Found SSE

I can scan with --oem 0 though.

Shreeshrii · 2017-07-18T18:49:35Z

What is the version of your traineddata files? Download the latest version from the tessdata repo.

nickbe · 2017-07-18T21:08:28Z

Ok, so now I reinstalled tesseract just to make sure I did everything right.
Tessdata files like 'eng.traineddata' have now been downloaded directly from the repo into /usr/local/share/tessdata

Current content:
configs deu.traineddata eng.traineddata pdf.ttf tessconfigs

Now Tesseract starts but tells me that it can't load any language. Which is quite odd.

tesseract --tessdata-dir /usr/local/share/tessdata/tessdata -l eng test.jpg out
results in:

Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

tesseract -l eng test.jpg out
results in:
Error opening data file /usr/local/share/eng.traineddata

and
tesseract --tessdata-dir /usr/local/share/tessdata -l eng test.jpg out
also results in:
Error opening data file /usr/local/share/eng.traineddata

And whatever I set TESSDATA_PREFIX to (like TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata), it does not get honored at all.
I simply don't get it. What's going on here?

nickbe · 2017-07-18T21:22:26Z

Ok, I solved the language problem. After unsetting TESSDATA_PREFIX and simply using:
wget https://github.com/tesseract-ocr/tessdata/raw/4.00/deu.traineddata
Tesseract seems to be able to load the language files from the default /usr/local/share/tessdata again.

But still --oem 1 results in:

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics
Illegal instruction

nickbe · 2017-07-18T21:43:10Z

When using the data files from:
git clone --depth=1 https://github.com/tesseract-ocr/tessdata.git tessdata-repo
tesseract fails to load the language files.

But when using the data files from: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
by downloading with:
wget https://github.com/tesseract-ocr/tessdata/raw/4.00/eng.traineddata
I can start tesseract with --oem 0, but --oem 1 or --oem 2 results in the illegal instruction message

Both ways I put the files into /usr/local/share/tessdata

Shreeshrii · 2017-07-19T03:29:59Z

Test with the tif file in the testing directory. It works OK for me.
My traineddata files are in ../tessdata directory

# tesseract phototest.tif phototest --tessdata-dir ../
Tesseract Open Source OCR Engine v4.00.00dev-2067 with Leptonica
Page 1

# tesseract phototest.tif phototest --tessdata-dir ../ --oem 1
Tesseract Open Source OCR Engine v4.00.00dev-2067 with Leptonica
Page 1

# tesseract phototest.tif phototest --tessdata-dir ../ --oem 2
Tesseract Open Source OCR Engine v4.00.00dev-2067 with Leptonica
Page 1


# tesseract -v
tesseract 4.00.00dev-2067
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE

Shreeshrii · 2017-07-19T03:31:22Z

When you say ' Tesseract 4.00 Git Version' I take it to mean that you are using the latest source from github to build tesseract.

nickbe · 2017-07-19T16:38:51Z

That's correct.

amitdo · 2017-07-19T16:51:02Z

Please test tesseract with phototest.tif, as Shree suggested.
https://github.com/tesseract-ocr/tesseract/blob/master/testing/phototest.tif

nickbe · 2017-07-19T17:35:13Z

OK. I tested it with the traineddata above. It's also the same data I'm using here.
I also confirmed that tesseract is indeed using the right data folder.

But again, phototest.tif works fine with --oem 0 and results in the same "Illegal instruction" error for any other --oem option or none (default should be --oem 2 if I'm not mistaken).

Compilation seemed fine, and I didn't see any error or warning. So I guess there must be some library missing here.

Also I reinstalled Leptonica and Tesseract multiple times now.

Here's how I've installed the tools:

1. Make sure that the following libraries are installed:

       # nickbe:  I had to replace libpng12-dev for debian jessie

	apt-get install autoconf-archive automake g++ libtool libleptonica-dev pkg-config
	apt-get install libpango1.0-dev

	# sudo apt-get install g++ # or clang++
	sudo apt-get install autoconf automake libtool
	sudo apt-get install autoconf-archive
	sudo apt-get install pkg-config
	sudo apt-get install libpng12-dev
	sudo apt-get install libjpeg-turbo
	sudo apt-get install libtiff5-dev
	sudo apt-get install zlib1g-dev

	sudo apt-get install libicu-dev
	sudo apt-get install libpango1.0-dev
	sudo apt-get install libcairo2-dev

2. Install Leptonica:

	git clone --depth 1 https://github.com/DanBloomberg/leptonica.git leptonica
	cd leptonica
	./autobuild
	./configure
	make
	sudo make install
	ldconfig

3. Install Tesseract:

    git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
    cd tesseract-ocr
    ./autogen.sh

    ./configure --disable-openmp --disable-shared --disable-static
    or
    ./configure        # nickbe: I TESTED BOTH CONFIGURATIONS JUST TO MAKE SURE
    make

    sudo make install
    sudo ldconfig

    # sudo make training
    # sudo make training-install

    sudo make install-langs      # nickbe: Never does anything so far
    sudo ldconfig

4. wget tessdata from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
   to /usr/local/share/tessdata

   Example: wget https://github.com/tesseract-ocr/tessdata/raw/4.00/eng.traineddata

Shreeshrii · 2017-07-19T17:46:35Z

Build with --enable-debug and run with gdb to get additional info.

amitdo · 2017-07-19T18:50:06Z

Do not install libleptonica-dev with apt-get, since you manually install leptonica later.

amitdo · 2017-07-19T18:53:36Z

apt-get remove libleptonica-dev

nickbe · 2017-07-19T19:05:56Z

OK I enabled debug. I also installed the gdb package, but I have no experience with it. How can I provide more information?

amitdo · 2017-07-19T19:11:47Z

sudo make install-langs # nickbe: Never does anything so far

The comment is correct, so there's no point in doing that.

nickbe · 2017-07-19T19:16:29Z

One quite important piece of information: I just installed the package on a fresh instance of Debian Stretch. I had no problems installing Leptonica or Tesseract, but after everything was installed I see exactly the same behaviour on this machine. Tesseract runs with --oem 0 but throws the "Illegal instruction" message when trying to use --oem 2 or 1.

Seems that there's something very profound missing in the installation procedure.

nickbe · 2017-07-19T19:24:39Z

Ok. I managed to run Tesseract with gdb. Here's the output:


(gdb) set args -l eng --oem 2 test.png out
(gdb) run
Starting program: /usr/local/bin/tesseract -l eng --oem 2 test.png out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics

Program received signal SIGILL, Illegal instruction.
tesseract::DotProductAVX (u=0x127aa70, v=0x51dbc60, n=25) at dotproductavx.cpp:70
70            __m256d floats2 = _mm256_loadu_pd(v);
(gdb)

amitdo · 2017-07-19T19:28:02Z

./configure --disable-openmp --disable-shared --disable-static

#898 (comment)
#943 (comment)

amitdo · 2017-07-19T19:41:47Z

Change this line in configure.ac
AX_CHECK_COMPILE_FLAG([-mavx], [avx=true], [avx=false])
to
AX_CHECK_COMPILE_FLAG([-mavx], [avx=false], [avx=false])

and recompile tesseract again.

nickbe · 2017-07-19T19:46:49Z

btw. I already uninstalled libleptonica-dev before.

Do I have to "make uninstall" before recompiling?

amitdo · 2017-07-19T19:56:54Z

Do I have to "make uninstall" before recompiling?

You mean make uninstall tesseract ?

You don't have to in this case.

Also, what's the output of cat /proc/cpuinfo | grep flags ?

nickbe · 2017-07-19T20:05:05Z

flags           : fpu tsc msr pae cx8 apic cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid
flags           : fpu tsc msr pae cx8 apic cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid
flags           : fpu tsc msr pae cx8 apic cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid

nickbe · 2017-07-19T20:12:42Z

After recompiling everything with the changed flag the new output is:

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics
DotProductAVX can't be used on Android
DotProductAVX can't be used on Android
Aborted

Android?!

amitdo · 2017-07-19T20:14:44Z

According to the output of cat /proc/cpuinfo | grep flags, your cpu does not support avx.
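
The same check can be scripted instead of read by eye. A minimal C++ sketch, assuming a flags line in the /proc/cpuinfo format posted above; `HasCpuFlag` is a hypothetical helper for illustration, not part of Tesseract:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `flag` appears as a whole token in a
// /proc/cpuinfo "flags : ..." line. Whole-token matching matters, because a
// substring search for "avx" would also match "avx2".
bool HasCpuFlag(const std::string& flags_line, const std::string& flag) {
  // Skip the "flags :" label if the line has one.
  const std::string::size_type colon = flags_line.find(':');
  std::istringstream tokens(colon == std::string::npos
                                ? flags_line
                                : flags_line.substr(colon + 1));
  std::string token;
  while (tokens >> token) {
    if (token == flag) return true;
  }
  return false;
}
```

Applied to the flags lines posted above, `HasCpuFlag(line, "avx")` is false while `HasCpuFlag(line, "sse4_2")` is true.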

nickbe · 2017-07-19T20:18:50Z

The latest recompile was already done with the modified configure.ac.
That was the output when running: tesseract -l eng --oem 2 ......
As always --oem 0 works.

amitdo · 2017-07-19T20:23:41Z

OK. You will have to do another change in the code.

I will tell you later/tomorrow what to do next.

amitdo · 2017-07-19T20:42:59Z

In arch/simddetect.h

Change this line
static inline bool IsAVXAvailable() { return detector.avx_available_; }
to
static inline bool IsAVXAvailable() { return false; }

I hope we will finish with this change :-)

stweil · 2017-07-19T20:43:59Z

It's strange that tesseract -v reports Found AVX while your CPU obviously does not support AVX (see output of /proc/cpuinfo). That's causing the crash which you observe. What kind of CPU are you using? Are you running on a virtual machine?

Could you use the GDB debugger to step through the function SIMDDetect::SIMDDetect (in arch/simddetect.cpp) which is executed right at the beginning? Maybe you have a buggy __get_cpuid function (or a buggy virtual machine). Try to print the value of ecx which is set by that function.

Removing the code avx_available_ = (ecx & 0x10000000) != 0; will work around the problem and fix the crash. The change suggested by @amitdo will have the same effect.
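
For anyone printing ecx in gdb: the mask 0x10000000 in that line is bit 28 of the ECX value returned by CPUID leaf 1, which the Intel manual defines as the AVX feature bit. A small sketch of decoding such a value follows. The bit positions are taken from the Intel manual; the struct and function names are made up for illustration and are not Tesseract's.

```cpp
#include <cassert>
#include <cstdint>

// Feature bits in ECX after executing CPUID with EAX = 1 (Intel SDM).
constexpr std::uint32_t kLeaf1Sse41Bit = 1u << 19;
constexpr std::uint32_t kLeaf1OsxsaveBit = 1u << 27;
constexpr std::uint32_t kLeaf1AvxBit = 1u << 28;
static_assert(kLeaf1AvxBit == 0x10000000u, "same mask as quoted in the thread");

// Hypothetical container for the decoded bits.
struct Leaf1Features {
  bool sse4_1;
  bool osxsave;
  bool avx;
};

// Decodes a raw ECX value, e.g. one printed from gdb.
Leaf1Features DecodeLeaf1Ecx(std::uint32_t ecx) {
  Leaf1Features features;
  features.sse4_1 = (ecx & kLeaf1Sse41Bit) != 0;
  features.osxsave = (ecx & kLeaf1OsxsaveBit) != 0;
  features.avx = (ecx & kLeaf1AvxBit) != 0;
  return features;
}
```

If `DecodeLeaf1Ecx(ecx).avx` is true on a machine whose /proc/cpuinfo flags list no avx, the CPUID value seen by the guest and the kernel's view disagree, which is the mismatch suspected here.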

amitdo · 2017-07-19T20:45:35Z

and recompile of course.

amitdo · 2017-07-19T20:49:32Z

Do what I said before listening to @stweil :-)

amitdo · 2017-07-19T21:04:01Z

@stweil

Yes. It's strange.
I want to make sure the problem will be solved after disabling (cheating) avx detection.
If that happens, nickbe will need to undo the 2 changes and recompile. Then you will do your analysis...

amitdo · 2017-07-19T21:13:55Z

What kind of CPU are you using?

cat /proc/cpuinfo | grep name

nickbe · 2017-07-19T21:29:30Z

It's a vServer. Probably XEN but I'm not sure. We do use them quite often without problems. So I have no idea why this case is indeed so strange. I'm recompiling now...

nickbe · 2017-07-19T22:01:44Z

Yay. It's finally working. Thank you so much guys 💃 Will the changes in the make make it into the official repository?

So now that I can in fact test the new 4.0 feats, is there a way to speed up scanning? Any switches that are recommended?

stweil · 2017-07-20T05:44:44Z

Will the changes in the make make it into the official repository?

No, they won't, because those changes disable AVX support which is highly desired: AVX makes Tesseract faster. The problem is most probably caused by your vServer which returns a wrong cpuid. That cpuid claims that your vServer supports AVX, but it does not. You can try to get more information on that vServer (is it XEN, which version?) and report the problem.

We could add a Tesseract option to select SSE / AVX (overriding the automatic detection). Then Tesseract would still crash by default in your case, but it would be possible to make it work using that new option.

nickbe · 2017-07-20T06:04:42Z

Is this something new to the 4.00 version? Because the 3.x Versions ran just fine.

stweil · 2017-07-20T06:07:48Z

Yes, it's new. AVX is used for the calculation of the dot product which is needed for LSTM (new in 4.00, not used with --oem 0).
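
To illustrate why only the LSTM path is affected, here is a rough sketch of runtime dispatch between an AVX dot product and a plain scalar one. It assumes the `(const double*, const double*, int)` signature visible in the gdb backtrace above; `DotProductScalar` and `SelectDotProduct` are hypothetical names, and this is not Tesseract's actual dispatch code.

```cpp
#include <cassert>

// Portable fallback: needs no SIMD instructions, so it cannot raise SIGILL.
double DotProductScalar(const double* u, const double* v, int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

using DotProductFn = double (*)(const double*, const double*, int);

// Hypothetical selector: use the AVX implementation only when detection says
// AVX is available and an implementation was compiled in. If detection is
// wrong (as in this thread), the AVX function is chosen and crashes.
DotProductFn SelectDotProduct(bool avx_available, DotProductFn avx_impl) {
  if (avx_available && avx_impl != nullptr) return avx_impl;
  return &DotProductScalar;
}
```

With `avx_available` forced to false, which is what the `IsAVXAvailable()` patch does in effect, the scalar path is always taken.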

nickbe · 2017-07-20T09:38:26Z

Maybe there's a safer method to detect the capability? Can I find out if other methods show the correct capabilities for you guys?
If you like I'd be happy to grant you access to the server.

amitdo · 2017-07-20T10:19:32Z

#1043 (comment)

nickbe · 2017-07-23T20:54:52Z

No, I meant that maybe there's a better and more reliable way for you guys to detect these kinds of features.

stweil · 2017-07-23T20:59:25Z

@nickbe, you could help by providing more information on the kind of vServer which you were using.

nickbe · 2017-07-23T21:33:35Z

Sure.
https://www.df.eu/de/cloud-hosting/
Currently it's the second smallest vServer

stweil · 2017-07-24T05:47:40Z

I just wrote to Domain Factory (in German, translated here):

One of your customers has reported a problem with the OCR application
Tesseract: #1043 (comment)

The cause of the crash seems to be the CPUID seen from the vServer guest. That CPUID does not fit the real hardware:

According to CPUID, the CPU supports AVX operations. In fact, these lead to a crash.

Doesn't your hardware support AVX (maybe an older XEON CPU)? Probably the VM of the customer migrated from newer hardware (with AVX) to an older hardware (without AVX), and now it still uses the CPUID of the newer hardware.

What do you advise users in this case?
You can also reply directly to GitHub (URL above).

XEN can set the CPUID seen by guests to avoid exactly that kind of problem: it can mask the AVX bit even when running on a new CPU with AVX support, thus allowing migration to an older CPU.

stweil · 2017-07-24T05:49:40Z

@nickbe, could you please also run cpuid --one-cpu and cpuid --one-cpu --raw and post the output?

stweil · 2017-07-24T06:01:32Z

Similar problems: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=646549, https://sourceware.org/bugzilla/show_bug.cgi?id=13007. Intel manual: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf#G16.42695.

stweil · 2017-07-24T07:52:32Z

Nick, Domain Factory support asks for the name of your Jiffy Box. Could you send me your e-mail address (get my address here)? Then I'll forward their request to you.

amitdo · 2017-09-10T14:45:51Z

@nickbe, did you manage to solve the issue?

stweil · 2017-09-10T16:33:46Z

@amitdo, I had contacted Nick's provider. They use XEN servers which do not support AVX, but the CPUID which is seen from the vServer claims that AVX is available. As far as I have understood, this happens when a XEN vServer initially runs on a server with AVX, but is migrated to another server without AVX later.

Only the provider can handle that correctly. Either the XEN vServer must always run on servers with AVX, or the XEN configuration must disable the AVX settings in CPUID even if the server has AVX support.

On the Tesseract side we could try to get a more robust AVX detection which not only checks CPUID. In addition we need an option or parameter to override the automatic selection of SSE2 / AVX.
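
A sketch of what such a check plus override could look like. The three-part test (AVX bit, OSXSAVE bit, XCR0 bits 1 and 2) is the procedure described in the Intel manual; the `SimdOverride` enum and both function names are hypothetical, and whether this would have caught the mismatch on Nick's vServer is not known.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical values for a user-facing override option.
enum class SimdOverride { kAuto, kForceOff, kForceOn };

// AVX is usable only if CPUID leaf 1 ECX reports AVX (bit 28) and OSXSAVE
// (bit 27), and the OS has enabled SSE and AVX state in XCR0 (bits 1 and 2).
// The caller must pass xcr0 = 0 when OSXSAVE is clear, because executing
// XGETBV in that case would itself fault.
bool AvxUsable(std::uint32_t leaf1_ecx, std::uint64_t xcr0) {
  const bool avx = (leaf1_ecx & (1u << 28)) != 0;
  const bool osxsave = (leaf1_ecx & (1u << 27)) != 0;
  const bool ymm_state_enabled = (xcr0 & 0x6) == 0x6;
  return avx && osxsave && ymm_state_enabled;
}

// The override wins over automatic detection in both directions.
bool UseAvx(SimdOverride mode, std::uint32_t leaf1_ecx, std::uint64_t xcr0) {
  switch (mode) {
    case SimdOverride::kForceOff:
      return false;
    case SimdOverride::kForceOn:
      return true;
    case SimdOverride::kAuto:
      break;
  }
  return AvxUsable(leaf1_ecx, xcr0);
}
```

With something like `kForceOff` exposed as an option, an affected user could run the LSTM engine without patching and recompiling.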

amitdo · 2017-09-10T19:10:49Z

Ok, @stweil. Thanks for the info.

nickbe · 2017-09-11T16:30:22Z

hi guys, yes I successfully solved the problem by following your instruction to patch the settings.
Thanks again for your support here. Very appreciated indeed :)

ken4ward · 2021-12-31T11:23:08Z

What does oem mean, and how do I set it in my java project?

zdenop closed this as completed Sep 12, 2017

amitdo added the SIMD label Aug 22, 2022
