Tesseract 4 cannot use anything other than --oem 0 #1043
Comments
what is the version of your traineddata files? Download latest version from the tessdata repo. |
Ok, so now I reinstalled tesseract just to make sure I did everything right. Current content: Now Tesseract starts but tells me that it can't load any language. Which is quite odd.
And whatever I set TESSDATA_PREFIX to (like TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata), it is not honored at all. |
Ok, I solved the language problem. After unsetting TESSDATA_PREFIX and simply using: But still --oem 1 results in:
|
When using the data files from: But when using the data files from: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files Both ways I put the files into /usr/local/share/tessdata |
Test with the tif file in testing directory. It works ok for me.
|
When you say ' Tesseract 4.00 Git Version' I take it to mean that you are using the latest source from github to build tesseract. |
That's correct. |
Please test tesseract with phototest.tif, as Shree suggested. |
OK. I tested it with the traineddata above, which is also the same one I'm using here. Again, phototest.tif works fine with --oem 0 and results in the same "illegal instruction" error for any other --oem option or none (the default should be --oem 2 if I'm not mistaken). Compilation seemed fine: I didn't see an error or warning. So I guess there must be some library missing here. I have also reinstalled Leptonica and Tesseract multiple times now. Here's how I've installed the tools:
|
build with --enable-debug and run with gdb to get additional info.
ShreeDevi
On Wed, Jul 19, 2017 at 11:05 PM, Nick ***@***.***> wrote:
OK. I tested it with the traineddata above. But also it's the same I'm
using here.
I also confirmed that tesseract is indeed using the right data folder.
But again the phototest.tif works fine with --oem 0 and results in the
same error "illegal instructions" for any other --oem option or none
(default should be --oem 2 if I'm not mistaken)
And although compilation seemed fine. I didn't see an error or warning. So
I guess there must be some library missing here.
Also I reinstalled Leptonica and Tesseract multiple times now.
Here's how I've installed the tools:
1. Make sure that the following libraries are installed:
# nickbe: I had to replace libpng12-dev for debian jessie
apt-get install autoconf-archive automake g++ libtool libleptonica-dev pkg-config
apt-get install libpango1.0-dev
# sudo apt-get install g++ # or clang++ (presumably)
sudo apt-get install autoconf automake libtool
sudo apt-get install autoconf-archive
sudo apt-get install pkg-config
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg-turbo
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
2. Install Leptonica:
git clone --depth 1 https://github.com/DanBloomberg/leptonica.git leptonica
cd leptonica
./autobuild
./configure
make
sudo make install
ldconfig
3. Install Tesseract:
git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
cd tesseract-ocr
./autogen.sh
./configure --disable-openmp --disable-shared --disable-static
or
./configure # nickbe: I TESTED BOTH CONFIGURATIONS JUST TO MAKE SURE
make
sudo make install
sudo ldconfig
# sudo make training
# sudo make training-install
sudo make install-langs # nickbe: Never does anything so far
sudo ldconfig
4. wget tessdata from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
to /usr/local/share/tessdata
Example: wget https://github.com/tesseract-ocr/tessdata/raw/4.00/eng.traineddata
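Before running OCR it can help to confirm that the language file actually landed in one of the directories mentioned in this thread. The following is only a minimal sketch: the helper name `find_traineddata` is made up for illustration and is not part of Tesseract.

```shell
# Hypothetical helper (not part of Tesseract): print the first directory
# that contains <lang>.traineddata, or return non-zero if none does.
find_traineddata() {
    lang=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$lang.traineddata" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example (paths taken from this thread):
#   find_traineddata eng /usr/local/share/tessdata /usr/share/tesseract-ocr/tessdata
```

Running `tesseract --list-langs` afterwards should show which languages Tesseract itself can see.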
|
Do not install |
apt-get remove libleptonica-dev |
OK I enabled debug. I also installed the gdb package, but I have no experience with it. How can I provide more information? |
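For anyone in the same position, one possible way to get a backtrace without learning interactive gdb is to run it in batch mode. This is a sketch, not the maintainers' procedure: the function name `tess_backtrace` and the `GDB` override variable are assumptions made for illustration, while `-batch`, `-ex` and `--args` are standard gdb options.

```shell
# Hypothetical wrapper: run tesseract under gdb non-interactively and print
# a backtrace plus the instruction at the crash address.
# $1 = image file, $2 = OCR engine mode (defaults to 1).
# GDB can be overridden (for example GDB=echo) to preview the command line.
tess_backtrace() {
    image=$1
    oem=${2:-1}
    # 'x/i $pc' stays single-quoted so the shell passes $pc through to gdb.
    "${GDB:-gdb}" -batch -ex run -ex bt -ex 'x/i $pc' \
        --args tesseract "$image" stdout --oem "$oem"
}

# Example:
#   tess_backtrace testing/phototest.tif 1
```

With an "illegal instruction" crash, the last line of the output should show the instruction the CPU refused to execute, which is usually the most useful part to paste into the issue.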
The comment is correct, so there's no point in doing that. |
I think one quite important piece of information is that I just installed the package on a fresh instance of Debian stretch. I had no problems installing Leptonica or Tesseract, but after everything was installed I see exactly the same behaviour on this machine. Tesseract runs with --oem 0 but throws the "illegal instruction" message when trying to use --oem 2 or 1. It seems that something very fundamental is missing in the installation procedure. |
Ok. I managed to run Tesseract with gdb. Here's the output:
|
|
Change this line in configure.ac and recompile tesseract again. |
btw. I already uninstalled libleptonica-dev before. Do I have to "make uninstall" before recompiling? |
You mean You don't have to in this case. Also, what's the output of |
|
After recompiling everything with the changed flag the new output is:
Android?! |
According to the output of |
The latest recompile was already done with the modified configure.ac. |
OK. You will have to do another change in the code. I will tell you later/tomorrow what to do next. |
In arch/simddetect.h Change this line I hope we will finish with this change :-) |
It's strange that Could you use the GDB debugger to step through the function Removing the code |
and recompile of course. |
Do what I said before listening to @stweil :-) |
Yes. It's strange. |
|
It's a vServer. Probably XEN but I'm not sure. We do use them quite often without problems. So I have no idea why this case is indeed so strange. I'm recompiling now... |
Yay. It's finally working. Thank you so much, guys 💃 Will the changes to the build make it into the official repository? Now that I can in fact test the new 4.0 features, is there a way to speed up scanning? Are any switches recommended? |
No, they won't, because those changes disable AVX support which is highly desired: AVX makes Tesseract faster. The problem is most probably caused by your vServer which returns a wrong cpuid. That cpuid claims that your vServer supports AVX, but it does not. You can try to get more information on that vServer (is it XEN, which version?) and report the problem. We could add a Tesseract option to select SSE / AVX (overriding the automatic detection). Then Tesseract would still crash by default in your case, but it would be possible to make it work using that new option. |
Is this something new to the 4.00 version? Because the 3.x Versions ran just fine. |
Yes, it's new. AVX is used for the calculation of the dot product which is needed for LSTM (new in 4.00, not used with |
Maybe there's a safer method to detect the capability? Can I find out if other methods show the correct capabilities for you guys? |
No, I meant maybe there's a better and more reliable way for you guys to detect this kind of feature. |
@nickbe, you could help by providing more information on the kind of vServer which you were using. |
Sure. |
I just wrote to Domain Factory (in German, translated here):
XEN can set the CPUID seen by guests to avoid exactly that kind of problem: it can mask the AVX bit even when running on a new CPU with AVX support, thus allowing migration to an older CPU. |
@nickbe, could you please also run |
Nick, Domain Factory support asks for the name of your Jiffy Box. Could you send me your e-mail address (get my address here)? Then I'll forward their request to you. |
@nickbe, did you manage to solve the issue? |
@amitdo, I had contacted Nick's provider. They use XEN servers which do not support AVX, but the CPUID which is seen from the vServer claims that AVX is available. As far as I have understood, this happens when a XEN vServer initially runs on a server with AVX, but is migrated to another server without AVX later. Only the provider can handle that correctly. Either the XEN vServer must always run on servers with AVX, or the XEN configuration must disable the AVX settings in CPUID even if the server has AVX support. On the Tesseract side we could try to get a more robust AVX detection which not only checks CPUID. In addition we need an option or parameter to override the automatic selection of SSE2 / AVX. |
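To collect evidence for a provider in a case like this, it may help to record which feature flags the guest kernel reports. The sketch below assumes a Linux guest with `/proc/cpuinfo`; the function name `cpu_has_flag` is invented for illustration. Note that these flags are themselves derived from what the hypervisor exposes, so a "yes" here does not prove that AVX instructions really execute.

```shell
# Hypothetical helper: print "yes" if a CPU feature flag is listed in a
# cpuinfo-style file, otherwise "no".
# $1 = flag name (e.g. avx), $2 = file to read (defaults to /proc/cpuinfo).
cpu_has_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # Only the first "flags" line is inspected; -w matches whole words,
    # so "avx" does not match "avx2".
    if grep -m1 '^flags' "$file" | grep -qw -- "$flag"; then
        echo yes
    else
        echo no
    fi
}

# Example:
#   cpu_has_flag avx
#   cpu_has_flag sse2
```

If this prints "yes" for `avx` while an AVX instruction still raises "illegal instruction", that mismatch is the detail worth reporting to the hosting provider.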
Ok, @stweil. Thanks for the info. |
hi guys, yes I successfully solved the problem by following your instruction to patch the settings. |
What does oem mean, and how do I set it in my java project? |
Platform is Debian Jessie - Tesseract 4.00 Git Version.
Platform: Linux localhost 4.4.27-x86_64-jb1 #4 SMP Tue Jun 6 14:41:09 CEST 2017 x86_64 GNU/Linux
Tesseract crashes with "Illegal Instruction" when using anything other than --oem 0
Tesseract -v reports
I can scan with --oem 0 though.