Invalid characters with Tesseract 5.1.0 and tessdata_fast data (for German version) when using 32-bit Microsoft compiler #3769
Comments
Tesseract 5 still supports the model files from Tesseract 4 with the "legacy mode", so if you are happy with that, you can use it. |
@krzysiekj94, I get a different result:
|
Hello @stweil, thanks for the response. I have more questions now:
Thanks in advance for your answer! Have a nice day. |
Please try the OCR with the default |
Please use the Tesseract user forum for questions. The GitHub issues are not a support forum. You might try the Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki/. |
Legacy model is available only in https://github.com/tesseract-ocr/tessdata. |
Then please try, as suggested above, with model from https://github.com/tesseract-ocr/tessdata which has legacy models as well as the 'fast' version of 'tessdata_best' models. Both are available in the same traineddata file, invoked with different --oem settings. |
I found an issue that seems to be similar to my problem: #3283 Attention: |
The UB Mannheim binaries are built with the GNU compiler. Therefore they don't have this issue. |
The 64-bit build works for me:
|
Can you try /Ox instead of /O2? |
Hello @zdenop. 1). The problem still exists with /Ox. OCR returns the same result as with the /O2 flag. 2). With the /O1 flag, the results are even worse: |
In my case, unfortunately, I can't use the x64 version because I have a 32-bit application that uses Tesseract's .dll's :( |
You tried it yourself with a good result. The expected consequence is much slower program execution. Which version of MSVC 2019 exactly do you use? If it's not the latest one (16.11.11), can you upgrade to the latest one and retest? If the issue still exists with the latest MSVC 2019 version, I suggest sending a new bug report to Microsoft, or reusing this one: https://developercommunity2.visualstudio.com/t/1336629. |
This is similar to issue #3283. I closed this issue because it seems to be an issue with MSVC, not with Tesseract. If a future version of MSVC solves the issue, let us know. |
At the moment I'm using VS version 16.9.6 (an older version), but I also compiled on a different computer with the same VS 2019 x86 version. Interestingly, with /O2 optimization but without AVX2, OCR works fine. Why? I do not know. Edit: However, I noticed that after copying the Tesseract DLLs generated on the computer without AVX2 support to a computer that supports AVX2, the problem occurs with the copied DLLs. So I'll have to check on VS 16.11.11 anyway. |
So the Microsoft compiler creates buggy code with @krzysiekj94, you could try to add |
Hello @stweil,
1). Adding only #pragma optimize( "", off ) in intsimdmatrixavx2.cpp appears to work - see code and compared results:
2). After adding only #pragma optimize( "s", on ) in intsimdmatrixavx2.cpp, the OCR quality is worse.
3). After adding #pragma optimize( "", off ) and #pragma optimize( "s", on ) together, I get the same result as with only #pragma optimize( "", off ).
My question is: by "you could also try #pragma optimize( "s", on ) as an additional pragma", do you mean using these two pragmas together, as in step 3? |
@amitdo On version 16.11.11 the problem still recurs. I checked it. |
Yes, that's right. The first pragma disables the optimization options from your build environment. This was expected to work, but disabling all optimizations might result in bad performance. The second pragma therefore enables size optimization (similar to compiler option Now those two |
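For readers following along, here is a minimal sketch of how the two pragmas discussed above could be combined in one translation unit. It is an illustration only, not the patch that was applied to Tesseract: the function `DotProductSketch` is a hypothetical stand-in for the integer kernels in intsimdmatrixavx2.cpp, and the guard on `_MSC_VER` is an assumption so that other compilers ignore the pragmas.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
// Assumption: the pragmas sit at file scope, before the affected functions.
#pragma optimize("", off)  // turn off the optimizations set by the build options
#pragma optimize("s", on)  // then re-enable optimization that favors small code
#endif

// Hypothetical stand-in for an integer dot-product kernel; not Tesseract code.
std::int32_t DotProductSketch(const std::int8_t *u, const std::int8_t *v,
                              std::size_t n) {
  std::int32_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += static_cast<std::int32_t>(u[i]) * static_cast<std::int32_t>(v[i]);
  }
  return total;
}

#if defined(_MSC_VER)
#pragma optimize("", on)  // restore the build's optimization settings
#endif
```

The closing `#pragma optimize("", on)` limits the effect to the functions between the pragmas, so the rest of the file would still be compiled with the project's usual flags.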
I have added below a suggested fix for VS x86 versions 16.5 - 16.11 (https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170). |
That looks good. Do you want to send a pull request? Then just add a comment (and an empty line after line 16). |
The problem is with the 32-bit build only so there should be a check for 64-bit ( |
In #3283, Windows 10 64-bit with VS 2019 32-bit build was used. How can we detect this combination? |
Only a build-time check is needed ( |
@amitdo Code below: #if defined(_MSC_VER) && defined(_WIN32) && defined(WIN32) && _MSC_VER >= 1925 && _MSC_VER <= 1929 Article showing differences with using _WIN32 & WIN32: https://accu.org/journals/overload/24/132/wilson_2223/ |
Also https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160 mentions only |
That's right, and it should be sufficient to use only those two official macros (see my previous comment). |
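A build-time check along the lines discussed here might look like the sketch below. It assumes the two official macros meant are `_WIN32` (defined by MSVC for both 32-bit and 64-bit Windows targets) and `_WIN64` (defined only for 64-bit targets); the macro name `SKETCH_MSVC2019_X86_WORKAROUND` and the helper `NeedsMsvc32Workaround` are hypothetical names chosen for illustration. The `_MSC_VER` range 1925 to 1929 is taken from the code quoted earlier in the thread.

```cpp
// Build-time detection of a 32-bit build with MSVC 2019 (16.5 - 16.11).
// _WIN32 is defined for 32-bit and 64-bit targets, so the 64-bit case
// is excluded by additionally requiring that _WIN64 is NOT defined.
#if defined(_MSC_VER) && defined(_WIN32) && !defined(_WIN64) && \
    _MSC_VER >= 1925 && _MSC_VER <= 1929
#define SKETCH_MSVC2019_X86_WORKAROUND 1  // hypothetical macro name
#else
#define SKETCH_MSVC2019_X86_WORKAROUND 0
#endif

// Hypothetical helper exposing the result of the preprocessor check.
constexpr bool NeedsMsvc32Workaround() {
  return SKETCH_MSVC2019_X86_WORKAROUND != 0;
}
```

Because everything is decided by the preprocessor, a 32-bit build made on 64-bit Windows would still be detected: the macros describe the compilation target, not the machine running the compiler.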
When I did my tests for #3283, I also tried to disable AVX2 usage with the statement
Thank you for checking that. So my preferred workaround is still using VS 2019 with platform toolset v141 (which belongs to VS 2017) - you need no code patch then. |
Hi @BJungmann, 1). Thanks for the suggestion to use the VS 2017 toolset. I made a sample build with version 15.9.45 Community - see below: 3). IMO, it seems that any #pragma change will increase the OCR execution time... |
Indeed execution time with the avx2_available_ patch is increased, but considerably less than with all optimizations turned off. This is the reason why I still recommend platform toolset v141. |
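To make the trade-off concrete, the following sketch shows the general idea of forcing an AVX2 flag off so that a slower fallback kernel is selected. It is not Tesseract's actual SIMD detection code: only the member name `avx2_available_` comes from this thread, while `SimdChoiceSketch`, `SelectKernel`, `kForceAvx2Off`, and the kernel names are hypothetical.

```cpp
#include <string>

// Assumption: the workaround applies to 32-bit MSVC builds only.
#if defined(_MSC_VER) && defined(_WIN32) && !defined(_WIN64)
constexpr bool kForceAvx2Off = true;
#else
constexpr bool kForceAvx2Off = false;
#endif

// Hypothetical holder for the detection result.
struct SimdChoiceSketch {
  bool avx2_available_;
};

// Returns the name of the kernel that would be selected at run time.
std::string SelectKernel(bool cpu_reports_avx2, bool force_off) {
  // AVX2 is used only if the CPU reports it and the workaround is inactive.
  SimdChoiceSketch choice{cpu_reports_avx2 && !force_off};
  return choice.avx2_available_ ? "avx2" : "fallback";
}
```

A call such as `SelectKernel(true, kForceAvx2Off)` would pick the fallback path on an affected build even when the CPU supports AVX2. A run-time selection of this kind would also be consistent with the earlier observation that the same DLLs misbehave only on an AVX2-capable machine.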
@stweil, can you push a workaround for this issue? |
Signed-off-by: Stefan Weil <[email protected]>
Yes :-) |
Environment
Current Behavior:
I have the following problem:
a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37.
b. Links to src:
c. I also modified CMakeLists.txt slightly so that Tesseract can generate .dll files - see:
CMakeLists.txt
a. test file:
Expected Behavior:
I expect Tesseract 5.1.0 to recognize characters correctly, i.e. not to convert, for example, "l" or "m" to "j", or "i" to "j", when using the tessdata_fast models. I would like character recognition to work similarly to Tesseract 4.1.1.
Suggested Fix:
Consideration of an upgrade for deu.traineddata models on the website:
https://github.com/tesseract-ocr/tessdata_fast