Invalid characters with Tesseract 5.1.0 and tessdata_fast data (for German version) when using 32-bit Microsoft compiler #3769

Closed
krzysiekj94 opened this issue Mar 17, 2022 · 36 comments

Comments

@krzysiekj94

Environment

  • Tesseract Version: 5.1.0
  • Platform: Windows 32-bit

Current Behavior:

I have the following problem:

  1. I prepared a custom build of Tesseract 5.1.0 to generate DLLs, which I then use in a 32-bit .exe application.
  2. I built the following dependencies with CMake 3.23 (without the SW build):
    a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37.
    b. Links to src:
  3. After generating the dependencies, I used them in a wrapper that uses the C API and generated a DLL (also 32-bit) that I used in the application. The list of all dependencies is as follows:
    image
  4. In the next step, I ran an OCR test in the application with the German model, deu.traineddata, from https://github.com/tesseract-ocr/tessdata_fast.
  5. At this point, I noticed worse recognition quality than with Tesseract 4.1.1, which I used earlier.
    image
    a. test file:
    test_file
  6. I noticed that there is also a problem with the slash character, which is changed to "jj" - see:
    image
  7. I would like to add that I have also prepared a Tesseract 4.1.1 build with the dependencies as in point 2b. The quality of OCR did not change then.
  8. I use tessdata_best as a temporary workaround (and it works), but the OCR speed for this model is not satisfactory for me.

Expected Behavior:

I expect Tesseract 5.1.0 to recognize characters correctly with the tessdata_fast model, i.e. without converting "l" or "m" to "j", or "i" to "j", for example. I would like character recognition to work similarly to Tesseract 4.1.1.

Suggested Fix:

Consider updating the deu.traineddata model published at:
https://github.com/tesseract-ocr/tessdata_fast

@stweil
Member

stweil commented Mar 17, 2022

Tesseract 5 still supports the model files from Tesseract 4 with the "legacy mode", so if you are happy with that, you can use it.

@stweil
Member

stweil commented Mar 17, 2022

@krzysiekj94, I get a different result:

tesseract https://user-images.githubusercontent.com/12548678/158796308-0e0e8e57-ad24-4eb5-b70a-0c6b99722663.png - -l tessdata_fast/deu
Siegfried Aalfelden
Kurt-Schumacher-Platz 10
13405 Berlin

26.02.2019

Sehr geehrter Herr Aalfelden,

Informationen

Das Internet (von englisch internetwork, zusammengesetzt aus dem Präfix inter und
network „Netzwerk“ oder kurz net ‚Netz‘), umgangssprachlich auch Netz, ist ein
weltweiter Verbund von Rechnernetzwerken, den autonomen Systemen. Es ermöglicht
die Nutzung von Internetdiensten wie WWW, E-Mail, Telnet, SSH, XMPP, MOTT und
FTP. Dabei kann sich jeder Rechner mit jedem anderen Rechner verbinden. Der
Datenaustausch zwischen den über das Internet verbundenen Rechnern erfolgt über
die technisch normierten Internetprotokolle. Die Technik des Internets wird durch die
RFCs der Internet Engineering Task Force (IETF) beschrieben.

Die Verbreitung des Internets hat zu umfassenden Umwälzungen in vielen
Lebensbereichen geführt. Es trug zu einem Modernisierungsschub in vielen
Wirtschaftsbereichen sowie zur Entstehung neuer Wirtschaftszweige bei und hat zu
einem grundlegenden Wandel des Kommunikationsverhaltens und der Mediennutzung
im beruflichen und privaten Bereich geführt. Die kulturelle Bedeutung dieser
Entwicklung wird manchmal mit der Erfindung des Buchdrucks gleichgesetzt.

Die Übertragung von Daten im Internet unabhängig von ihrem Inhalt, dem Absender
und dem Empfänger wird als Netzneutralität bezeichnet.

Mit freundlichen Grüßen

@krzysiekj94
Author

Hello @stweil, thanks for the response. I have more questions now:

  1. Where exactly can I get the tessdata_fast data for "legacy mode"?
    Based on the documentation at https://github.com/tesseract-ocr/tessdata_fast, I can see that legacy mode is not supported.
  2. Can legacy support be enabled when building Tesseract with CMake via CMakeLists.txt? If so, where is that option?
    This is my CMake configuration for tesseract:
    test
  3. With which version of Tesseract did you get the correct result? Where can I get the exact version you used? Is it 32-bit or 64-bit? Can you send a link to this version?
  4. Is it possible that there is a different behavior for tessdata_fast on 32-bit and 64-bit versions of Tesseract 5.0.1?
  5. Can there be problems with the use of the C API? Perhaps I should switch to the object-oriented API?
    Below is an example of how I initialize Tesseract in my wrapper code:
    image
    image

Thanks in advance for your answer! Have a nice day.

@stweil
Member

stweil commented Mar 17, 2022

Please try the OCR with the default tesseract application. If that works fine (like it does in my test) you have to find out what you have to fix in your application.

@stweil
Member

stweil commented Mar 17, 2022

Please use the Tesseract user forum for questions. The GitHub issues are not a support forum.

You might try the Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki/.

@zdenop
Contributor

zdenop commented Mar 17, 2022

The legacy model is available only in https://github.com/tesseract-ocr/tessdata.

@Shreeshrii
Collaborator

I use tessdata_best as a temporary workaround (and it works), but the OCR speed for this model is not satisfactory for me.

Then please try, as suggested above, the model from https://github.com/tesseract-ocr/tessdata, which has legacy models as well as the 'fast' version of the 'tessdata_best' models. Both are available in the same traineddata file and are invoked with different --oem settings.

@krzysiekj94
Author

krzysiekj94 commented Mar 23, 2022

Please try the OCR with the default tesseract application. If that works fine (like it does in my test) you have to find out what you have to fix in your application.

  1. After installing the Mannheim build, the OCR seems to perform fine, but I don't know where the difference comes from. When I build the 32-bit console tesseract.exe application myself, it works incorrectly with the tessdata_fast data -> see:
    image. In my opinion it may be something related to the Visual Studio compiler? Is this the right direction? I'll check it out again....

@krzysiekj94
Author

krzysiekj94 commented Mar 23, 2022

I found an issue that seems similar to my problem: #3283
As a test, I disabled the /O2 optimization option. I was very surprised, because with /O2 disabled the OCR returned almost the same text as Tesseract 4.1.1, which is what I expected. See below for the settings in Visual Studio 2019 and for the differences in the text:

image

image

Attention:
Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences? Maybe someone has similar problems and experiences? I feel close to solving the problem.

@stweil
Member

stweil commented Mar 23, 2022

Related issues: #2898 and #3283.

@stweil
Member

stweil commented Mar 23, 2022

After installing the Mannheim build, the OCR seems to perform fine, but I don't know where the difference comes from.

The UB Mannheim binaries are built with the GNU compiler. Therefore they don't have this issue.

@stweil stweil changed the title from "Invalid characters when recognizing in Tesseract 5.1.0 with tessdata_fast data (for German version)" to "Invalid characters with Tesseract 5.1.0 and tessdata_fast data (for German version) when using 32-bit Microsoft compiler" Mar 23, 2022
@zdenop
Contributor

zdenop commented Mar 23, 2022

64-bit works for me:

>tesseract -v
tesseract 5.1.0-7-g0e526
 leptonica-1.83.0 (Jan 26 2022, 19:15:03) [MSC v.1929 LIB Release x64]
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 2019
 Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 libzstd/1.4.9
 Found libcurl/7.75.0 zlib/1.2.11 libssh2/1.10.1_DEV
>tesseract i3769.png - -l tessdata_fast/deu
Siegfried Aalfelden
Kurt-Schumacher-Platz 10
13405 Berlin

26.02.2019

Sehr geehrter Herr Aalfelden,

Informationen

Das Internet (von englisch internetwork, zusammengesetzt aus dem Präfix inter und
network „Netzwerk“ oder kurz net ‚Netz‘), umgangssprachlich auch Netz, ist ein
weltweiter Verbund von Rechnernetzwerken, den autonomen Systemen. Es ermöglicht
die Nutzung von Internetdiensten wie WWW, E-Mail, Telnet, SSH, XMPP, MOTT und
FTP. Dabei kann sich jeder Rechner mit jedem anderen Rechner verbinden. Der
Datenaustausch zwischen den über das Internet verbundenen Rechnern erfolgt über
die technisch normierten Internetprotokolle. Die Technik des Internets wird durch die
RFCs der Internet Engineering Task Force (IETF) beschrieben.

Die Verbreitung des Internets hat zu umfassenden Umwälzungen in vielen
Lebensbereichen geführt. Es trug zu einem Modernisierungsschub in vielen
Wirtschaftsbereichen sowie zur Entstehung neuer Wirtschaftszweige bei und hat zu
einem grundlegenden Wandel des Kommunikationsverhaltens und der Mediennutzung
im beruflichen und privaten Bereich geführt. Die kulturelle Bedeutung dieser
Entwicklung wird manchmal mit der Erfindung des Buchdrucks gleichgesetzt.

Die Übertragung von Daten im Internet unabhängig von ihrem Inhalt, dem Absender
und dem Empfänger wird als Netzneutralität bezeichnet.

Mit freundlichen Grüßen

@zdenop
Contributor

zdenop commented Mar 23, 2022

Can you try /Ox instead of /O2?

@krzysiekj94
Author

krzysiekj94 commented Mar 24, 2022

Hello @zdenop .

1). The problem still exists with /Ox. OCR returns the same result as with the /O2 flag.

image

image

2). With the /O1 flag, the results are even worse:

image

@krzysiekj94
Author

64bit works for me.:

In my case, unfortunately, I can't use the x64 version because I have a 32-bit application that uses Tesseract's DLLs :(

@amitdo
Collaborator

amitdo commented Mar 24, 2022

Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences?

You tried it yourself with a good result. The expected consequence is much slower program execution.

Which version of MSVC 2019 exactly do you use? If it's not the latest one (16.11.11), can you upgrade to the latest one and retest?

If the issue still exists with the latest MSVC 2019 version, I suggest sending a new bug report to Microsoft, or reusing this one: https://developercommunity2.visualstudio.com/t/1336629.

@amitdo amitdo closed this as completed Mar 25, 2022
@amitdo
Collaborator

amitdo commented Mar 25, 2022

This is similar to issue #3283.

I closed this issue because it seems to be an issue with MSVC, not with Tesseract.

If a future version of MSVC solves the issue, let us know.

@amitdo amitdo added the msvc label Mar 25, 2022
@krzysiekj94
Author

krzysiekj94 commented Mar 25, 2022

At the moment I'm using VS version 16.9.6 (older version) but I compiled on a different computer with the same VS 2019 x86 version. Interestingly, with /O2 optimization, but without AVX2, OCR works fine. Why? I do not know.

Edit: However, I noticed that after copying the generated Tesseract from a computer without AVX2 support, the problem occurs with copied dll's on a computer that supports AVX2. So I'll have to check on VS 16.11.11 anyway.

image

@stweil
Member

stweil commented Mar 26, 2022

Interestingly, with /O2 optimization, but without AVX2, OCR works fine.

So the Microsoft compiler creates buggy code with /O2 for intsimdmatrixavx2.cpp.

@krzysiekj94, you could try to add #pragma optimize( "", off ) in that file and test whether that fixes the issue. If that works, you could also try #pragma optimize( "s", on ) as an additional pragma.

@krzysiekj94
Author

krzysiekj94 commented Mar 26, 2022

Interestingly, with /O2 optimization, but without AVX2, OCR works fine.

So the Microsoft compiler creates buggy code with /O2 for intsimdmatrixavx2.cpp.

@krzysiekj94, you could try to add #pragma optimize( "", off ) in that file and test whether that fixes the issue. If that works, you could also try #pragma optimize( "s", on ) as an additional pragma.

Hello @stweil ,

1). Adding only #pragma optimize( "", off ) to intsimdmatrixavx2.cpp appears to work - see the code and the compared results:
image
image

2). After adding only #pragma optimize( "s", on ) to intsimdmatrixavx2.cpp, the OCR quality is worse:

image
image

3). After adding #pragma optimize( "", off ) and #pragma optimize( "s", on ) together, I get the same result as with only #pragma optimize( "", off ):

image
image

My question is: I understand that by "you could also try #pragma optimize (" s ", on) as an additional pragma" you mean using these two #pragma together - as in step 3?

@krzysiekj94
Author

Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences?

You tried it yourself with a good result. The expected consequence is much slower program execution.

Which version of MSVC 2019 exactly do you use? If it's not the latest one (16.11.11), can you upgrade to the latest one and retest?

If the issue still exists with the latest MSVC 2019 version, I suggest sending a new bug report to Microsoft, or reusing this one: https://developercommunity2.visualstudio.com/t/1336629.

@amitdo On version 16.11.11 the problem still recurs. I checked it.

@stweil
Member

stweil commented Mar 26, 2022

My question is: I understand that by "you could also try #pragma optimize (" s ", on) as an additional pragma" you mean using these two #pragma together - as in step 3?

Yes, that's right. The first pragma disables the optimization options from your build environment. This was expected to work, but disabling all optimizations might result in bad performance. The second pragma therefore enables size optimization (similar to compiler option /Os). See the Microsoft documentation for details.

Now those two pragma statements should be included conditionally, namely only for 32-bit builds and those compiler versions which show the bug. Maybe you can find out how this can be done with preprocessor conditionals. Then those code lines can be added to the official code.

@krzysiekj94
Author

Below is a suggested fix for VS x86 versions 16.5 - 16.11 (https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170).

See: https://pastebin.com/K1q5PzRf

image

@stweil
Member

stweil commented Mar 26, 2022

That looks good. Do you want to send a pull request? Then just add a comment (and an empty line after line 16).

@zdenop
Contributor

zdenop commented Mar 27, 2022

The problem is with the 32-bit build only, so there should be a check for a 64-bit (_WIN64) build, as _WIN32 is defined for both 32-bit and 64-bit builds.

@amitdo
Collaborator

amitdo commented Mar 27, 2022

In #3283, Windows 10 64-bit with VS 2019 32-bit build was used. How can we detect this combination?

@stweil
Member

stweil commented Mar 27, 2022

Only a build-time check is needed (#if ... defined(_WIN32) && !defined(_WIN64) ...). The resulting 32-bit code fails on both 32-bit and 64-bit Windows.
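
As a rough illustration of the build-time check described above, the sketch below guards the two pragmas so that they apply only to 32-bit MSVC builds. The _MSC_VER range 1925 - 1929 (VS 2019 16.5 - 16.11) is taken from the proposal earlier in this thread and is only an assumption about which compilers are affected. MSVC_X86_AVX2_WORKAROUND and needs_workaround are hypothetical names, added here just to make the condition visible and checkable.

```cpp
// Sketch only: the version range below is an assumption taken from this
// thread, not a verified list of affected compiler versions.
#if defined(_MSC_VER) && defined(_WIN32) && !defined(_WIN64) && \
    _MSC_VER >= 1925 && _MSC_VER <= 1929
#  pragma optimize("", off)  // drop the optimization options set by the build
#  pragma optimize("s", on)  // then ask for size optimization, as suggested above
#  define MSVC_X86_AVX2_WORKAROUND 1  // hypothetical marker macro
#else
#  define MSVC_X86_AVX2_WORKAROUND 0  // other compilers and 64-bit targets
#endif

// Hypothetical helper that mirrors the preprocessor condition as ordinary
// code, so the logic can be checked without an MSVC toolchain.
constexpr bool needs_workaround(int msc_ver, bool win32, bool win64) {
  return win32 && !win64 && msc_ver >= 1925 && msc_ver <= 1929;
}
```

With any other compiler, or with a 64-bit MSVC target, the pragmas are skipped and the file is compiled with the normal build options.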

@krzysiekj94
Author

krzysiekj94 commented Mar 27, 2022

@amitdo
Hmm, from what I can see, after switching between x86 and x64 in the combobox, the #pragma section turns on / off once defined(WIN32) is added - this is probably what MS uses to detect the compile mode. Maybe this is a way to solve the problem?

image
image

Code below:

#if defined(_MSC_VER) && defined(_WIN32) && defined(WIN32) && _MSC_VER >= 1925 && _MSC_VER <= 1929
#pragma optimize("", off)
#pragma optimize("s", on)
#endif

Article showing differences with using _WIN32 & WIN32: https://accu.org/journals/overload/24/132/wilson_2223/
I hope I understood it correctly.

@zdenop
Contributor

zdenop commented Mar 27, 2022

WIN32 is defined by the SDK or the build environment, so it does not use the implementation reserved namespace

see: https://stackoverflow.com/questions/662084/whats-the-difference-between-the-win32-and-win32-defines-in-c

The non-underscore WIN32 is not well documented and appears to have no bearing on 32 vs 64 machine type. Standard Visual C++ projects for Windows generally don't appear to use it (it may not be in use at all).

see: https://stackoverflow.com/questions/17380340/win32-preprocessor-definition-in-64bit-windows-platform/51682888#51682888

Also https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160 mentions only _WIN32 / _WIN64

@stweil
Member

stweil commented Mar 27, 2022

That's right, and it should be sufficient to use only those two official macros (see my previous comment).

@BJungmann

I noticed that after copying the generated Tesseract from a computer without AVX2 support, the problem occurs with copied dll's on a computer that supports AVX2

When I did my tests for #3283, I also tried to disable AVX2 usage with the statement
avx2_available_ = false;
in src/arch/simddetect.cpp, line 200. This is where the decision is made at runtime, so it should work on machines that have avx2 available.
This gave good results, and did not slow down that much. So I suggest to enclose this statement in the proper _MSC_VER macros, rather than turning off the optimizations with #pragma optimize.

On version 16.11.11 the problem still recurs. I checked it.

Thank you for checking that. So my preferred workaround is still using VS 2019 with platform toolset v141 (which belongs to VS 2017) - you need no code patch then.
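
A minimal sketch of the alternative suggested above: keep optimizations on and instead stop the runtime selection from picking the AVX2 path on affected builds. This is not the code in src/arch/simddetect.cpp. kForceDisableAvx2, use_avx2, select_dot and the two dot_* stand-ins are hypothetical names, and the compiler version range is the one assumed earlier in this thread.

```cpp
// Sketch only: all names here are illustrative, not Tesseract identifiers.
#if defined(_MSC_VER) && defined(_WIN32) && !defined(_WIN64) && \
    _MSC_VER >= 1925 && _MSC_VER <= 1929
constexpr bool kForceDisableAvx2 = true;   // assumed-affected 32-bit MSVC build
#else
constexpr bool kForceDisableAvx2 = false;  // every other compiler or target
#endif

// Hypothetical stand-ins for a generic and an AVX2-specific code path.
constexpr int dot_generic(int a, int b) { return a * b; }
constexpr int dot_avx2(int a, int b) { return a * b; }

using DotFn = int (*)(int, int);

// The AVX2 path is used only if the CPU reports it and the build allows it.
constexpr bool use_avx2(bool detected_avx2, bool force_disable) {
  return detected_avx2 && !force_disable;
}

// Runtime selection: `detected_avx2` stands in for the result of a CPU check
// made when the program starts, so the machine used for the build is irrelevant.
constexpr DotFn select_dot(bool detected_avx2,
                           bool force_disable = kForceDisableAvx2) {
  return use_avx2(detected_avx2, force_disable) ? dot_avx2 : dot_generic;
}
```

Because the choice is made when the program runs, a DLL built on a machine without AVX2 still takes the AVX2 path on a machine that has it, which matches the observation quoted above.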

@krzysiekj94
Author

krzysiekj94 commented Mar 27, 2022

Hi @BJungmann,

1). Thanks for the suggestion to use the VS 2017 toolset. I made a sample build with version 15.9.45 Community - see below:
image
2). I ran performance tests. The difference is dramatic, in favor of VS 2017, compared with disabling optimization in intsimdmatrixavx2.cpp - see the test below for a 7-page TIFF. I haven't checked your patch, but I think it will take longer than that as well.
The OCR results are identical.
image

3). IMO, it seems that any change made with #pragma will increase the OCR execution time...
I saw that you have already reported this problem to the Microsoft team, but it has the status "Closed - Not Enough Info" - https://developercommunity2.visualstudio.com/t/1336629. Will you report this issue to Microsoft again? I was wondering whether to do it myself, but maybe you already have it in your plans?

@BJungmann

Indeed execution time with the avx2_available_ patch is increased, but considerably less than with all optimizations turned off. This is the reason why I still recommend platform toolset v141.
I have no current plans to start a new effort with Microsoft. They like very short demonstration code for bug reports. A short main program and data set that shows wrong results would be feasible. But I do not understand enough details in the tesseract code using the MatrixDotVector functions, to see which call to which function with which data produces different results if executed with AVX2 hardware.
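
To make the idea of a short demonstration program more concrete, here is a hedged sketch of a comparison harness. It does not call Tesseract and does not reproduce the bug: dot_row is a plain scalar reference for one matrix row times a vector, dot_row_unrolled is a hypothetical stand-in for the implementation under test (in a real report this would be the AVX2 routine), and first_mismatch reports the first row where the two disagree. All names and signatures are assumptions made for illustration.

```cpp
#include <cstdint>

// Scalar reference: sum of w[i] * u[i] over n elements, accumulated in 32 bits.
constexpr std::int32_t dot_row(const std::int8_t* w, const std::int8_t* u, int n) {
  std::int32_t sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += static_cast<std::int32_t>(w[i]) * static_cast<std::int32_t>(u[i]);
  }
  return sum;
}

// Hypothetical candidate: the same arithmetic, two elements per step.
constexpr std::int32_t dot_row_unrolled(const std::int8_t* w, const std::int8_t* u, int n) {
  std::int32_t sum = 0;
  int i = 0;
  for (; i + 1 < n; i += 2) {
    sum += w[i] * u[i] + w[i + 1] * u[i + 1];
  }
  if (i < n) sum += w[i] * u[i];  // leftover element when n is odd
  return sum;
}

using RowFn = std::int32_t (*)(const std::int8_t*, const std::int8_t*, int);

// Returns the index of the first row where candidate and reference disagree,
// or -1 if every row matches. `weights` is row-major with `cols` per row.
constexpr int first_mismatch(const std::int8_t* weights, const std::int8_t* input,
                             int rows, int cols, RowFn candidate) {
  for (int r = 0; r < rows; ++r) {
    const std::int8_t* row = weights + r * cols;
    if (candidate(row, input, cols) != dot_row(row, input, cols)) return r;
  }
  return -1;
}
```

A report built this way would pair a small weight matrix and input vector with the row index returned by first_mismatch, which is the kind of short, data-driven demonstration described above.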

@amitdo amitdo reopened this Mar 30, 2022
@amitdo
Collaborator

amitdo commented Mar 30, 2022

@stweil, can you push a workaround for this issue?

stweil added a commit to stweil/tesseract that referenced this issue Mar 30, 2022
@stweil
Member

stweil commented Mar 30, 2022

@stweil, can you push a workaround for this issue?

Something like #3778?

@amitdo
Collaborator

amitdo commented Mar 30, 2022

Yes :-)
