Slow OCR with tesseract 4.00alpha #285
Comments
Uh no idea, there are actually only three calls to tesseract really, see [1], so I'm not doing anything terribly fancy client side. Might be worth asking upstream. [1] https://github.com/manisandro/gImageReader/blob/master/gtk/src/Recognizer.cc#L490 |
Hm strange, the difference between command line tesseract and gImageReader is even bigger 2.4s vs. 1:30 min! Both use all available CPU cores. Yes, asking upstream seems reasonable. Probably someone should double check first? |
Are you testing in plain text or hOCR mode or both? Any difference between the two? |
Plain text before, now hOCR: tesseract: 2.7s vs. gImageReader 1:11 min |
:\ That's pretty disastrous |
Can you try playing with the page segmentation modes in the menu of the recognition button? |
No changes in speed, but option two (from top) gives a crash: https://pastebin.com/Xwxqxbes while On the console:
|
I'll build the latest tesseract git and do some testing. |
The main difference I see is that tesseract 4.00 uses multiple threads while 3.04 does not. |
There are indeed reports of openmp causing slowdowns, i.e. tesseract-ocr/tesseract#961 |
Well, but that doesn't explain why tesseract 4.00 with openmp is fast for me |
Uhm, so testing latest gimagereader git and tesseract-4.00.gitbc668da, recognizing a 10 page document in hOCR mode takes 46 seconds, while 3.05.00 takes 38 seconds. So not that much of a slowdown. Tesseract compiled without passing any particular options to |
Using upstream's tesseract without modification didn't change anything. I used gImageReader 3.2.3 though (the gtk version); it uses all 4 of my cores. |
Does it depend on the language used? Any particularly complex document? Can you try with latest gimagereader-git? |
No, language doesn't change anything. The test image is here: https://imgur.com/a/dkpHz - not complex at all. |
Hmm, it processes in about 3 secs here |
Latest git doesn't change anything, but setting Contrast to 25 reduces the time to ~4 secs |
Which leptonica version do you have? |
leptonica 1.74.4. Strange: applying contrast (e.g. removing the gray background) with gimp doesn't change the recognition time, but changing anything in the image processing tab of gimagereader reduces the time to ~4 secs. |
The image processing code uses OpenMP itself [1] (and is actually the only part of gImageReader itself that does). I'm not sufficiently knowledgeable about the OpenMP internals, but it looks like merely executing that OpenMP block before tesseract does somehow affects how the OpenMP blocks are run in tesseract. To confirm this theory, you could try adding a dummy loop somewhere in main.cpp like
and see whether it affects tesseract performance. Also, would be interesting to see how many threads are used in tesseract with/without the OpenMP blocks executing in gImageReader. [1] https://github.com/manisandro/gImageReader/blob/master/gtk/src/DisplayRenderer.cc#L41 |
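The snippet from that comment did not survive on this page. As a rough sketch of the kind of dummy loop meant, something like the following could be called once early in main.cpp, before any recognition starts. `warmUpOpenMP` is a made-up name for illustration, not something from gImageReader or tesseract, and the loop body is arbitrary. It also reports how many threads the OpenMP runtime used, which helps with the "how many threads" question. Build with `-fopenmp`; without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of gImageReader): runs a trivial
// OpenMP-parallelized loop and returns the number of threads used.
int warmUpOpenMP() {
    int threadsUsed = 1;  // stays 1 when built without OpenMP
    long sink = 0;        // result is only used so the loop is not optimized away

#pragma omp parallel for reduction(+ : sink)
    for (int i = 0; i < 1000; ++i) {
        sink += i;
#ifdef _OPENMP
        // Exactly one iteration writes this, so there is no data race.
        if (i == 0) threadsUsed = omp_get_num_threads();
#endif
    }

    std::printf("dummy loop: sum=%ld, threads=%d\n", sink, threadsUsed);
    return threadsUsed;
}
```

Calling `warmUpOpenMP()` before the tesseract calls and comparing recognition time with and without the call would be one way to test the theory.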
Indeed, that fixes the problem. I pushed a minimal fix here: #286 |
Thanks - I'll try and make a minimal reproducer and then I'll report the issue upstream. If there is no satisfactory resolution in time for the upcoming 3.3.0 release, I'll add the workaround. |
I'm still having trouble actually reproducing this. Can you reproduce the slowdown with a minimal example like the one below? Possibly try adding some random omp-parallelized loop before and after the
|
No, I can't reproduce the slowdown with this example, regardless of where/if I put an omp-parallelized loop (and linking with -fopenmp). |
I've observed that on machines without AVX2 (pre-Haswell, 2013), Tesseract 4.00 runs about 10 times slower than Tesseract 3.05, especially when "best" training data is used. This is reproducible by disabling Details here: |
Ah, that actually explains it :-) I have an i5-3470 (pre-Haswell) here. Added this info to the upstream bug report: tesseract-ocr/tesseract#1278 . Still odd why this simple 'fix' #286 works... |
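For anyone wanting to check whether their own machine falls in the pre-Haswell category, here is a minimal sketch. It assumes GCC or Clang on x86, where the `__builtin_cpu_supports` builtin is available; on other compilers or architectures it just reports `false`. `cpuHasAvx2` is a name chosen for illustration and is not how tesseract itself does its detection.

```cpp
#include <cstdio>

// Hypothetical helper: returns true if the CPU reports AVX2 support.
// Falls back to false where the GCC/Clang x86 builtin is unavailable.
bool cpuHasAvx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Prints a one-line summary of the result.
void reportAvx2() {
    std::printf("AVX2 supported: %s\n", cpuHasAvx2() ? "yes" : "no");
}
```

On an Ivy Bridge CPU such as the i5-3470 mentioned above, this would be expected to report "no".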
I built gImageReader with the latest Tess and tested with fast data under Fedora, and I do not have the slow speed issue. But slow OCR occurs when I build gImageReader with the latest Tess compiled with MinGW for Windows. Does anyone have an idea how to solve it? |
Closing since this is not a gImageReader issue. |
After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty cli solution for running on a single thread. Works wonderfully for me.
If you want to check that you actually are running on one thread, type:
Then run gImageReader:
Et voilà :o) |
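The commands from this comment are missing on this page, so they are not reconstructed here. As a separate, hedged sketch: assuming the single-thread setup goes through the standard OpenMP environment variables `OMP_THREAD_LIMIT` or `OMP_NUM_THREADS` (an assumption, the comment does not say), a small program started from the same shell can show what limits an OpenMP runtime would see. `effectiveThreadLimit` and `reportThreadSettings` are illustrative names only.

```cpp
#include <cstdio>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: the thread limit the OpenMP runtime reports,
// or 1 when the program was built without OpenMP support.
int effectiveThreadLimit() {
#ifdef _OPENMP
    return omp_get_thread_limit();
#else
    return 1;
#endif
}

// Prints the relevant environment variables and the runtime's view of them.
void reportThreadSettings() {
    const char* limit = std::getenv("OMP_THREAD_LIMIT");
    const char* num = std::getenv("OMP_NUM_THREADS");
    std::printf("OMP_THREAD_LIMIT=%s\n", limit ? limit : "(unset)");
    std::printf("OMP_NUM_THREADS=%s\n", num ? num : "(unset)");
    std::printf("runtime thread limit: %d\n", effectiveThreadLimit());
#ifdef _OPENMP
    std::printf("max threads for next parallel region: %d\n",
                omp_get_max_threads());
#endif
}
```

Built with `-fopenmp` and run in the shell where the variables are set, a reported limit of 1 would suggest that OpenMP code launched from that shell is held to a single thread.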
Hi,
as tesseract 4.00-dev was uploaded to Debian (why is another question), gImageReader was rebuilt against it. Now recognition is about 10 times (!) slower than with tesseract 3.05. Using the command line tesseract does not show this slowdown.
Do you have any idea what goes wrong here? I'll happily test patches :-)
Best,
Philip