Multiprocess 4.00.00alpha way slower than 3.03 #898
Comments
The behavior you describe was expected; see this previous discussion. You can build Tesseract 4.x without multithreading by disabling OpenMP at build time. It would be good to have a runtime option to disable multithreading (or to set the number of threads).
Thanks a lot, it worked! I'll benchmark 3.x vs 4.x in my case to see whether 4.x is worth using. I don't need multiprocessing all the time, so is it possible to build 4.x with multithreading and to specify a flag that disables OpenMP just before running Tesseract?
No.
"No" is the correct answer, but the whole story is a little bit more complicated. Here is the related Tesseract code:
Some of those statements use a fixed number of threads (10, kNumThreads = 4, 4), while others use a calculated value. In addition, there is code which generates the threads conditionally. There is also a Tesseract parameter named |
The parameter * Ray now calls the legacy engine "dead code". |
Thanks, I'm closing this issue.
@a455bcd9, it would be nice if you could publish your final benchmark results here as soon as they are available.
@stweil OK! By the way, I thought |
In the meantime I compared Tesseract 4 with and without OpenMP. My test results suggest that mass production should not use OpenMP:
While the total (wall-clock) time is shorter with multithreaded code, the user (CPU) time is much worse.
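The benchmark figures themselves are not preserved above. As a rough sketch of how such a comparison could be reproduced, the helper below runs a command and reports both the wall-clock time and the user CPU time consumed by the child process. The function name `time_command` and the file names in the usage comment are hypothetical; only the standard library is used.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, env=None):
    """Run cmd and return (wall_seconds, child_user_cpu_seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user accumulates the user CPU time of waited-for child
    # processes, summed over all of their threads (always 0 on Windows).
    return wall, after.children_user - before.children_user


# Hypothetical usage, assuming a tesseract binary and a page.tif file exist:
#   wall, user = time_command(["tesseract", "page.tif", "out"])
#   print(f"wall={wall:.2f}s user={user:.2f}s")
```

If the user time is much larger than the wall time, the process kept several cores busy at once, which is the cost that matters when many documents are processed in parallel.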
Actually, the answer is yes :-)
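The command that accompanied this reply is not preserved here. One plausible runtime switch, stated as an assumption rather than as what the commenter posted, is the standard OpenMP environment variable `OMP_THREAD_LIMIT`, which caps the number of threads an OpenMP program may use. A minimal sketch that applies it to a child process only (the helper names are hypothetical):

```python
import os
import subprocess
import sys


def single_thread_env(base_env=None, limit=1):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base_env is None else base_env)
    # OMP_THREAD_LIMIT is defined by the OpenMP specification; whether a
    # given Tesseract build honors it depends on it being built with OpenMP.
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def run_limited(cmd, limit=1):
    """Run cmd with the OpenMP thread limit applied to the child only."""
    return subprocess.run(cmd, env=single_thread_env(limit=limit), check=True)


# Hypothetical usage, assuming a tesseract binary and a page.tif file exist:
#   run_limited(["tesseract", "page.tif", "out"])
```

Setting the variable only in the child's environment leaves the parent process and other programs unaffected.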
@stweil Can you elaborate on this?
Simply don't use Tesseract 4 with OpenMP unless you are sure that it helps in your case.
Same here. After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty CLI solution for running on a single thread. Works wonderfully for me.
If you want to check that you actually are running on one thread, type:
Then run gImageReader:
Et voilà :o)
Hi,
I need to do OCR on a lot of multipage TIF documents. After reading #263 (comment) I decided to run several Tesseract processes in parallel.
With Tesseract 3.03, OCR speed increases linearly (more or less) with the number of processes. However, with 4.00.00alpha all processes are blocked at the first page, and it seems to take an infinitely long time to process this first page. If I manually pause a process, the others are able to resume processing.
The problem seems to be caused by the fact that v4.00 uses up to 4 CPUs to process a multipage TIF (one is saturated and the other 3 are used at about 25%). So if you run 4 processes in parallel on a 4-CPU machine, they're stuck. That's also why launching two processes in parallel on an 8-CPU machine is OK but launching 8 is infinitely slow.
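The arithmetic behind this observation can be sketched as follows. This is a rough model only, assuming each process may start up to 4 threads as reported above; the function names are hypothetical and nothing here describes how Tesseract or the OS scheduler actually behaves.

```python
def is_oversubscribed(processes, cpu_count, threads_per_process=4):
    """True if the total thread count exceeds the number of CPUs."""
    return processes * threads_per_process > cpu_count


def max_parallel_processes(cpu_count, threads_per_process=4):
    """Largest process count that keeps threads <= CPUs (at least 1)."""
    return max(1, cpu_count // threads_per_process)


# The three cases from the report:
print(is_oversubscribed(4, 4))  # 16 threads on 4 CPUs
print(is_oversubscribed(2, 8))  # 8 threads on 8 CPUs
print(is_oversubscribed(8, 8))  # 64 threads on 8 CPUs
```

Under this model the two stuck configurations are oversubscribed and the working one is not, which is consistent with the behavior described in the report.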
I got the same problem on Ubuntu 14.04.5 LTS and Amazon Linux AMI 2016.09.
Is this a bug in the alpha version? Or is it a feature meant to speed up the processing of multipage TIFF images?
Thanks for any help you can provide.
Tesseract 3.05.00 (2ca5d0a) is OK