Multiprocess 4.00.00alpha way slower than 3.03 #898

a455bcd9 · 2017-05-09T20:42:46Z

Hi,

I need to do OCR on a lot of multipage TIF documents. After reading #263 (comment) I decided to run several Tesseract processes in parallel.

With tesseract 3.03, OCR speed increases linearly (more or less) with the number of processes. However, with 4.00.00alpha all processes are blocked at the first page and it seems to take an infinitely long time to process this first page. If I manually pause a process, others are able to resume processing.

The problem seems to be caused by the fact that v4.00 uses up to 4 CPUs to process a multipage TIF (one is saturated and the other 3 are used at about 25%). So if you run 4 processes in parallel on a 4-CPU machine, they're stuck. That's also why launching two processes in parallel on an 8-CPU machine is OK but launching 8 is infinitely slow.

I got the same problem on Ubuntu 14.04.5 LTS and Amazon Linux AMI 2016.09.

Is it a bug in the alpha version? Or is it a feature meant to speed up the processing of multipage TIFF images?

Thanks for any help you can provide.

tesseract 3.05.00 ( 2ca5d0a ) is OK

stweil · 2017-05-09T21:00:07Z

The behavior which you describe was expected, see this previous discussion.

You can build Tesseract 4.x without multithreading by using configure --disable-openmp. That will improve your case, but I expect that it still will be slower than 3.x because Tesseract 4.x needs more processing time.

It would be good to have a runtime option to disable multithreading (or set the number of threads).

a455bcd9 · 2017-05-09T21:21:34Z

Thanks a lot, it worked! I'll benchmark 3.x vs 4.x in my case to see whether 4.x is worth using.

I don't need multiprocessing all the time, so is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

amitdo · 2017-05-09T21:39:04Z

> is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

No.

> It would be good to have a runtime option to disable multithreading (or set the number of threads).

stweil · 2017-05-10T06:24:57Z

> is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

"No" is the correct answer, but the whole story is a little bit more complicated. Here is the related Tesseract code:

ccmain/par_control.cpp:#pragma omp parallel for num_threads(10)
lstm/fullyconnected.cpp:#pragma omp parallel for num_threads(kNumThreads)
lstm/fullyconnected.cpp:#pragma omp parallel for num_threads(kNumThreads)
lstm/lstm.cpp:#pragma omp parallel for num_threads(GFS) if (!Is2D())
lstm/parallel.cpp:#pragma omp parallel for num_threads(stack_size)
lstm/parallel.cpp:#pragma omp parallel for num_threads(stack_size)
lstm/weightmatrix.cpp:#pragma omp parallel for num_threads(4) if (in_parallel)

Some of those statements use a fixed number of threads (10, kNumThreads = 4, 4), while others use a calculated value. In addition, there is code which generates the threads conditionally. There is also a Tesseract parameter named tessedit_parallelize which controls use of multithreading. By default it is set to 0 which means no multithreading for those parts of the code. So the more complete answer would be: No, you cannot disable OpenMP just before running Tesseract, but you can enable additional use of OpenMP by setting the parameter tessedit_parallelize.
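For readers who want to tabulate the pragma list above, here is a minimal Python sketch. It only parses the seven grep lines quoted in this comment. The names `KNOWN_CONSTANTS` and `classify` are illustrative and not part of Tesseract, and any thread count whose value is not stated in this thread is left as `None` rather than guessed.

```python
import re

# grep output quoted in the comment above (path:pragma).
GREP_OUTPUT = """\
ccmain/par_control.cpp:#pragma omp parallel for num_threads(10)
lstm/fullyconnected.cpp:#pragma omp parallel for num_threads(kNumThreads)
lstm/fullyconnected.cpp:#pragma omp parallel for num_threads(kNumThreads)
lstm/lstm.cpp:#pragma omp parallel for num_threads(GFS) if (!Is2D())
lstm/parallel.cpp:#pragma omp parallel for num_threads(stack_size)
lstm/parallel.cpp:#pragma omp parallel for num_threads(stack_size)
lstm/weightmatrix.cpp:#pragma omp parallel for num_threads(4) if (in_parallel)
"""

# Only constants whose value is stated in this thread (kNumThreads = 4).
KNOWN_CONSTANTS = {"kNumThreads": 4}

PRAGMA_RE = re.compile(
    r"^(?P<path>[^:]+):#pragma omp parallel for "
    r"num_threads\((?P<arg>[^)]*)\)(?P<rest>.*)$"
)


def classify(line):
    """Return a dict describing one grep line, or None if it does not match."""
    match = PRAGMA_RE.match(line)
    if match is None:
        return None
    arg = match.group("arg").strip()
    if arg.isdigit():
        threads = int(arg)                  # literal thread count
    else:
        threads = KNOWN_CONSTANTS.get(arg)  # None if not stated in the thread
    return {
        "path": match.group("path"),
        "arg": arg,
        "threads": threads,
        # True when the pragma also carries an `if (...)` clause.
        "conditional": match.group("rest").strip().startswith("if"),
    }


if __name__ == "__main__":
    for line in GREP_OUTPUT.splitlines():
        print(classify(line))
```

Every row carries an explicit `num_threads(...)` clause, which is the detail that matters for the environment-variable discussion later in the thread.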

amitdo · 2017-05-10T10:54:26Z

The parameter tessedit_parallelize is used only with the legacy engine*. The new LSTM engine does not use it.

* Ray now calls the legacy engine "dead code".

a455bcd9 · 2017-05-10T13:58:24Z

Thanks, I'm closing this issue.

stweil · 2017-05-10T14:41:17Z

@a455bcd9, it would be nice if you could publish your final benchmark results here as soon as they are available.

a455bcd9 · 2017-05-10T16:59:28Z

@stweil OK!

By the way, I thought OMP_NUM_THREADS=1 tesseract ... would disable multithreading, but it doesn't seem to change anything. Is that expected?

stweil · 2017-05-10T17:09:43Z

OMP_NUM_THREADS specifies the default number of threads. The Tesseract code never uses that default because all omp parallel statements specify an explicit num_threads clause.

stweil · 2017-05-25T21:11:06Z

In the meantime I compared Tesseract 4 with and without OpenMP. My test results suggest that mass production should not use OpenMP:

# tesseract 0604.jp2 /tmp/0604 # default = with OpenMP
real	1m44,390s
user	4m57,656s
sys	0m1,352s

# tesseract 0604.jp2 /tmp/0604 # without OpenMP
real	2m54,469s
user	2m54,160s
sys	0m0,304s

While the elapsed (real) time is shorter with multithreaded code, the user (CPU) time is much higher. Therefore I'd expect that it is better to run large OCR jobs with one non-threaded Tesseract process per CPU.
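The trade-off in those numbers can be worked out explicitly. The following Python sketch uses only the `time` values quoted above; the helper names `parse_time` and `summarize` are illustrative, and the conclusion about batch throughput assumes that a large job keeps every core busy and scales roughly linearly, which this single measurement does not prove.

```python
import re


def parse_time(text):
    """Convert a `time` value such as '1m44,390s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[,.](\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    minutes, whole, frac = match.groups()
    return int(minutes) * 60 + float(f"{whole}.{frac}")


# Values copied from the two runs above.
WITH_OPENMP = {"real": "1m44,390s", "user": "4m57,656s", "sys": "0m1,352s"}
WITHOUT_OPENMP = {"real": "2m54,469s", "user": "2m54,160s", "sys": "0m0,304s"}


def summarize(run):
    """Return wall-clock seconds, CPU seconds, and average busy cores."""
    real = parse_time(run["real"])
    cpu = parse_time(run["user"]) + parse_time(run["sys"])
    return {"real": real, "cpu": cpu, "busy_cores": cpu / real}


with_omp = summarize(WITH_OPENMP)
without_omp = summarize(WITHOUT_OPENMP)

# How much faster one page finishes with OpenMP (wall clock).
wall_speedup = without_omp["real"] / with_omp["real"]
# How much more CPU time that same page costs with OpenMP.
cpu_overhead = with_omp["cpu"] / without_omp["cpu"]

if __name__ == "__main__":
    print(f"wall-clock speedup with OpenMP: {wall_speedup:.2f}x")
    print(f"CPU time overhead with OpenMP:  {cpu_overhead:.2f}x")
    print(f"average busy cores with OpenMP: {with_omp['busy_cores']:.2f}")
```

For this one image, OpenMP finishes about 1.67 times sooner but spends about 1.71 times as much CPU time, keeping roughly 2.9 cores busy on average. If a batch job already has enough pages to occupy every core, the CPU cost per page is the number that limits throughput.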

amitdo · 2017-07-13T21:05:26Z

@a455bcd9,

> is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

Actually, the answer is yes :-)

OMP_THREAD_LIMIT=1 tesseract...
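For batch jobs, the same variable can be set for each child process while several Tesseract processes run side by side. This is a minimal Python sketch, not a tested production runner: `single_thread_env`, `build_command`, and `ocr_many` are hypothetical helper names, the command uses only the basic `tesseract imagename outputbase` form, and one worker per CPU is an assumption taken from the suggestion earlier in this thread.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def single_thread_env(base=None):
    """Copy an environment and cap OpenMP at one thread per process."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def build_command(image_path, output_base, executable="tesseract"):
    """Build the argument list for `tesseract imagename outputbase`."""
    return [executable, str(image_path), str(output_base)]


def ocr_many(jobs, workers=None, executable="tesseract"):
    """Run one Tesseract process per job, at most `workers` at a time.

    `jobs` is a list of (image_path, output_base) pairs.
    Returns the list of process return codes, in job order.
    """
    workers = workers or os.cpu_count() or 1
    env = single_thread_env()

    def run(job):
        image_path, output_base = job
        command = build_command(image_path, output_base, executable)
        return subprocess.run(command, env=env).returncode

    # Python threads are used only to wait on the child processes;
    # the OCR work itself happens in the separate tesseract processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, jobs))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    codes = ocr_many([("page1.tif", "out/page1"), ("page2.tif", "out/page2")])
    print(codes)
```

The environment is copied rather than modified in place, so the calling program's own OpenMP settings are left untouched.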

tesseract-ocr/tesseract#898

sirius0503 · 2019-09-23T07:18:13Z

> In the mean time I did compare Tesseract 4 with and without OpenMP. My test result suggests that mass production should not use OpenMP:
>
> # tesseract 0604.jp2 /tmp/0604 # default = with OpenMP
> real	1m44,390s
> user	4m57,656s
> sys	0m1,352s
>
> # tesseract 0604.jp2 /tmp/0604 # without OpenMP
> real	2m54,469s
> user	2m54,160s
> sys	0m0,304s
>
> While the total time is shorter with multithreaded code, the user time is much worse.
> Therefore I'd expect that it is better to run large OCR jobs with one non threaded
> Tesseract process per CPU.

@stweil Can you elaborate on this?

stweil · 2019-09-23T08:49:19Z

Simply don't use Tesseract 4 with OpenMP unless you are sure that it helps in your case.

sirius0503 · 2019-09-23T09:22:47Z

@stweil: Using OMP_THREAD_LIMIT=1 seems to be the solution, as suggested by @amitdo

tesseract-ocr/tesseract#898

Freredaran · 2023-02-18T16:57:42Z

> @stweil: Using OMP_THREAD_LIMIT=1 seems to be the solution, as suggested by @amitdo

Same here. After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty CLI solution for running on a single thread. Works wonderfully for me.
In a terminal, type:

`export OMP_THREAD_LIMIT=1`

If you want to check that the variable is actually set in the current shell, type:

`echo $OMP_THREAD_LIMIT`

Then run gImageReader from the same terminal:

`gimagereader-gtk`

Et voilà :o)

a455bcd9 closed this as completed May 10, 2017

stweil mentioned this issue May 25, 2017

RFC: Tesseract performance #943

Open

amitdo mentioned this issue Jul 3, 2017

Tesseract Multiple-Threading Issue #1019

Closed

amitdo mentioned this issue Jul 19, 2017

Tesseract 4 cannot use anything other than --oem 0 #1043

Closed

This was referenced Nov 14, 2017

[Enhancement]: enable parallel calls to the tesseract CLI openpaperwork/pyocr#83

Closed

[Enhancement]: enable parallel calls to tesseract cli Sqooba/pyocr#1

Merged

This was referenced Jan 11, 2018

Slow OCR with tesseract 4.00alpha manisandro/gImageReader#285

Closed

Significant speed drop on Tesseract4 vs 3 with identical image #1278

Closed

raffopazzo mentioned this issue Jun 11, 2018

good accuracy but too slow, how to improve Tesseract speed #263

Closed

adamhooper added a commit to overview/pdfocr that referenced this issue Jan 25, 2019

OMP_THREAD_LIMIT=1

0f3f597

tesseract-ocr/tesseract#898

ashipunov mentioned this issue Feb 4, 2019

Multiple jobs do not work with Tesseract 4 jwilk-archive/ocrodjvu#31

Open

ripefig mentioned this issue Aug 12, 2019

Tesseract 4 and 5 is about 100-150 times slower than 3 on my Linux system. #2611

Closed

amitdo added the OpenMP label May 14, 2020

bertsky mentioned this issue Jun 6, 2021

Disable OpenMP tesseract-ocr/tesstrain#259

Open

DanOlson added a commit to Minitex/mdl_search that referenced this issue Nov 5, 2021

Tune Tesseract & Sidekiq to (hopefully) prevent locks

0c5a780

tesseract-ocr/tesseract#898

DanOlson added a commit to Minitex/mdl_search that referenced this issue Dec 19, 2021

Tune Tesseract & Sidekiq to (hopefully) prevent locks

96df0f9

tesseract-ocr/tesseract#898

XueSheng-GIT mentioned this issue Aug 23, 2023

High CPU usage on multicore nextcloud/files_fulltextsearch_tesseract#61

Open
