Disable OpenMP #259
Comments
Yes, I get similar results: OpenMP uses a lot of resources and increases the time required for training. Note that training currently requires up to 4 threads even when Tesseract was built without OpenMP.
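One way to verify that thread count on a running training process (a sketch, assuming Linux procps tools and a single running lstmtraining instance):

```sh
# Report the kernel thread count (NLWP = number of light-weight
# processes) of the running lstmtraining process.
ps -o nlwp= -p "$(pgrep -x lstmtraining)"
```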
Because nobody changed that? I wonder whether we should remove all build instructions from the …
The CMake build has disabled OpenMP by default for a long time...
Interesting! So that's why I get occasional 200% CPU utilization.
I see. Yes, probably better to have all that in Tesseract's makefile only. I'm not fluent in Autotools, so how do we change this?
Indeed, git annotate says this was from 2 years ago, so I guess the respective statement in the release notes …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I updated the 4.1.0 release notes.
The Autotools build in release 5.0.0 still enabled OpenMP by default. Should that be changed? Is this a "bug fix" which can be done in release 5.0.1 (and maybe also in release 4.1.4)?
By default tesstrain builds vanilla tesseract / lstmtraining, which IINM (if I'm not mistaken) links against OpenMP; this can be checked as sketched below.
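A quick check of whether a given binary actually links OpenMP (a sketch, assuming a dynamically linked build using GNU libgomp or LLVM libomp):

```sh
# List the dynamic library dependencies of lstmtraining; a hit for
# libgomp (GCC) or libomp (LLVM) indicates an OpenMP-enabled build.
ldd "$(command -v lstmtraining)" | grep -i -e gomp -e libomp
```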
I know @stweil argued repeatedly for disabling OpenMP for prediction in the mass production / batch scenario, e.g. here.
However, the case for training seems to be a different one: normally we want to get a single model on a lot of data as fast as possible.
In this comment @theraysmith presented measurements showing 3.5x speedup with 4 threads.
However, I cannot reproduce that. On the contrary: on a VM with (4 cores of) an Intel Xeon Gold 5218 CPU @ 2.30GHz and a finetuning job with 500 lines, I see an 8x increase in training time, despite using more than 300% CPU instead of just 100% when single-threaded. (Yes, I do get similar results in both cases.) At 8x the wall-clock time and 3x the CPU usage, that is a 24x worse utilization of operational resources!
(I have repeated that experiment on a finetuning job with 1200 lines, where the multithreaded run takes 2x as long.)
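Such a comparison can be reproduced along these lines (a sketch; the model, list and output file names and the iteration count are placeholders, and GNU time is assumed for the resource report):

```sh
# Run the same finetuning job twice: once capped to a single OpenMP
# thread, once unconstrained. GNU time -v reports wall-clock time
# and the percentage of CPU used.
ARGS="--continue_from eng.lstm --traineddata eng.traineddata \
      --train_listfile list.train --model_output finetuned --max_iterations 400"
OMP_THREAD_LIMIT=1 /usr/bin/time -v lstmtraining $ARGS
/usr/bin/time -v lstmtraining $ARGS
```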
Also, there seems to be a significant difference between lstmtraining built with `--disable-openmp` on the one side, and lstmtraining built with OpenMP but run with `OMP_THREAD_LIMIT=1` on the other: the latter is even worse than OpenMP running unconstrained. Also, it still takes more than one core (alternating between 100% CPU utilization mostly and 200% intermittently).

Can anyone confirm this? Has something changed significantly in Tesseract's threaded code base since Ray's time, or is this simply due to my virtualization environment?
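For reference, the OpenMP-free variant used in this comparison can be built with the Autotools flag named above (a sketch; prerequisites such as Leptonica are omitted):

```sh
# Configure and build Tesseract and its training tools without OpenMP.
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
./autogen.sh
./configure --disable-openmp
make
make training   # builds lstmtraining and the other training tools
```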
Moreover, @stweil has already pointed out that OpenMP prevents reproducible models. For all the effort that has already gone into the latter (replacing `shuf` etc., sorting files), why is OpenMP still in, then? Why don't we build with `--disable-openmp` by default, or at least set `OMP_THREAD_LIMIT=1` in the lstmtraining recipes?
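Until such a default exists, the limit can also be applied per run when driving training through tesstrain's Makefile (a sketch; the MODEL_NAME value is a placeholder):

```sh
# The environment variable is inherited by the lstmtraining child
# processes, so this caps OpenMP to a single thread for the whole run.
OMP_THREAD_LIMIT=1 make training MODEL_NAME=foo
```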