RFC: Tesseract performance #943
Comments
Yes, I've done quite a lot of performance vs. accuracy testing.
Memory new/delete:
The network system uses a rather complex scratchpad mechanism to minimize
the number of allocations/deallocations. It helped a lot with both speed
and peak memory.
I'd be curious to know where the current bottleneck is, if you have
specifics of the stack where the allocations take place.
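(For readers unfamiliar with the pattern: a scratchpad keeps previously allocated buffers around for reuse instead of freeing them, so the inner recognition loops avoid repeated new/delete. A minimal, generic buffer-pool sketch follows; the class and method names are illustrative and this is not Tesseract's actual scratch code.)

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Minimal buffer-pool ("scratchpad") sketch. Callers borrow a buffer,
// use it, and return it, avoiding repeated heap allocations.
class Scratchpad {
 public:
  // Borrow a buffer with at least `size` elements, reusing a returned one if available.
  std::vector<float>* Borrow(std::size_t size) {
    if (!free_.empty()) {
      std::vector<float>* buf = free_.back();
      free_.pop_back();
      buf->resize(size);
      return buf;
    }
    owned_.push_back(std::make_unique<std::vector<float>>(size));
    return owned_.back().get();
  }

  // Hand a buffer back to the pool instead of freeing it.
  void Return(std::vector<float>* buf) { free_.push_back(buf); }

 private:
  std::vector<std::unique_ptr<std::vector<float>>> owned_;  // owns all buffers
  std::vector<std::vector<float>*> free_;                   // currently unused buffers
};
```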
UnicharIdArrayUtils:
Hmm. Looks like it may be doing too many id_to_unichar() calls. Again,
callers/stack traces would be useful.
snprintf:
Really? Where from? Are you running with some debug flags on?
DotProduct:
Float vs double:
- Running a large dot product strictly in float (i.e. float += (float =
float * float)) is an unwise thing to do. (Ahem, no comment on any other NN
libraries that do that.)
- I found significant accuracy impacts using float in the multiply-add
accumulation, and neither SSE nor AVX provides a double = float x float
operation (analogous to the 32 = 16x16 integer instruction in SSE), which
is what is really needed.
- The SSE/AVX float->double cast is extremely slow - slower than reading
a double from memory.
The code is therefore optimized to minimize the memory bandwidth, and the
number of float->double conversions, while using double += double*double in
the AVX code.
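(As an illustration of the trade-off described above, float storage to save memory bandwidth combined with double accumulation for accuracy, a plain scalar sketch might look like this; it is not the actual SSE/AVX implementation.)

```cpp
// Scalar sketch only: vectors are stored as float to halve memory traffic,
// but every multiply-add is carried out in double so the accumulator does
// not lose precision to float intermediates.
double DotProductFloatInputs(const float* u, const float* v, int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) {
    // One float->double conversion per operand; the vectorized code tries
    // to keep the number of such conversions (and memory reads) low.
    total += static_cast<double>(u[i]) * static_cast<double>(v[i]);
  }
  return total;
}
```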
(Since writing the AVX dot product, I thought of a better way of doing it,
but haven't had time to implement that yet.)
In any case, the time savings are small, a factor of <2.
A good integer implementation may squeeze better out of it, but I haven't
seen it yet. OTOH, I haven't tried it lately.
AVX2 and AVX512 extend the integer operations beyond the ones available on
SSE (not on base AVX) and may make it worth it, when I get a machine with
AVX512.
The additional benefit of (8 bit) integer is that it reduces the size of
everything, making it more likely that data will stay in a higher-level
cache.
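(A scalar sketch of what an 8-bit integer dot product with 32-bit accumulation looks like, written without intrinsics; it assumes vector lengths small enough that the 32-bit accumulator cannot overflow.)

```cpp
#include <cstdint>

// Scalar sketch of an 8-bit dot product: each product fits in 16 bits and is
// accumulated in 32 bits, the widening pattern that the SIMD integer
// instructions mentioned above provide in hardware.
int32_t DotProductInt8(const int8_t* u, const int8_t* v, int n) {
  int32_t total = 0;
  for (int i = 0; i < n; ++i) {
    total += static_cast<int32_t>(u[i]) * static_cast<int32_t>(v[i]);
  }
  return total;
}
```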
*Far greater performance improvements can be made by making the network
smaller.*
As I already indicated, I have had some very good results in this area,
with a network 3x faster than the legacy code (for English) and *much*
faster than the legacy code for complex scripts.
I have since messed up the synthetic data pipeline and training while trying
to improve Indic, but I have major improvements in some other languages, so
when I get back to the best results for everything, I think you'll like it.
BTW, I get a significant speed-up in wall time (better than 2x) with the
OpenMP code running. Did you compile and link it correctly with OpenMP?
I have noticed that the CPU profile with OpenMP running is of little
practical use.
…On Mon, May 22, 2017 at 1:38 AM, Stefan Weil wrote:
Performance is important for real time OCR, mass production OCR and
training.
In this RFC I'd like to discuss performance bottlenecks and potential
improvements.
See also the Tesseract wiki:
<https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance>
According to my tests with Valgrind's tool callgrind, these functions
have the largest computational costs (ordered by decreasing time):
- memory allocations / deallocations
- tesseract::UnicharIdArrayUtils::compare
- tesseract::DotProductAVX (or the other implementations of the dot
product)
- vfprintf (called from snprintf)
tesseract::UnicharIdArrayUtils::compare and memory allocations /
deallocations are also the functions which are called most often.
I recently had a closer look at the dot product calculations and noticed
that at least some input vectors are converted from float to double
(which takes time). The dot product is always done with double values
(more expensive than float). If memory bandwidth is the limiting factor,
using double means doubled time compared with float. The current code
uses 4 parallel threads. I have run some timing tests without that
parallelization and got nearly the same execution time. @theraysmith
<https://github.com/theraysmith>, did you try using float for the dot
product, and do you get better performance from parallelization in that
part of the OCR process?
--
Ray.
Valgrind shows all callers, so I can provide that information (next week, as I'm currently busy with other things). Regarding precision of the dot product: the addition is the critical part for the accuracy. Did you ever try some of the algorithms which help to improve that part, e.g. Kahan? Maybe that would be sufficient to allow using float. I used OpenMP and disabled it only in …
With OpenMP disabled for the parts mentioned above, OpenMP still works; the real time increases moderately while the user time decreases:
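(For readers unfamiliar with it, Kahan compensated summation keeps a small correction term that recovers the low-order bits lost in each addition, which can make a single-precision accumulator much more accurate. A minimal sketch, illustrative only and not code from Tesseract:)

```cpp
// Kahan-compensated dot product in single precision: `c` captures the
// low-order bits lost in each addition and feeds them back into the next one.
float KahanDotProduct(const float* u, const float* v, int n) {
  float sum = 0.0f;
  float c = 0.0f;  // running compensation for lost low-order bits
  for (int i = 0; i < n; ++i) {
    float y = u[i] * v[i] - c;
    float t = sum + y;   // low-order bits of y are lost in this addition...
    c = (t - sum) - y;   // ...and recovered here for the next iteration
    sum = t;
  }
  return sum;
}
```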
Stack for tesseract::UnicharIdArrayUtils::compare (called 2651872 times for a small hello world image):
Is this for --oem 0 or --oem 1? I thought that Unicharambigs is not used by the LSTM engine...
It is for LSTM:
The output is a large file called something like …
Is that with …?
No, it was built with …
My first results reported above were from a very small image (single line hello world), so initialization contributes significantly. With a really large image (newspaper), the result changes. See also issue #898.
Unrolling the loop in …
Latest numbers based on code from FAU Erlangen (thanks to @moebiusband73), Kahan part removed:
That is an improvement of 12%. Using assembler code could improve it further, but I'd expect the largest improvement from using …
The conversation here is largely over my head, but I came to the bug tracker to discuss performance in 4.0, and this bug is titled "RFC: Tesseract Performance", so it seems like the right place. (Apologies if I'm wrong.) Simple question: on the "Neural nets In Tesseract" wiki page, it says:
…
But above (and elsewhere) it says:
…
I do a lot of batch OCR using Tesseract. Can I expect 4.0 to be faster by 3× or slower by 7×? If the latter, that'll be something that we'll need to plan on, since our current infrastructure took over a month to complete the last batch OCR job using 3.02. If we need 7× more servers that's a huge deal — I don't know that we'd ever upgrade if that was the case. If it's 3× faster, that's incredible. (Sorry again if this is the wrong place to bring this up. I'm trying to get a grasp on this situation. Thank you all for all your great work.)
The current 4.0 still supports the "old" OCR recognizer and is comparable to 3.05 if that is used (command line argument --oem 0). 4.0 also supports a new recognizer based on LSTM (--oem 1). Currently you won't reduce the infrastructure requirements with 4.0.
That sounds promising, thanks @stweil. I'm perfectly happy not reducing infrastructure requirements, but increasing them 7× would be a very big deal. FWIW, decreasing wall time via parallelization while increasing CPU time sounds good on paper, but unless I'm missing something (totally possible), it doesn't mean much for batch jobs like ours since we run 24 OCR processes in parallel already. Our server looks like this when doing a job:
For your use case (which is similar to my own) I'd compile Tesseract without OpenMP support. Otherwise the parallelization will consume a significant part of the CPU capacity.
What's the method you use to disable OpenMP? Commenting …?
Please see #961 (comment) for another example of a decrease in user time with --disable-openmp.
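(For context on the wall-time vs. user-time numbers above, here is a generic sketch of an OpenMP reduction loop, illustrative only and not taken from Tesseract's sources. Compiled with OpenMP, the pragma spreads the iterations over several threads, which lowers wall time but raises total CPU (user) time because of scheduling overhead; compiled without OpenMP, e.g. after configuring with --disable-openmp, the pragma is ignored and the loop runs on a single core.)

```cpp
// Generic illustration of an OpenMP reduction loop. With OpenMP enabled at
// build time the iterations run on several threads; without it the pragma
// is silently ignored and the function is purely sequential.
double SumOfProducts(const double* u, const double* v, int n) {
  double total = 0.0;
  #pragma omp parallel for reduction(+:total)
  for (int i = 0; i < n; ++i) {
    total += u[i] * v[i];
  }
  return total;
}
```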
With some of NVIDIA's cards it's now possible to run the dot product on fp16 vectors. Google's TPUs also have something similar, with bfloat16 instead of the standard fp16.
Simply put, the LSTM model (oem=1) only produces faster execution times than the legacy model (oem=0) if the number of images is less than (machine maximum threads / 4). Example: …
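(A hypothetical illustration of that arithmetic, with assumed numbers: on a machine with 24 hardware threads, where OpenMP gives each LSTM page roughly 4 threads, only about 24 / 4 = 6 images can be recognized concurrently, whereas the single-threaded legacy engine can work on 24 images at once.)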
Please don't remove this feature. The LSTM model needs a large number of threads (24 threads) just to process a small PDF (6 pages). Currently, the legacy model is the best solution for this job. BTW, what is the usage scenario for Tesseract 4 and 5 without the legacy model, if it turns out to be significantly more time-consuming in mass processing?
@Mark-Joy, that's only true if Tesseract is running with OpenMP multithreading enabled. I always suggest disabling multithreading because OpenMP wastes a lot of CPU resources. Then LSTM is still faster than legacy mode, although it uses only a single CPU thread per page. On our OCR servers mass processing runs with 48 or 64 CPU threads which process the same number of page images simultaneously. And of course they use our self-trained LSTM models, not legacy models.
@stweil, thank you for the information. On my machine, only … With traineddata that supports both … The worst ones are the …
for the dot product, and do you get better performance from parallelization in that part of the OCR process?The text was updated successfully, but these errors were encountered: