RFC: Tesseract performance #943
Comments
Yes, I've done quite a lot of performance vs. accuracy testing.
Memory new/delete:
The network system uses a rather complex scratchpad mechanism to minimize
the number of allocations/deallocations. It helped a lot with both speed
and peak memory.
I'd be curious to know where the current bottleneck is, if you have
specifics of the stack where the allocations take place.
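(For readers unfamiliar with the pattern: a scratchpad keeps previously allocated buffers around for reuse instead of freeing them, so the inner recognition loops avoid repeated new/delete. A minimal, generic buffer-pool sketch follows; the class and method names are illustrative and this is not Tesseract's actual scratch code.)

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Minimal buffer-pool ("scratchpad") sketch. Callers borrow a buffer,
// use it, and return it, avoiding repeated heap allocations.
class Scratchpad {
 public:
  // Borrow a buffer with at least `size` elements, reusing a returned one if available.
  std::vector<float>* Borrow(std::size_t size) {
    if (!free_.empty()) {
      std::vector<float>* buf = free_.back();
      free_.pop_back();
      buf->resize(size);
      return buf;
    }
    owned_.push_back(std::make_unique<std::vector<float>>(size));
    return owned_.back().get();
  }

  // Hand a buffer back to the pool instead of freeing it.
  void Return(std::vector<float>* buf) { free_.push_back(buf); }

 private:
  std::vector<std::unique_ptr<std::vector<float>>> owned_;  // owns all buffers
  std::vector<std::vector<float>*> free_;                   // currently unused buffers
};
```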
UnicharIdArrayUtils:
Hmm. Looks like it may be doing too many id_to_unichar() calls. Again,
callers/stack traces would be useful.
snprintf:
Really? Where from? Are you running with some debug flags on?
DotProduct:
Float vs double:
- Running a large dot product strictly in float (i.e. float += (float =
float * float)) is an unwise thing to do. (Ahem, no comment on any other NN
libraries that do that.)
- I found significant accuracy impacts using float in the multiply-add
accumulation, and neither SSE nor AVX provides a double = float x float
operation (analogous to the 32 = 16x16 integer instruction in SSE), which
is what is really needed.
- The SSE/AVX float->double cast is extremely slow - slower than reading
a double from memory.
The code is therefore optimized to minimize the memory bandwidth, and the
number of float->double conversions, while using double += double*double in
the AVX code.
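(As an illustration of the trade-off described above, float storage to save memory bandwidth combined with double accumulation for accuracy, a plain scalar sketch might look like this; it is not the actual SSE/AVX implementation.)

```cpp
// Scalar sketch only: vectors are stored as float to halve memory traffic,
// but every multiply-add is carried out in double so the accumulator does
// not lose precision to float intermediates.
double DotProductFloatInputs(const float* u, const float* v, int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) {
    // One float->double conversion per operand; the vectorized code tries
    // to keep the number of such conversions (and memory reads) low.
    total += static_cast<double>(u[i]) * static_cast<double>(v[i]);
  }
  return total;
}
```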
(Since writing the AVX dot product, I thought of a better way of doing it,
but haven't had time to implement that yet.)
In any case, the time savings are small, a factor of <2.
A good integer implementation may squeeze better out of it, but I haven't
seen it yet. OTOH, I haven't tried it lately.
AVX2 and AVX512 extend the integer operations beyond the ones available on
SSE (not on base AVX) and may make it worth it, when I get a machine with
AVX512.
The additional benefit of (8 bit) integer is that it reduces the size of
everything, making it more likely that data will stay in a higher-level
cache.
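(A scalar sketch of what an 8-bit integer dot product with 32-bit accumulation looks like, written without intrinsics; it assumes vector lengths small enough that the 32-bit accumulator cannot overflow.)

```cpp
#include <cstdint>

// Scalar sketch of an 8-bit dot product: each product fits in 16 bits and is
// accumulated in 32 bits, the widening pattern that the SIMD integer
// instructions mentioned above provide in hardware.
int32_t DotProductInt8(const int8_t* u, const int8_t* v, int n) {
  int32_t total = 0;
  for (int i = 0; i < n; ++i) {
    total += static_cast<int32_t>(u[i]) * static_cast<int32_t>(v[i]);
  }
  return total;
}
```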
*Far greater performance improvements can be made by making the network
smaller.*
As I already indicated, I have had some very good results in this area,
with a network 3x faster than the legacy code (for English) and *much*
faster than the legacy code for complex scripts.
I have since messed up the synthetic data pipeline and training while trying
to improve Indic, but I have major improvements in some other languages, so
when I get back to the best results for everything, I think you'll like it.
BTW, I get a significant speed-up in wall time (better than 2x) with the
OpenMP code running. Did you compile and link it correctly with OpenMP?
I have noticed that the CPU profile with OpenMP running is of little
practical use.
…On Mon, May 22, 2017 at 1:38 AM, Stefan Weil wrote:
Performance is important for real time OCR, mass production OCR and
training.
In this RFC I'd like to discuss performance bottlenecks and potential
improvements.
See also the Tesseract wiki:
<https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance>
According to my tests with Valgrind's tool callgrind, these functions
have the largest computational costs (ordered by decreasing time):
- memory allocations / deallocations
- tesseract::UnicharIdArrayUtils::compare
- tesseract::DotProductAVX (or the other implementations of the dot
product)
- vfprintf (called from snprintf)
tesseract::UnicharIdArrayUtils::compare and memory allocations /
deallocations are also the functions which are called most often.
I recently had a closer look at the dot product calculations and noticed
that at least some input vectors are converted from float to double
(which takes time). The dot product is always done with double values
(more expensive than float). If memory bandwidth is the limiting factor,
using double means doubled time compared with float. The current code
uses 4 parallel threads. I have run some timing tests without that
parallelization and got nearly the same execution time. @theraysmith
<https://github.com/theraysmith>, did you try using float for the dot
product, and do you get better performance from parallelization in that
part of the OCR process?
--
Ray.
Valgrind shows all callers, so I can provide that information (next week, as I'm currently busy with other things). Regarding precision of the dot product: the addition is the critical part for the accuracy. Did you ever try some of the algorithms which help to improve that part, e.g. Kahan? Maybe that would be sufficient to allow using float. I used OpenMP and disabled it only in …
With OpenMP disabled for the parts mentioned above, OpenMP still works; the real time increases moderately while the user time decreases:
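(For readers unfamiliar with it, Kahan compensated summation keeps a small correction term that recovers the low-order bits lost in each addition, which can make a single-precision accumulator much more accurate. A minimal sketch, illustrative only and not code from Tesseract:)

```cpp
// Kahan-compensated dot product in single precision: `c` captures the
// low-order bits lost in each addition and feeds them back into the next one.
float KahanDotProduct(const float* u, const float* v, int n) {
  float sum = 0.0f;
  float c = 0.0f;  // running compensation for lost low-order bits
  for (int i = 0; i < n; ++i) {
    float y = u[i] * v[i] - c;
    float t = sum + y;   // low-order bits of y are lost in this addition...
    c = (t - sum) - y;   // ...and recovered here for the next iteration
    sum = t;
  }
  return sum;
}
```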
Stack for tesseract::UnicharIdArrayUtils::compare (called 2651872 times for a small hello world image):
Is this for --oem 0 or --oem 1? I thought that Unicharambigs is not used by the LSTM engine...
It is for LSTM:
The output is a large file called something like …
Is that with …?
No, it was built with …
My first results reported above were from a very small image (single line hello world), so initialization contributes significantly. With a really large image (newspaper), the result changes. See also issue #898.
Unrolling the loop in …
Latest numbers based on code from FAU Erlangen (thanks to @moebiusband73), Kahan part removed:
That is an improvement of 12%. Using assembler code could improve it further, but I'd expect the largest improvement from using …
The conversation here is largely over my head, but I came to the bug tracker to discuss performance in 4.0, and this bug is titled "RFC: Tesseract Performance", so it seems like the right place. (Apologies if I'm wrong.) Simple question: on the "Neural nets In Tesseract" wiki page, it says:
…
But above (and elsewhere) it says:
…
I do a lot of batch OCR using Tesseract. Can I expect 4.0 to be faster by 3× or slower by 7×? If the latter, that'll be something that we'll need to plan on, since our current infrastructure took over a month to complete the last batch OCR job using 3.02. If we need 7× more servers that's a huge deal — I don't know that we'd ever upgrade if that was the case. If it's 3× faster, that's incredible. (Sorry again if this is the wrong place to bring this up. I'm trying to get a grasp on this situation. Thank you all for all your great work.)
The current 4.0 still supports the "old" OCR recognizer and is comparable to 3.05 if that is used (command line argument --oem 0). 4.0 also supports a new recognizer based on LSTM (--oem 1). Currently you won't reduce the infrastructure requirements with 4.0.
That sounds promising, thanks @stweil. I'm perfectly happy not reducing infrastructure requirements, but increasing them 7× would be a very big deal. FWIW, decreasing wall time via parallelization while increasing CPU time sounds good on paper, but unless I'm missing something (totally possible), it doesn't mean much for batch jobs like ours since we run 24 OCR processes in parallel already. Our server looks like this when doing a job:
For your use case (which is similar to my own) I'd compile Tesseract without OpenMP support. Otherwise the parallelization will consume a significant part of the CPU capacity.
What's the method you use to disable OpenMP? Commenting …?
Please see #961 (comment) for another example of a decrease in user time with --disable-openmp.
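(For context on the wall-time vs. user-time numbers above, here is a generic sketch of an OpenMP reduction loop, illustrative only and not taken from Tesseract's sources. Compiled with OpenMP, the pragma spreads the iterations over several threads, which lowers wall time but raises total CPU (user) time because of scheduling overhead; compiled without OpenMP, e.g. after configuring with --disable-openmp, the pragma is ignored and the loop runs on a single core.)

```cpp
// Generic illustration of an OpenMP reduction loop. With OpenMP enabled at
// build time the iterations run on several threads; without it the pragma
// is silently ignored and the function is purely sequential.
double SumOfProducts(const double* u, const double* v, int n) {
  double total = 0.0;
  #pragma omp parallel for reduction(+:total)
  for (int i = 0; i < n; ++i) {
    total += u[i] * v[i];
  }
  return total;
}
```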
With some of NVIDIA's cards it's now possible to run the dot product on fp16 vectors. Google's TPUs also have something similar, with bfloat16 instead of the standard fp16.
Simply put, the LSTM model (oem=1) only produces faster execution times than the legacy model (oem=0) if the number of images is less than (machine maximum threads / 4). Example: …
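(A hypothetical illustration of that arithmetic, with assumed numbers: on a machine with 24 hardware threads, where OpenMP gives each LSTM page roughly 4 threads, only about 24 / 4 = 6 images can be recognized concurrently, whereas the single-threaded legacy engine can work on 24 images at once.)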
Please don't remove this feature. The LSTM model needs a large number of threads (24 threads) just to process a small PDF (6 pages). Currently, the legacy model is the best solution for this job. BTW, what is the usage scenario for Tesseract 4 and 5 without the legacy model, if it turns out to be significantly more time-consuming in mass processing?
@Mark-Joy, that's only true if Tesseract is running with OpenMP multithreading enabled. I always suggest disabling multithreading because OpenMP wastes a lot of CPU resources. Then LSTM is still faster than legacy mode, although it uses only a single CPU thread per page. On our OCR servers mass processing runs with 48 or 64 CPU threads which process the same number of page images simultaneously. And of course they use our self-trained LSTM models, not legacy models.
@stweil, thank you for the information. On my machine, only … With traineddata that supports both … The worst ones are the …
for the dot product, and do you get better performance from parallelization in that part of the OCR process?The text was updated successfully, but these errors were encountered: